Tourism named entity identification method based on BBLC model

Named entity recognition and modeling technology, applied in the field of semantic recognition. It addresses the lack of feature marks between Chinese words, the inability to distinguish ambiguous uses of the same word, and the resulting difficulty of named entity recognition, and it achieves a high F value, good results, and strong generalization ability.

Active Publication Date: 2020-06-19
SHAANXI NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0004] (1) Entities are constantly updated and out-of-vocabulary words keep appearing, so they are difficult to enumerate with dictionaries. For example, it is unrealistic to put the names of all cities and all people into a dictionary, and the continuous emergence of new words over time makes named entity recognition even harder;
[0005] (2) There is no obvious feature mark between Chinese words, unlike English, where spaces and letter case serve as distinctions;
[0007](...



Examples


Embodiment

[0061] This embodiment uses BIO notation to mark entities. The BIO rules are: B marks the beginning of an entity, I marks the inside (continuation) of an entity, and O marks all other, non-entity words.
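The BIO rules above can be illustrated with a minimal Python sketch. It is not taken from the patent: the helper names `spans_to_bio` and `bio_to_spans`, the character-level tokenization, and the example sentence are assumptions made for illustration, and the `LOC` label is borrowed from the SIGHAN tag names rather than from the patent's own label set.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type); a hypothetical span format for this sketch
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO tag per token."""
    tags = ["O"] * n_tokens                 # O: non-entity by default
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # B: first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # I: remaining tokens of the entity
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Assumed example: one character per token, "西安" marked as a location.
tokens = list("我去西安旅游")
tags = spans_to_bio(len(tokens), [(2, 4, "LOC")])
print(list(zip(tokens, tags)))
```

In this sketch an `I-` tag that does not continue an open entity of the same type is simply dropped when decoding; other conventions treat it as the start of a new entity.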

[0062] The authoritative SIGHAN 2006 Bakeoff-3 corpus annotates only three entity types: PER, LOC, and ORG, which denote person names, location names, and organization names, respectively. Fully characterizing the entities in tourism data also requires time and other entity labels, so this embodiment crawls travel notes, travel guides, reviews, and other text from travel websites such as Ctrip and Mafengwo. After BIO marking of 15431 entities in five categories, a total of 13464 Chinese sentences are used as experimental data. The corresponding sets, labels, and entities are shown in Table 1:

[0063] Table 1 Entity label set

[0064]


Abstract

The invention discloses a tourism named entity recognition method based on a BBLC model. The method comprises the following steps: performing BIO marking on the statements in a corpus to obtain a BIO-marked set; inputting the BIO-marked set into a BERT pre-trained language model and outputting a vector representation of each word in a statement, namely the word embedding sequence of each statement; taking the word embedding sequence as the input at each time step of a bidirectional LSTM and performing further semantic encoding to obtain a statement feature matrix; and taking the statement feature matrix as the input of a CRF model, labeling and decoding statement x to obtain its word label sequence, outputting the probability that the label sequence of statement x equals y, solving for the optimal path with the dynamic-programming Viterbi algorithm, and outputting the label sequence with the maximum probability. By adding the BERT pre-trained language model, the method can obtain local context information; it achieves higher accuracy, recall, and F value, has stronger generalization ability and robustness, and overcomes the defects of traditional models.
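The final decoding step can be sketched as follows. This is a minimal pure-Python illustration of Viterbi decoding over per-position label scores and label-to-label transition scores, not the patent's implementation. The function name `viterbi`, the three-label set, and all score values are invented for the example; in the described method the position scores would come from the bidirectional LSTM and the transition scores from the trained CRF.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each label
    backptr: List[List[int]] = []       # backptr[t-1][j]: best previous label for j at t
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backptr.append(ptrs)
    # Trace back from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    path.reverse()
    return path


# Assumed toy scores (not from the patent).
labels = ["O", "B-LOC", "I-LOC"]
emissions = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 1.6],
    [0.3, 0.2, 2.0],
    [1.5, 0.1, 0.4],
]
transitions = [
    [0.0, 0.0, -10.0],  # from O: moving straight to I-LOC is heavily penalized
    [0.0, 0.0, 0.5],    # from B-LOC
    [0.0, 0.0, 0.5],    # from I-LOC
]
best_path = viterbi(emissions, transitions)
print([labels[k] for k in best_path])  # ['O', 'B-LOC', 'I-LOC', 'O']
```

With these toy scores, picking the best label independently at each position would choose I-LOC directly after O at the second position, which violates the BIO scheme. The transition penalty steers the joint decoding to B-LOC there instead, which is the benefit of decoding the whole sequence at once.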

Description

Technical field

[0001] The invention belongs to the technical field of semantic recognition and relates to a method for recognizing tourism named entities based on a BBLC model.

Background technique

[0002] With the rise of the tourism industry, the volume of tourism data has grown steadily. While this enriches the field, the complexity of acquiring information from massive data greatly reduces how efficiently people can obtain it. Obtaining more useful travel information in a short time has become an important demand for tourism in the era of big data. The large amount of structured information on existing travel websites is very convenient, but texts such as travel notes, travel guides, and reviews contain more information that better reflects users' preferences. Extracting useful information from unstructured text is therefore the focus of research, and its essence is to improve the efficiency of named entity recog...


Application Information

IPC(8): G06F40/295, G06N3/04
CPC: G06N3/044, G06N3/045
Inventors: 薛乐义, 曹菡, 李鹏
Owner: SHAANXI NORMAL UNIV