Text sequence annotating system and method based on Bi-LSTM (Bidirectional Long Short-Term Memory) and CRF (Conditional Random Field)

A text sequence labeling technology, applied in the information field, that addresses poor Chinese word segmentation, the high manpower and material cost of corpus labeling, and over-reliance on feature selection, thereby reducing the cost of manual labeling and improving efficiency.

Active Publication Date: 2018-01-23

武汉烽火普天信息技术有限公司

6 Cites, 60 Cited by


AI Technical Summary

Problems solved by technology

[0003] Current Chinese sequence labeling has the following four main problems in application. First, Chinese word segmentation does not perform well. Take the name "Wang Baoquan": if there is no name database or special processing (regularization or other grammatical processing), the name is segmented into "wang" and "baoquan".

Second, most current Chinese sequence tagging methods use traditional models such as the Hidden Markov Model (HMM) or the Conditional Random Field (CRF). Although adding a suitable lexicon gives acceptable results, the HMM is weak at describing the sequence as a whole, and the CRF relies too heavily on feature selection.

Fourth, each business and each field requires its own manually labeled corpus for model training. Because the model places particularly high demands on the training corpus, a large amount of accurately labeled data is required, which consumes substantial human and material resources.
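The first problem above can be illustrated with a minimal sketch. It assumes a greedy forward maximum-matching segmenter, which is a common dictionary-based technique and not necessarily the segmenter the patent has in mind. The toy lexicon, the example sentence, and the characters 王宝泉 (an assumed rendering of "Wang Baoquan") are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a lexicon (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon: the surname and given name are known words,
# but the full name is not.
base_lexicon = {"王", "宝泉", "今天", "上班"}
sentence = "王宝泉今天上班"

# Without a name entry the name is split into two tokens.
without_name = forward_max_match(sentence, base_lexicon)

# Adding the full name to the lexicon keeps it as one token.
with_name = forward_max_match(sentence, base_lexicon | {"王宝泉"})
```

Here `without_name` splits the name into two tokens, while `with_name` keeps it whole, which is why a dictionary-only approach depends on a name database or special handling.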


Embodiment Construction

[0034] To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the embodiments and accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0035] As shown in Figure 1, the Bi-LSTM and CRF-based text sequence labeling system of the present invention includes a learning module 1 and a labeling module 2. The learning module 1 inputs the acquired corpus into a preset learning model, adds corresponding prediction labels to the acquired corpus according to the sequence classification results output by the learning model, and uses the manual labels to minimize the loss function of the learning model so that the prediction labels fit the manual labels. The corpus provided to the label...
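The training objective described for the learning module can be sketched as follows. The text does not give the form of the loss function; this sketch assumes the negative log-likelihood commonly used with a CRF output layer, together with Viterbi decoding to obtain the prediction labels. The emission scores are hard-coded stand-ins for what a Bi-LSTM would output, and the label indices (0 = B, 1 = I, 2 = O) and all numbers are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def crf_neg_log_likelihood(emissions, transitions, gold):
    """Loss for one sentence: log Z minus the score of the manual label path.

    emissions[t][k]   : score of label k at position t (stand-in for Bi-LSTM output)
    transitions[j][k] : score of moving from label j to label k
    gold[t]           : index of the manually assigned label at position t
    """
    n_labels = len(transitions)
    # Score of the manually labelled path.
    path = emissions[0][gold[0]]
    for t in range(1, len(gold)):
        path += transitions[gold[t - 1]][gold[t]] + emissions[t][gold[t]]
    # Forward algorithm: log partition function over all label paths.
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][k]
            + log_sum_exp([alpha[j] + transitions[j][k] for j in range(n_labels)])
            for k in range(n_labels)
        ]
    return log_sum_exp(alpha) - path


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence (the prediction labels)."""
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        step, pointers = [], []
        for k in range(n_labels):
            best = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best)
            step.append(score[best] + transitions[best][k] + emissions[t][k])
        score = step
        back.append(pointers)
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-token sentence and labels B=0, I=1, O=2.
emissions = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.4], [0.1, 0.2, 1.8]]
transitions = [[-1.0, 1.0, 0.0], [-0.5, 0.5, 0.5], [0.5, -2.0, 0.2]]
gold = [0, 1, 2]

predicted = viterbi(emissions, transitions)
loss = crf_neg_log_likelihood(emissions, transitions, gold)
```

In a full model, gradient descent on this loss would update both the network that produces `emissions` and the `transitions` matrix, pushing the decoded path toward the manual labels.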

Abstract

The invention discloses a text sequence annotating system and method based on Bi-LSTM (Bidirectional Long Short-Term Memory) and CRF (Conditional Random Field). The system comprises a learning module and an annotating module. The annotating module comprises a word segmenting module, a corpus annotating module, and an adjusting and optimizing module; the corpus annotating module comprises a part-of-speech annotating module and an entity recognizing module. The method comprises the following steps: preprocessing an obtained corpus; inputting the preprocessed corpus into a preset learning model; adjusting and saving the parameters of the learning model; adding a corresponding prediction tag to the corpus according to the sequence classifying result output by the learning model; performing word segmentation on an unknown corpus; initially annotating the segmented unknown corpus with the adjusted learning model; adjusting and optimizing the initially annotated unknown corpus; and finally annotating the adjusted and optimized corpus. With this system and method, a user can adjust the word library as required, which provides a human-computer interactive adjusting function; annotation is automatic within the same field and semi-automatic across different fields; efficiency is increased; and cost is lowered.
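The adjust-and-optimize step between initial and final annotation can be sketched as a lexicon override. This is only an assumption about how a user-adjustable word library might be applied; the abstract does not describe the mechanism. The function names, the tag set (`nr` for person name, `n` for noun, `t` for time, `v` for verb), and the sample tokens are hypothetical.

```python
def adjust_annotations(tokens, initial_tags, user_lexicon):
    """Replace the model's tag with the user's tag wherever the lexicon has an entry."""
    return [user_lexicon.get(tok, tag) for tok, tag in zip(tokens, initial_tags)]


def format_annotation(tokens, tags):
    """Render the annotation as token/tag pairs separated by spaces."""
    return " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags))


# Hypothetical initial annotation from the trained model: the name is mis-tagged as a noun.
tokens = ["王宝泉", "今天", "上班"]
initial_tags = ["n", "t", "v"]

# Hypothetical user-maintained word library correcting the name's tag.
user_lexicon = {"王宝泉": "nr"}

final_tags = adjust_annotations(tokens, initial_tags, user_lexicon)
final_text = format_annotation(tokens, final_tags)
```

Keeping the override separate from the model means a user correction takes effect immediately, without retraining.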

Description

Technical field

[0001] The invention relates to the field of information technology, in particular to a text sequence tagging system and method based on Bi-LSTM and CRF.

Background technique

[0002] With the development of the Internet, mobile Internet and big data technology, the scale of text data resources has grown explosively. These resources mainly include social media (such as Weibo accounts, official accounts, Facebook, Twitter, etc.) and news media (such as People's Daily, Phoenix News, Sohu News, etc.) websites, as well as semi-structured data on encyclopedia websites such as Baidu Encyclopedia and Wikipedia. Natural Language Processing (NLP) plays a very important role in text information extraction. In text mining, extracting useful information from massive text data is of great value to enterprises and users. Sequence annotation is one of the most basic and commonly used NLP methods. ...


Application Information


IPC(8): G06F17/27, G06F17/30, G06N3/04, G06N3/08

Inventor: 金勇, 吴兵, 朱阳光, 李力

Owner: 武汉烽火普天信息技术有限公司
