Text sequence labeling algorithm using overlapping splitting rule

A text sequence labeling technology, applied in computing, instrumentation, and electrical digital data processing. It addresses problems such as long processing time, overly large models, and low computational space efficiency, with the effects of improved processing efficiency, good applicability, and improved model prediction.

Active Publication Date: 2020-03-27
朱利

AI Technical Summary

Problems solved by technology

[0026] In addition, because of the inherently autoregressive nature of recurrent neural networks, the computation must iterate step by step; if a sentence is too long, processing takes a great deal of time, which is unacceptable in engineering practice.
[0027] 2. For feature extractors such as CNN and Transformer, forced truncation directly degrades prediction quality.
Conversely, if the model is designed with a larger maximum sequence length, the model becomes too large and its computational space efficiency is low.

Examples

Embodiment 1

[0052] Overlap splitting: assuming that the maximum sentence length is 10 and the length of the overlapping part is 3, the following sentence can be split into several shorter clauses.

[0053] Example sentence 1: One of the most important tasks is the water safety and convenience of residents; see Table 2.

[0054] Table 2. Example sentence 1: case demonstration of overlapping splitting

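[0055]
Original sentence: That | middle | one | item | very | Heavy | want | of | work | do | At once | yes | live in | civil | of | use | water | install | Complete | 。
Split 1: That | middle | one | item | very | Heavy | want | of | work | do
Split 2: of | work | do | At once | yes | live in | civil | of | use | water
Split 3: of | use | water | install | Complete | 。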

[0056] Therefore, after splitting, the sentence becomes the clauses shown above, each of which satisfies the model's maximum sequence length, which can solve t...
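
The splitting step in this embodiment can be sketched in Python as follows. This is a minimal illustration rather than the patent's reference implementation: the function name `overlap_split` and its parameters are hypothetical, and it assumes each new window simply advances by (maximum length minus overlap length) tokens.

```python
def overlap_split(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len tokens.

    Adjacent windows share `overlap` tokens. The name and signature are
    illustrative assumptions, not taken from the patent text.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(tokens) <= max_len:
        return [list(tokens)]  # short text needs no splitting

    stride = max_len - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows


if __name__ == "__main__":
    tokens = list(range(20))  # stand-in for a 20-token sentence
    for window in overlap_split(tokens, max_len=10, overlap=3):
        print(window)
```

With a 20-token input, a maximum length of 10, and an overlap of 3, this sketch yields windows of 10, 10, and 6 tokens, and each window begins with the last three tokens of the previous one.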

Embodiment 2

[0060] Description: one entity (or word) contains another entity (or word), that is, there is a containment relationship.

[0061] If the overlapping parts of both clauses contain entities (or words) that reach the truncation boundary (B, E, or S label), the two results are merged directly and the longer entity (or word) is kept. This applies to all three tasks: word segmentation, part-of-speech tagging, and named entity recognition. (1) In the named entity recognition result of Example 2 below, "Guiyang City Big Data Center" contains "Big Data Center", so the longer entity "Guiyang City Big Data Center" is kept; see Table 3.

[0062] Table 3. Example 2: case demonstration of overlapping splitting

[0063]
Token     | Overlap 1      | Overlap 2
expensive | O              |
State     | O              |
exist     | O              |
expensive | B-Organization | O
Positive  | I-Organization | O
city      | I-Organization | O
Big       | I-Organization | B-Organization
nu...
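
The "keep the longer entity" rule can be sketched as below. This is only an illustration under stated assumptions: entities are represented as hypothetical `(start, end, label)` tuples with half-open ranges in whole-sentence token offsets (a representation the source does not specify), and the function name `keep_longer_entities` is invented for the example.

```python
def keep_longer_entities(spans_a, spans_b):
    """Union two entity lists, dropping any span contained in a longer one.

    Each span is (start, end, label), covering tokens [start, end) in
    whole-sentence offsets. This representation is an assumption made
    for illustration only.
    """
    candidates = sorted(set(spans_a) | set(spans_b))
    kept = []
    for span in candidates:
        start, end, _ = span
        covered = any(
            other != span
            and other[0] <= start and end <= other[1]   # other contains span
            and (other[1] - other[0]) > (end - start)   # and is strictly longer
            for other in candidates
        )
        if not covered:
            kept.append(span)
    return kept


if __name__ == "__main__":
    # Hypothetical offsets: the longer organization name covers tokens 3..10,
    # the shorter one covers tokens 6..10.
    clause_1 = [(3, 11, "Organization")]
    clause_2 = [(6, 11, "Organization")]
    print(keep_longer_entities(clause_1, clause_2))
```

Working on spans rather than raw B/I/E/S tags keeps the containment check to a simple interval comparison; converting tags to spans and back is left out of this sketch.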

Embodiment 3

[0069] If only one of the two overlapping parts contains an entity (or word) that reaches the truncation boundary (B, E, or S label), that entity (or word) is discarded before merging.

[0070] (1) The named entity recognition results of Example 4 are as follows: the characters "政" ("politics") and "市" ("city") are the first and last characters of the two overlapping parts, respectively. One side has an entity at the boundary and the other does not, so the complete entity "government procurement network" is ignored and the results are then merged.

[0071] Table 5. Example 4: case demonstration of overlapping splitting

[0072]
Token      | Overlap 1      | Overlap 2
expensive  | B-Organization |
State      | I-Organization |
Province   | I-Organization |
politics   | I-Organization | B-Organization
government | I-Organization | I-Organization
Pick       | I-Organization | I-Organization
purchase   | I-Organization | I-Organization
network    | E-Organization | E-Organizatio...
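
One possible reading of this rule is sketched below. It rests on several assumptions that the source does not confirm: entities are hypothetical `(start, end, label)` tuples with half-open ranges in whole-sentence offsets, the left clause is cut at `overlap_end`, the right clause is cut at `overlap_start`, and an entity "reaches the truncation boundary" when it ends or starts exactly at its own clause's cut point. The function name is invented for the example.

```python
def merge_dropping_boundary_entity(left_spans, right_spans,
                                   overlap_start, overlap_end):
    """Merge two clause results, discarding a one-sided boundary entity.

    Assumed interpretation: the left clause is truncated at overlap_end and
    the right clause begins at overlap_start. Spans are (start, end, label)
    over tokens [start, end) in whole-sentence offsets.
    """
    # Entities that touch the cut point of their own clause.
    left_cut = [s for s in left_spans if s[1] == overlap_end]
    right_cut = [s for s in right_spans if s[0] == overlap_start]

    if bool(left_cut) != bool(right_cut):
        # Exactly one side has a boundary-touching entity: discard it.
        left_spans = [s for s in left_spans if s not in left_cut]
        right_spans = [s for s in right_spans if s not in right_cut]

    return sorted(set(left_spans) | set(right_spans))


if __name__ == "__main__":
    # Hypothetical offsets: the left clause sees a full entity over tokens
    # 0..7; the right clause starts at token 3 and sees only its tail.
    left = [(0, 8, "Organization")]
    right = [(3, 8, "Organization")]
    print(merge_dropping_boundary_entity(left, right,
                                         overlap_start=3, overlap_end=10))
```

In this hypothetical input only the right clause's entity touches a cut point, so it is dropped and the left clause's longer entity survives. When both sides have boundary-touching entities, nothing is dropped here and the containment rule of Embodiment 2 would apply instead.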

Abstract

The invention provides a text sequence labeling algorithm using an overlapping splitting rule. As a matter of practice, when a deep-learning-based sequence labeling model is built, its maximum sequence length is fixed to a finite value in the training stage; in the prediction stage, natural text often exceeds this maximum sequence length, and the model's F1 value then drops. Under the overlapping splitting rule, when the text to be predicted exceeds the model's maximum sequence length, the over-long text is split into several sub-sequences, each no longer than the maximum sequence length, with overlapping regions between adjacent sub-sequences. The rule is suitable for different types of feature extractor models and can improve model prediction to a certain extent; for RNN feature extractors, it can also greatly improve processing efficiency. The algorithm can be applied broadly in engineering projects built on already-completed sequence labeling models.

Description

Technical field
[0001] The invention belongs to the field of natural language processing. It relates in particular to natural language sequence labeling algorithms, such as word segmentation, part-of-speech tagging, and named entity recognition, and more specifically to a text sequence labeling algorithm using overlapping splitting rules.

Background
[0002] Most of the knowledge and information in human society is recorded in the form of language and characters created by humans, and computers can store and record text conveniently and quickly. However, computers can only transmit and store this information; they cannot directly recognize, understand, or use language. Natural language processing is the algorithmic technology for processing human natural-language text.
[0003] Among its tasks, word segmentation, part-of-speech (POS) tagging, and named entity recognition are the basic tasks of natural language processing.
[0004] 1) Segmentation, which divides a sentence (sequence of word...

Application Information

IPC(8): G06F40/289, G06F40/295, G06F40/253
CPC: Y02D10/00
Inventor: 朱利, 崔诚煜, 李元伟, 陈杭
Owner: 朱利