Multi-granularity word segmentation method and system based on sequence labeling modeling

A sequence-labeling-based word segmentation technique, applied in biological neural network models, special data processing applications, instruments, etc. It addresses problems such as the difficulty of multi-granularity word segmentation, the lack of multi-granularity word segmentation data, and the complexity of the annotation process.

Active Publication Date: 2018-02-23
SUZHOU UNIV
Cites: 4; Cited by: 14

AI Technical Summary

Problems solved by technology

Existing word segmentation data is all single-granularity, and no method exists for obtaining multi-granularity word segmentation data. One option is to produce such data by manual annotation.
However, manual annotation has the following disadvantages:
(1) Formulating a multi-granularity word segmentation annotation specification is very difficult, clearly more so than formulating a single-granularity one.
(2) The requirements on annotators are higher, because they must learn a more complex annotation specification.
(3) The annotation process is more complicated, because the annotation result changes from a sequence structure to a hierarchical structure.
In short, the labor and time costs of manually annotating multi-granularity word segmentation data are very high.



Examples


Embodiment 1

[0057] In this embodiment, the multi-granularity word segmentation method based on sequence annotation modeling includes:

[0058] Select three single-granularity annotated datasets with different specifications, namely the CTB, PPD, and MSR word segmentation specifications;

[0059] Sentences in one single-granularity annotated dataset are converted into word segmentation sequences that comply with the other two word segmentation specifications, so that each converted sentence corresponds to three different word segmentation sequences;

[0060] The three word segmentation sequences corresponding to each sentence are converted into a multi-granularity word segmentation hierarchy, whose layers are, in order, the sentence, words that cannot be further merged with other words to form a coarser-grained word, words, and characters;

[0061] Determine the multi-granularity label of each word in the multi-granularity word segmentation hierarchy according to the pr...
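
The step of merging several word segmentation sequences into one hierarchy can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names, the span representation, and the toy sentence are all hypothetical, and the three example segmentations are invented rather than taken from the CTB, PPD, or MSR data. The sketch also assumes that word spans from different specifications nest and never cross, which the text above does not state.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def to_spans(words: List[str]) -> List[Span]:
    """Turn a word sequence into character spans."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))
        start += len(w)
    return spans


def build_hierarchy(segmentations: List[List[str]]) -> Dict[str, object]:
    """Merge n segmentations of one sentence into layered span lists."""
    sentence = "".join(segmentations[0])
    for seg in segmentations:
        if "".join(seg) != sentence:
            raise ValueError("all segmentations must cover the same characters")

    # Every word from every specification becomes a node in the hierarchy.
    words: Set[Span] = set()
    for seg in segmentations:
        words.update(to_spans(seg))

    # Assumption: spans must nest; crossing spans cannot form a tree.
    for a in words:
        for b in words:
            if a[0] < b[0] < a[1] < b[1]:
                raise ValueError(f"crossing spans {a} and {b}")

    # Top-level words are not contained in any other word.
    top = [s for s in words
           if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in words)]

    return {
        "sentence": (0, len(sentence)),
        "top_words": sorted(top),
        "words": sorted(words),
        "characters": [(i, i + 1) for i in range(len(sentence))],
    }


# Hypothetical segmentations of one toy sentence under three specifications.
toy = [["全国", "各地"], ["全国各地"], ["全", "国", "各地"]]
hierarchy = build_hierarchy(toy)
```

Representing words as character spans keeps the layers comparable across specifications, since two specifications agree on a word exactly when they produce the same span.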

Embodiment 2

[0085] In this embodiment, the multi-granularity word segmentation method based on sequence annotation modeling includes:

[0086] Select three single-granularity annotated datasets with different specifications, namely the CTB, PPD, and MSR word segmentation specifications;

[0087] The sentences in two of the single-granularity annotated datasets are converted into word segmentation sequences that comply with the other two word segmentation specifications, so that each converted sentence corresponds to three different word segmentation sequences;

[0088] The three word segmentation sequences corresponding to each sentence are converted into a multi-granularity word segmentation hierarchy, whose layers are, in order, the sentence, words that cannot be further merged with other words to form a coarser-grained word, words, and characters;

[0089] Determine the multi-granularity label of each word in the multi-granularity word segmentation hierarchy accordi...
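
One way to turn a hierarchy into a per-position label sequence is sketched below. The actual predetermined coding method is not given in the text available here, so the encoding is purely an assumption: each character receives the concatenation of its B/I/E/S position tags in every word that covers it, ordered from the finest word to the coarsest. The function names and the toy segmentations are hypothetical, and the character is assumed to be the labeled unit.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def position_tag(i: int, span: Span) -> str:
    """B/I/E/S tag of character i inside a word span."""
    start, end = span
    if end - start == 1:
        return "S"
    if i == start:
        return "B"
    if i == end - 1:
        return "E"
    return "I"


def encode_labels(segmentations: List[List[str]]) -> List[str]:
    """Hypothetical encoding: one combined label per character."""
    sentence = "".join(segmentations[0])
    spans: Set[Span] = set()
    for seg in segmentations:
        if "".join(seg) != sentence:
            raise ValueError("all segmentations must cover the same characters")
        start = 0
        for w in seg:
            spans.add((start, start + len(w)))
            start += len(w)

    labels = []
    for i in range(len(sentence)):
        # Words covering character i, finest (shortest) first.
        covering = sorted((s for s in spans if s[0] <= i < s[1]),
                          key=lambda s: s[1] - s[0])
        labels.append("".join(position_tag(i, s) for s in covering))
    return labels


# Invented toy segmentations, not taken from any of the named corpora.
toy = [["全国", "各地"], ["全国各地"], ["全", "国", "各地"]]
labels = encode_labels(toy)
```

Because every character ends up with exactly one combined label, an ordinary sequence labeling model can be trained on the resulting (sentence, label sequence) pairs without any change to its output layer beyond a larger label set.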

Embodiment 3

[0117] The multi-granularity word segmentation method based on sequence labeling modeling in this embodiment differs from Embodiment 1 in how the multi-granularity word segmentation sequences are acquired. Specifically, the acquisition includes:

[0118] Select two single-granularity annotated datasets with different specifications, namely the PPD and CTB word segmentation specifications. In this embodiment, only the conversion of the sentence "this diving team was established in the mid-1980s" from the PPD data into the CTB specification is listed in detail. Similarly, the sentence "re-employment population in the province has increased in recent years" in the single-granularity annotated dataset that complies with the CTB specification is converted into a word segmentation sequence that complies with the PPD specification, that is, the converted sentences in the single-grained...


Abstract

The invention relates to a multi-granularity word segmentation method and system based on sequence labeling modeling, and provides a method and system for acquiring a multi-granularity label sequence by means of machine learning. The method comprises the following steps: sentences in at least one single-granularity annotated dataset are converted into word segmentation sequences complying with the other n-1 word segmentation specifications; the n word segmentation sequences that correspond to each sentence and comply with the different specifications are converted into a multi-granularity word segmentation hierarchical structure; a multi-granularity label for each word in each sentence is obtained according to a predetermined coding method and the multi-granularity word segmentation hierarchical structure, which yields a multi-granularity label sequence for each sentence; and a sequence labeling model is trained on the dataset of sentences and their multi-granularity label sequences to obtain a multi-granularity sequence labeling model. The invention puts forward the concept of multi-granularity word segmentation for the first time, and the multi-granularity word segmentation hierarchical structures can be obtained quickly and automatically.

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a multi-granularity word segmentation method and system based on sequence labeling modeling.

Background

[0002] Traditional word segmentation is single-granularity: a continuous character sequence can be grouped into only one word sequence according to a specified specification. Multi-granularity word segmentation divides a continuous character sequence into multiple word sequences of different granularities according to different specifications.

[0003] At present, word segmentation tasks are all single-granularity tasks, and the existing manually annotated word segmentation data is all single-granularity data. Therefore, no multi-granularity word segmentation data exists at home or abroad. The premise of multi-granularity word segmentation is a mul...
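
The distinction between single-granularity and multi-granularity segmentation can be made concrete with a small sketch. Everything here is illustrative: the specification names `spec_a`, `spec_b`, and `spec_c` are placeholders, the toy sentence is invented, and word count is used only as a rough proxy for granularity.

```python
from typing import Dict, List


def rank_by_granularity(chars: str, segmentations: Dict[str, List[str]]) -> List[str]:
    """Order specification names from finest to coarsest segmentation."""
    for name, words in segmentations.items():
        # Each specification must segment exactly the same characters.
        if "".join(words) != chars:
            raise ValueError(f"{name} does not segment the given sentence")
    # More words for the same characters means a finer granularity.
    return sorted(segmentations, key=lambda n: len(segmentations[n]), reverse=True)


# Single-granularity: one word sequence for the sentence.
single = ["全国", "各地"]

# Multi-granularity: several word sequences over the same characters.
multi = {
    "spec_a": ["全国", "各地"],
    "spec_b": ["全国各地"],
    "spec_c": ["全", "国", "各地"],
}
order = rank_by_granularity("全国各地", multi)
```

A single-granularity segmenter would return only one of these word sequences, whereas a multi-granularity segmenter is expected to recover all of them at once.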


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/04
CPC: G06N3/049, G06F40/284
Inventor: 张民, 李正华, 龚晨
Owner: SUZHOU UNIV