New word discovery-based cross-domain Chinese word segmentation system and method

A technology for new word discovery and Chinese word segmentation, applied in semantic analysis, instruments, biological neural network models, and related fields. It addresses the problems that manual annotation is time-consuming and impractical and that cross-domain Chinese word segmentation rarely achieves good results.

Active Publication Date: 2021-07-06
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

The most direct way to close the expression gap between domains is to manually label a target-domain corpus, mix it with the source-domain corpus, and retrain the model. However, large-scale manual labeling requires a great deal of manpower and material resources, and it is impossible to label every domain by hand, so this approach is not feasible. The most direct way to handle unregistered (out-of-vocabulary) words is to have professionals extract from the target-domain corpus the words that never appear in the source-domain corpus and add them to the training corpus. However, selecting these words also requires a great deal of manpower and material resources, and because new words emerge continuously, manual selection can never cover all of them.
As a result, cross-domain Chinese word segmentation has struggled to achieve good results.

Examples

Embodiment 1

[0079] The structural block diagram of the cross-domain Chinese word segmentation system based on new word discovery disclosed in this embodiment is shown in Figure 1. The system consists of a new word discovery module, an automatic tagging module, and a cross-domain word segmentation module, connected in sequence. They are used, respectively, to mine new words from the unlabeled target-domain corpus, to automatically label the unlabeled target-domain corpus, and to train the neural network for cross-domain Chinese word segmentation.

[0080] In this embodiment, the structural block diagram of the new word discovery module is shown in Figure 2. The module consists of a candidate word extraction submodule, an enhanced mutual information extraction submodule, an adjacency entropy extraction submodule, and a candidate word filtering submodule, wherein the candidate word extr...
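The two statistics named by these submodules can be illustrated with a minimal sketch. It uses standard pointwise mutual information and standard adjacency (branch) entropy, not the patent's enhanced mutual information, whose definition is not given in this excerpt. The function names, the `counts`/`total` inputs, and the use of a single normalizer for all n-gram lengths are assumptions made for illustration.

```python
import math
from collections import Counter

def pointwise_mutual_information(word, counts, total):
    """Minimum PMI (in bits) over all two-way splits of `word`.

    Standard PMI only; this is NOT the patent's enhanced variant.
    Assumes every substring of `word` has a non-zero count in `counts`.
    """
    p_word = counts[word] / total
    best = math.inf  # single-character words have no split and stay at inf
    for i in range(1, len(word)):
        p_left = counts[word[:i]] / total
        p_right = counts[word[i:]] / total
        best = min(best, math.log2(p_word / (p_left * p_right)))
    return best

def adjacency_entropy(neighbors):
    """Shannon entropy (in bits) of the characters observed next to a candidate.

    `neighbors` maps each adjacent character to how often it was seen.
    """
    total = sum(neighbors.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())
```

A high PMI suggests the characters of a candidate cohere internally, while high left and right adjacency entropy suggests the candidate appears in varied contexts; a filter could threshold both, although the thresholds the patent uses are not stated here.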

Embodiment 2

[0087] This embodiment provides a cross-domain Chinese word segmentation method that uses the above cross-domain Chinese word segmentation system based on new word discovery. It segments corpora from different domains through the following steps:

[0088] Step S1: Use the new word discovery module to mine a vocabulary of target-domain new words from the target-domain corpus. Step S1 includes the following sub-steps:

[0089] Step S1.1: Use the candidate word extraction sub-module to extract all candidate words whose length does not exceed n from the unlabeled target domain corpus.

[0090] In this embodiment, the candidate word extraction submodule splits the corpus according to non-Chinese characters, sets the maximum candidate word length to 6, and extracts all candidate words whose length does not exceed 6 from the sentences of the segmented corp...
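The candidate extraction in step S1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: "Chinese characters" is taken to mean the basic CJK block U+4E00 to U+9FFF, single characters are counted as candidates, and the name `extract_candidates` is hypothetical.

```python
import re
from collections import Counter

# Runs of CJK Unified Ideographs; every other character acts as a separator.
# Assumption: "Chinese characters" means the basic block U+4E00..U+9FFF.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def extract_candidates(corpus, max_len=6):
    """Count every substring of length 1..max_len inside each Chinese run."""
    counts = Counter()
    for run in CHINESE_RUN.findall(corpus):
        for start in range(len(run)):
            for length in range(1, max_len + 1):
                if start + length > len(run):
                    break  # substring would cross the end of the run
                counts[run[start:start + length]] += 1
    return counts
```

Because the corpus is split at non-Chinese characters first, no candidate spans punctuation, digits, or Latin text. The resulting frequency table is the kind of input that later statistics such as mutual information would need.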

Abstract

The invention discloses a cross-domain Chinese word segmentation system and method based on new word discovery. The system comprises: a new word discovery module, which implements a new word discovery algorithm using enhanced mutual information that combines statistical and semantic information, and mines a new word list from an unlabeled corpus; an automatic tagging module, which performs an initial segmentation of the unlabeled corpus using the new word list together with a reverse maximum matching algorithm to obtain an incompletely segmented corpus, and then fully segments that corpus with a word segmentation model to obtain an automatically labeled corpus; and a cross-domain word segmentation module, which implements a cross-domain Chinese word segmentation algorithm using an adversarial method, carrying out adversarial training on the labeled source-domain corpus and the automatically labeled corpus. Enhanced mutual information improves the new word discovery algorithm, raising both the accuracy of new word discovery and the domain relevance of the word list; in the cross-domain word segmentation algorithm, unlabeled corpora are used more fully, and the recall and accuracy of word segmentation are improved.
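The reverse maximum matching step mentioned in the abstract can be sketched as below. This is the textbook dictionary-based algorithm, shown as an illustration only; the function name, the `max_len` default of 6, and the choice to emit unmatched characters as single-character tokens are assumptions, and the patent's own handling of unmatched spans is not described in this excerpt.

```python
def reverse_maximum_matching(sentence, vocabulary, max_len=6):
    """Scan right to left, taking the longest vocabulary word ending at each position."""
    tokens = []
    end = len(sentence)
    while end > 0:
        # Try the longest window first, shrinking until a match is found.
        for length in range(min(max_len, end), 0, -1):
            piece = sentence[end - length:end]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to make progress.
            if length == 1 or piece in vocabulary:
                tokens.append(piece)
                end -= length
                break
    tokens.reverse()  # tokens were collected from the end of the sentence
    return tokens
```

For example, with the vocabulary `{"研究", "研究生", "生命", "起源"}`, the sentence `"研究生命起源"` is split into `["研究", "生命", "起源"]`, because scanning from the right matches `"起源"` and `"生命"` before `"研究生"` can be considered. Characters not covered by the word list remain as single-character tokens, which corresponds to a corpus that is only partially segmented.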

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a cross-domain Chinese word segmentation system and method based on new word discovery.

Background

[0002] Chinese text uses Chinese characters as the smallest writing unit; characters combine to form words, and words make up the text. Words are the smallest structural units in Chinese text that carry semantic information and can be used independently. However, unlike English and other languages, Chinese has no explicit separators between words. Chinese word segmentation is the process of dividing Chinese text into words by technical means so that computers can process it. It is the most basic task in Chinese natural language processing and the cornerstone of tasks such as text classification, text generation, and sentiment analysis. There...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F40/289, G06F40/30, G06N3/04
CPC: G06F40/289, G06F40/30, G06N3/045
Inventors: 张军, 李学, 宁更新, 杨萃, 冯义志, 余华, 陈芳炯, 季飞
Owner: SOUTH CHINA UNIV OF TECH