Semi-automatic word segmentation corpus labeling and training device

A semi-automatic training device, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the difficulty of organizing varied language information into a form that machines can read directly, with the effects of reducing labor costs, improving efficiency and accuracy, and reducing complexity.

Active Publication Date: 2019-09-27
10TH RES INST OF CETC
8 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

Due to the generality and complexity of Chinese language knowledge, it is difficult to organize various language information into a form that can be directly read by machines. Therefore, the word segmentation system based on comprehension is still in the experimental stage.



Embodiment Construction

[0023] See the figure. In the preferred embodiment described below, a semi-automatic word segmentation corpus labeling and training device includes four modules: a text corpus labeling preparation module, a semi-automatic corpus word segmentation labeling module, a feedback model learning and training module, and a word segmentation labeling model effect evaluation module. The device is characterized in that the text corpus labeling preparation module prepares the labeling tasks: it distinguishes data from different sources, selects corpus sources, pre-labels the corpus data to be labeled by source or subject for a single word segmentation, and manages both the corpus to be labeled and the segmented corpus. Then, through multiple word segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, conditional random fields (CRF), JIEBA, and bidirectional LSTM networks (BI-LSTM), it submits the raw corpus wo...
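The dictionary-based bidirectional maximum matching named in the paragraph above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `max_len` parameter, the toy dictionary, and the tie-breaking rule (fewer words, then fewer single-character words, then the backward result) are all assumptions drawn from the common textbook form of the algorithm.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left segmentation: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def count_single_chars(words):
    return sum(1 for w in words if len(w) == 1)


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both directions and pick one result with a simple heuristic (an assumption here)."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same word count: prefer fewer single-character words, else the backward result.
    return fwd if count_single_chars(fwd) < count_single_chars(bwd) else bwd


# Hypothetical toy dictionary, for illustration only.
demo_dictionary = {"研究", "研究生", "生命", "起源"}
print(bidirectional_max_match("研究生命起源", demo_dictionary))
```

On this toy input the forward pass greedily takes 研究生 and strands 命 as a single character, while the backward pass yields three dictionary words, so the heuristic selects the backward result.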



Abstract

The invention discloses a semi-automatic word segmentation corpus labeling and training device that aims to overcome the defects of the corpora used in word segmentation corpus labeling and training. The device is realized through the following technical scheme. A text corpus labeling preparation module manages the corpora to be labeled and the segmented corpora. Based on multiple word segmentation algorithms, such as bidirectional maximum matching based on an integrated dictionary, CRF, and JIEBA, the word segmentation labeling work for the raw corpus is submitted to a semi-automatic corpus word segmentation labeling module, which creates word segmentation labeling tasks, selects an applicable labeling algorithm model, and carries out automatic labeling. On the basis of the fused automatic labeling results, a training model corpus and a labeling model generated by the text corpus labeling preparation module are fed back to the feedback model learning and training module, which selects and carries out model learning and training, calls a unified training model interface to generate a core dictionary, and updates the word segmentation training model table. A comprehensive labeling algorithm evaluation model is then established to evaluate the labeling effect of the model, which completes a new word segmentation labeling task.
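The abstract mentions fusing the automatic labeling results of several algorithms but, in the text available here, does not say how. One simple possibility, shown purely as an assumption, is majority voting over word boundaries: a boundary is kept only if more than half of the segmenters place a cut at that character offset. The function names below are hypothetical.

```python
from collections import Counter


def boundaries(words):
    """Character offsets at which a word ends, excluding the end of the text."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def fuse_by_vote(text, segmentations):
    """Fuse several segmentations of the same text by majority vote on boundaries.

    Assumed scheme, not taken from the patent: keep a boundary when strictly
    more than half of the segmentations contain it.
    """
    if not text:
        return []
    for seg in segmentations:
        if "".join(seg) != text:
            raise ValueError("segmentation does not reproduce the text")
    votes = Counter()
    for seg in segmentations:
        votes.update(boundaries(seg))
    threshold = len(segmentations) / 2
    cuts = sorted(p for p, n in votes.items() if n > threshold)
    # Rebuild the word list from the surviving cut points.
    words, start = [], 0
    for cut in cuts + [len(text)]:
        words.append(text[start:cut])
        start = cut
    return words


# Three hypothetical segmenter outputs for the same sentence.
outputs = [
    ["研究生", "命", "起源"],
    ["研究", "生命", "起源"],
    ["研究", "生命", "起源"],
]
print(fuse_by_vote("研究生命起源", outputs))
```

In a semi-automatic workflow, boundaries on which the segmenters disagree are natural candidates to route to a human annotator rather than resolve by vote alone.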

Description

Technical field

[0001] The invention relates to the technical field of text mining, in particular to a semi-automatic labeling and training device for word segmentation data.

Background art

[0002] Words are the smallest meaningful language components that can be used independently, but Chinese has no obvious marks that separate one word from the next. Chinese word analysis is therefore the basis and key of Chinese information processing. The accuracy of word segmentation is closely related to the accuracy of part-of-speech tagging, and integrating the two processes helps to eliminate ambiguity and improve overall efficiency. A Chinese sentence is composed of consecutive words with no spaces between them. Part-of-speech tagging refers to the process of determining an appropriate part of speech for each word in a sentence. Chinese word segmentation is the first "process" of Chinese information...
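Since the background stresses segmentation accuracy, it may help to see one common way to measure it. The sketch below scores a predicted segmentation against a reference by comparing word spans and computing precision, recall, and F1. The metric choice and the function names are assumptions for illustration; the text available here does not state which measures the device's evaluation module uses.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall, and F1: a word counts only if both ends match."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference and prediction for the same sentence.
gold = ["研究", "生命", "起源"]
pred = ["研究生", "命", "起源"]
print(segmentation_prf(gold, pred))
```

Only 起源 has matching start and end offsets in both lists, so one of three predicted words is correct and one of three reference words is recovered.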


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62
CPC: G06F40/211, G06F40/289, G06F18/214
Inventor: 代翔, 崔莹, 黄细凤, 孙涛, 李强
Owner: 10TH RES INST OF CETC