Keyword Corpus Labeling Training Extraction System

A keyword extraction system technology, applied in natural language data processing, instruments, computing, etc. It addresses the scarcity of keyword corpora, the weaker results of unsupervised methods compared with supervised methods, and low labeling efficiency. Its effects are reduced labor cost, improved corpus labeling efficiency, and reduced complexity.

Active Publication Date: 2022-07-08
10TH RES INST OF CETC

AI Technical Summary

Problems solved by technology

Unsupervised and supervised methods each have advantages and disadvantages. An unsupervised method does not require a manually labeled training set, so it is faster to apply, but because it cannot combine multiple kinds of information when ranking candidate words, its results may be worse than those of a supervised method. A supervised method can learn through training how much each kind of information should influence the keyword decision, so its results are better. However, labeling training sets is very time-consuming and labor-intensive.
The disadvantage of supervised text keyword extraction algorithms is their high labor cost.
The third category of method recognizes words by having the computer simulate human understanding of a sentence. Because Chinese semantics are complex, it is difficult to organize the various kinds of linguistic information into a machine-readable form. Because a large training corpus must be labeled, and manual labeling is time-consuming and labor-intensive, this kind of word segmentation system is still at the experimental stage.
At present, keyword corpora in this field are relatively scarce, and keyword corpus labeling is done mainly by hand. Common problems include poor labeling quality, a cumbersome labeling process, low labeling efficiency, and high human resource costs.
In addition, existing keyword corpus labeling systems offer only a single labeling method, and their labeling models are difficult to update automatically. There is therefore an urgent need for a semi-automatic keyword labeling and training platform that assists manual corpus labeling and solves the above problems.
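
The limitation of unsupervised ranking described above can be illustrated with a minimal sketch. It assumes TF-IDF as the single ranking signal; the function name `tfidf_keywords` and the toy documents are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the tokens of docs[doc_index] by TF-IDF and return the top_k.

    Hypothetical helper for illustration: it uses one signal only
    (term frequency weighted by inverse document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each token.
    df = Counter(token for doc in docs for token in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        token: (count / total) * math.log(n_docs / df[token])
        for token, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Toy, pre-tokenized documents (assumed data, not from the patent).
docs = [
    ["keyword", "extraction", "ranks", "candidate", "words"],
    ["manual", "labeling", "of", "training", "words", "is", "slow"],
    ["keyword", "labeling", "needs", "training", "data"],
]
top = tfidf_keywords(docs, 0, top_k=3)
print(top)
```

No labeled data is needed here, which is the speed advantage the text mentions. The ranking depends on a single statistic, though, so tokens that are rare but unimportant can outrank genuine keywords.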

Examples

Embodiment Construction

[0016] See Figure 1. In the preferred embodiment described below, a keyword corpus labeling, training, and extraction system includes: a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback keyword labeling model learning and training module, and a keyword labeling model effect evaluation module. The keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects the corpus source appropriate to each intended use, and sets it as the corpus to be labeled for that use, i.e., the corpus. The semi-automatic corpus keyword labeling module first creates a keyword labeling task, then selects a suitable algorithm according to the labeling requirements and the characteristics of the corpus, and performs automatic labeling based on algorithm models. By integrating CHI, LDA, and graph-based ranking keyword extraction algorithms, at least one keywor...
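
The graph-based ranking mentioned in this paragraph can be sketched as follows. This is a minimal illustration that assumes a TextRank-style algorithm over a word co-occurrence graph; the function name `textrank_keywords`, the window size, the damping factor, and the sample tokens are all assumptions, not details from the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank tokens with a simplified, unweighted TextRank (hypothetical helper)."""
    # Build an undirected co-occurrence graph: two distinct tokens are
    # linked when they appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != token:
                neighbors[token].add(tokens[j])
                neighbors[tokens[j]].add(token)

    # PageRank-style update: each node shares its score equally among
    # its neighbors; all scores are updated simultaneously per iteration.
    scores = {t: 1.0 for t in neighbors}
    for _ in range(iterations):
        scores = {
            t: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[t])
            for t in neighbors
        }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy token sequence in which "corpus" co-occurs with every other word.
tokens = ["corpus", "labeling", "corpus", "training",
          "corpus", "evaluation", "corpus", "extraction"]
best = textrank_keywords(tokens, window=1, top_k=1)
print(best)
```

In this toy sequence "corpus" is linked to every other token, so it receives the highest score. A production system would normally filter stop words and restrict candidates by part of speech before building the graph.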

Abstract

The invention discloses a keyword corpus labeling, training, and extraction tool. It aims to provide a labeling and training tool that reduces the complexity of the manual labeling process and improves the labeling efficiency and accuracy of massive keyword corpora. The invention is realized through the following technical solution: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources; the semi-automatic corpus keyword labeling module creates a keyword labeling task, selects a suitable algorithm, and performs automatic labeling based on algorithm models. By integrating at least one keyword extraction algorithm from CHI, LDA, TEXTRANK, and TFIDF, it pre-labels the text corpus data to be labeled and merges the labeling results of the various algorithms. When the labeling task is completed, the feedback keyword labeling model learning and training module trains the keyword labeling algorithm model; the keyword labeling model effect evaluation module automatically evaluates the model's labeling effect using quantitative indicators.
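
The merging and evaluation steps in the abstract can be sketched as below. The patent text shown here does not specify how results are merged or which indicators are used, so this sketch assumes simple majority voting for merging and precision, recall, and F1 for evaluation. All names and sample keyword lists are hypothetical.

```python
from collections import Counter


def merge_by_vote(candidate_lists, min_votes=2):
    """Keep keywords proposed by at least `min_votes` extractors (assumed rule)."""
    votes = Counter(kw for kws in candidate_lists for kw in set(kws))
    return sorted(kw for kw, n in votes.items() if n >= min_votes)


def precision_recall_f1(predicted, gold):
    """Compare predicted keywords with manually confirmed (gold) keywords."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical pre-labeling outputs of three extractors for one document.
tfidf_out = ["corpus", "labeling", "keyword"]
textrank_out = ["corpus", "keyword", "module"]
lda_out = ["corpus", "training", "labeling"]

merged = merge_by_vote([tfidf_out, textrank_out, lda_out])

# Hypothetical keywords confirmed by a human annotator.
gold = ["corpus", "keyword", "extraction"]
precision, recall, f1 = precision_recall_f1(merged, gold)
print(merged, round(precision, 3), round(recall, 3), round(f1, 3))
```

Voting is only one possible merge rule. A union of all results favors recall, while an intersection favors precision, and the annotator's corrections to the merged pre-labels are what a feedback training step would learn from.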

Description

Technical field

[0001] The invention relates to the technical field of text mining, in particular to a semi-automatic labeling, training, and extraction system for keyword corpora.

Background technique

[0002] In the field of natural language processing, the key to processing massive text files is to extract the issues that users care about most. Whether a text is long or short, a few keywords are often enough to glimpse the theme of the entire text. Text-based recommendation and text-based search also depend heavily on text keywords, and the accuracy of keyword extraction directly affects the final performance of a recommendation or search system. Keyword extraction is therefore an important part of the field of text mining. The rapid development of the Internet has provided people with easy access to information, and the number of electronic documents such as web pages, emails, and e-books is increasi...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06F40/211, G06K9/62
CPC: G06F40/211, G06F40/289, G06F18/214
Inventor: 崔莹, 代翔, 黄细凤, 王侃, 杨拓, 余博, 朱宇涛, 李超, 李源源
Owner 10TH RES INST OF CETC