Corpus pre-processing method, corpus pre-tagging method and electronic device

A technology for corpus preprocessing and its results, applied in natural language data processing, electrical digital data processing, special data processing applications, etc. It addresses the problems of wasted manpower and reduced labeling efficiency, and reduces manual processing work.

Active Publication Date: 2019-03-08
XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

Most labeling work is done manually. In most cases the corpus has not been processed in advance and contains a large amount of duplicate data. If these duplicates are not filtered out, they reduce labeling efficiency and waste manpower.



Embodiment Construction

[0018] To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, each embodiment is described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that many technical details are provided in each implementation so that readers can better understand the present application. The technical solution claimed in this application can nevertheless be realized without these technical details, and with various changes and modifications based on the following implementations.

[0019] The first embodiment provided by the present invention is a text processing method.

[0020] Please refer to Figure 1, which shows a flow chart of the corpus preprocessing method provided by the first embodiment of the present invention.

[0021] As shown in Figure 1, the method for corpus prepro...


Abstract

The invention relates to natural language processing technology and provides a corpus preprocessing method. The method comprises the following steps: vectorizing each corpus item to obtain its text vector; clustering based on the text vectors and determining a special corpus from the corpus; performing named entity recognition on the special corpus to determine the named entities it contains; classifying the special corpus based on a target named entity; and extracting a first preset number of items from the special corpus of each class as the result of the preprocessing. With the method provided by the embodiment, a large amount of repetitive target corpus can be eliminated by preprocessing the original corpus data before subsequent manual annotation or other processing, so that repetitive manual processing work is greatly reduced.
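The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the bag-of-words vectors, the greedy cosine-similarity clustering, the dictionary-lookup entity finder, the rule that clusters with at least `min_cluster_size` members form the special corpus, and all function and parameter names are assumptions introduced here, since the source text does not specify them.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (assumed stand-in for any text-vector model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy single-pass clustering (assumed): an item joins the first
    cluster whose seed vector is at least `threshold` similar to it."""
    clusters = []  # each cluster is a list of indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def find_entities(text, gazetteer):
    """Toy dictionary-lookup entity finder (assumed stand-in for a real NER model)."""
    lowered = text.lower()
    return [entity for entity in gazetteer if entity in lowered]


def preprocess(corpus, gazetteer, min_cluster_size=2, per_class=1):
    """Vectorize, cluster, pick the special corpus, classify by entity,
    and keep the first `per_class` items of each class."""
    vectors = [vectorize(text) for text in corpus]
    # Assumption: items in sufficiently large clusters form the special corpus.
    special = [
        corpus[i]
        for members in cluster(vectors)
        if len(members) >= min_cluster_size
        for i in members
    ]
    classes = defaultdict(list)
    for text in special:
        entities = find_entities(text, gazetteer)
        # Assumption: the first entity found acts as the target named entity.
        classes[entities[0] if entities else None].append(text)
    return {key: texts[:per_class] for key, texts in classes.items()}
```

Calling `preprocess(messages, ["whitening", "braces"])` on a list of customer-service messages would return one representative message per entity class, which is the reduced set handed on to manual annotation. A production system would likely replace each stage with a stronger component, for example learned sentence embeddings and a trained named entity recognizer.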

Description

Technical field

[0001] The invention relates to natural language processing technology, in particular to a corpus preprocessing method, a corpus pre-tagging method and an electronic device.

Background

[0002] A corpus is the basic resource of corpus linguistics research and the main resource of empirical language research methods. Traditional corpora are mainly used in dictionary compilation, language teaching, traditional language research, statistical or case-based research in natural language processing, etc. With the development of Internet big data and artificial intelligence technology, corpora have also come into wide use.

[0003] A corpus has three characteristics. It stores language material that has actually occurred in real language use, such as user messages and customer service dialogues obtained directly from web pages. It is the basic resource for carrying language knowledge, but it does not mean Language knowled...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/332, G06F16/35, G06F17/27
CPC: G06F40/295
Inventor: 林志伟, 肖龙源, 蔡振华, 李稀敏, 刘晓葳, 谭玉坤
Owner: XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD