
A method and device for semi-supervised field word mining and classification

A semi-supervised domain word technology, applied to text database clustering/classification, character and pattern recognition, and text database querying. It addresses the difficulty of obtaining labeled corpora and the poor results that follow from it.

Active Publication Date: 2020-04-10
广东惠禾科技发展有限公司
Cites: 3. Cited by: 0.

AI Technical Summary

Problems solved by technology

Supervised methods require a large labeled corpus, and a labeled corpus is difficult to obtain in practice, so these methods perform poorly in actual use.



Examples


Embodiment 1

[0080] Embodiment 1 of the present invention discloses a method for semi-supervised field word mining and classification which, as shown in Figure 1, includes the following steps:

[0081] Step 101: perform word segmentation and syntactic analysis on the text data of the domain to be processed, and obtain the word vector matrix of all words in the text data from the word segmentation result;

[0082] Specifically, taking the medical domain as an example, text data can be obtained from medical websites with a web crawler or similar tool. Text data in other domains is handled the same way: any means of obtaining the corresponding text data is acceptable, and no specific method is required.

[0083] After the text data is obtained, word segmentation and syntactic analysis are performed;

[0084] In the above step, "obtaining the word vector matrix of all words in the text data based on the result of the word segmentation" includes:

[0085] Obtaining the result of word segmentation of the t...
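The source is truncated before it explains how the word vector matrix is computed, so the following is only a minimal sketch of the general idea in Step 101: turning segmented text into one vector per word. It assumes the text has already been segmented (here, lists of English tokens stand in for the output of a Chinese word segmenter) and uses plain co-occurrence counts as a stand-in vectorization, which is not necessarily the technique the patent uses. The names `build_vocab`, `cooccurrence_matrix`, and `window` are hypothetical.

```python
def build_vocab(segmented_sentences):
    """Map each distinct word to a row index, in order of first appearance."""
    vocab = {}
    for sentence in segmented_sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def cooccurrence_matrix(segmented_sentences, vocab, window=2):
    """Build a word-by-word count matrix; row i is the vector for word i.

    Assumption: co-occurrence counts replace whatever embedding method
    the patent actually specifies (that part of the source is elided).
    """
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for sentence in segmented_sentences:
        ids = [vocab[word] for word in sentence]
        for i, center in enumerate(ids):
            start = max(0, i - window)
            stop = min(len(ids), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    matrix[center][ids[j]] += 1
    return matrix


# Toy, already-segmented medical sentences (illustrative data only).
sentences = [
    ["aspirin", "relieves", "headache"],
    ["ibuprofen", "relieves", "fever"],
]
vocab = build_vocab(sentences)
word_vectors = cooccurrence_matrix(sentences, vocab, window=1)
```

Each row of `word_vectors` can then be looked up through `vocab`, for example `word_vectors[vocab["relieves"]]`.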

Embodiment 2

[0115] Embodiment 2 of the present invention discloses a device for semi-supervised field word mining and classification which, as shown in Figure 2, includes:

[0116] An acquisition module 201, configured to perform word segmentation and syntactic analysis on the text data of the domain to be processed, and to obtain the word vector matrix of all words in the text data from the word segmentation result;

[0117] A construction module 202, configured to start from a certain number of manually constructed seed words in the text data, expand the seed words based on their part-of-speech and syntactic composition patterns in the text data, and filter the seed words using word frequency, part of speech, and word vectors to obtain the seed vocabulary;

[0118] A generating module 203, configured to determine, for the seed vocabulary, the general similarity of any two words using word vectors, a knowledge base, statistical features, etc., and to generate a word similarity matrix with...
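As a rough illustration of what modules 202 and 203 describe, the sketch below expands a seed list using part-of-speech context patterns, filters by word frequency, and then builds a pairwise similarity matrix. It is a sketch under stated assumptions, not the patent's implementation: the pattern used here (previous tag, own tag, next tag) is only one possible reading of "part-of-speech and syntactic composition mode", the word-vector filter and the knowledge-base and statistical features are omitted, and cosine similarity stands in for the "general similarity". All function names, the `min_freq` parameter, and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def expand_seeds(tagged_sentences, seeds, min_freq=2):
    """Add words that occur in the same POS context as a seed and are frequent enough."""
    freq = Counter(word for sent in tagged_sentences for word, _ in sent)
    seed_patterns = set()
    occurrences = []  # one (word, pattern) pair per token
    for sent in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sent] + ["</s>"]
        for i, (word, tag) in enumerate(sent):
            pattern = (tags[i], tag, tags[i + 2])  # previous, own, next tag
            occurrences.append((word, pattern))
            if word in seeds:
                seed_patterns.add(pattern)
    expanded = set(seeds)
    for word, pattern in occurrences:
        if pattern in seed_patterns and freq[word] >= min_freq:
            expanded.add(word)
    return sorted(expanded)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(words, vectors):
    """Pairwise cosine similarity for the given words, in the given order."""
    return [[cosine(vectors[a], vectors[b]) for b in words] for a in words]


# Toy POS-tagged sentences ("n" = noun, "v" = verb); illustrative data only.
tagged = [
    [("aspirin", "n"), ("relieves", "v"), ("pain", "n")],
    [("ibuprofen", "n"), ("relieves", "v"), ("fever", "n")],
    [("ibuprofen", "n"), ("reduces", "v"), ("swelling", "n")],
]
seed_vocabulary = expand_seeds(tagged, {"aspirin"}, min_freq=2)

# Hypothetical word vectors for the expanded seed vocabulary.
vectors = {"aspirin": [1.0, 0.0], "ibuprofen": [1.0, 1.0]}
sim = similarity_matrix(seed_vocabulary, vectors)
```

Here "ibuprofen" joins the seed vocabulary because it appears twice as a sentence-initial noun followed by a verb, the same pattern as the seed "aspirin".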


Abstract

Embodiments of the invention disclose a semi-supervised method and device for mining and classifying domain words (field words). The method comprises the following steps: preprocessing a domain-related corpus and constructing a seed word list and a word similarity matrix; mining candidate domain words and determining the similarity distribution of the candidate domain words; and classifying the screened domain words into categories. Because the method and device work in a semi-supervised manner, domain word mining and classification can be completed from ordinary domain texts and a small seed word list, without a large amount of labeled data.
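The last two steps of the abstract (determining a similarity distribution for each candidate, then classifying the screened words) can be illustrated with a small sketch. The abstract does not define "similarity distribution", so this example assumes it means the mean cosine similarity between a candidate and the seed words of each category; the screening threshold, the function names, and the toy vectors are likewise hypothetical and not taken from the patent.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_distribution(vector, seeds_by_category, vectors):
    """Assumed definition: mean similarity to the seeds of each category."""
    return {
        category: sum(cosine(vector, vectors[s]) for s in seeds) / len(seeds)
        for category, seeds in seeds_by_category.items()
    }


def classify_candidates(candidates, seeds_by_category, vectors, threshold=0.5):
    """Keep candidates whose best category score passes the threshold, and label them."""
    labels = {}
    for word in candidates:
        dist = similarity_distribution(vectors[word], seeds_by_category, vectors)
        best = max(dist, key=dist.get)
        if dist[best] >= threshold:  # screening step: drop weak candidates
            labels[word] = best
    return labels


# Hypothetical 2-dimensional word vectors; illustrative data only.
vectors = {
    "aspirin": [1.0, 0.0], "ibuprofen": [0.9, 0.1],
    "headache": [0.0, 1.0], "fever": [0.1, 0.9],
    "paracetamol": [0.8, 0.2], "migraine": [0.2, 0.8], "table": [0.0, 0.0],
}
seeds_by_category = {
    "drug": ["aspirin", "ibuprofen"],
    "symptom": ["headache", "fever"],
}
labels = classify_candidates(
    ["paracetamol", "migraine", "table"], seeds_by_category, vectors
)
```

Only the small seed lists carry labels here; every other word is labeled through its similarity to them, which is the sense in which the approach needs no large labeled corpus.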

Description

Technical field

[0001] The invention relates to the field of domain word mining and classification, and in particular to a method and device for semi-supervised domain word mining and classification.

Background

[0002] Domain words are the words that best represent the characteristics of a domain and distinguish it from other domains, and they can be assigned different category labels according to their roles in the domain. Domain words and their categories constitute the basic vocabulary data of the domain; mining and classifying domain words is therefore important foundational work in Chinese information processing. Many Chinese information processing projects (such as automatic question answering, automatic summarization, automatic classification, and search engines) involve domain word mining and classification.

[0003] At present, the mining and classification algorithms of ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/33, G06F16/35, G06K9/62
Inventors: 高登科, 姚佳
Owner: 广东惠禾科技发展有限公司