A method and device for semi-supervised domain word mining and classification
Domain word mining and semi-supervised learning technology, applicable to text database clustering/classification, character and pattern recognition, text database querying, etc. It addresses problems such as poor mining results and the difficulty of obtaining a labeled corpus.
Examples
Embodiment 1
[0080] Embodiment 1 of the present invention discloses a method for semi-supervised domain word mining and classification which, as shown in Figure 1, includes the following steps:
[0081] Step 101: perform word segmentation and syntactic analysis on the text data of the domain to be processed, and obtain the word vector matrix of all words in the text data based on the word segmentation result;
[0082] Specifically, taking the field of medicine as an example, text data can be obtained from medical websites by means such as web crawlers. Text data in other domains is handled similarly: any method may be used, as long as the corresponding text data can be obtained.
[0083] After the text data is obtained, word segmentation and syntactic analysis are performed on it;
[0084] In the above step, "obtaining the word vector matrix of all words in the text data based on the word segmentation result" includes:
[0085] Obtaining the result of word segmentation of the t...
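As a rough illustration of Step 101, the sketch below builds a word vector matrix from text that has already been segmented. The source is truncated before it says how the vectors are actually obtained, so the co-occurrence counting used here is only a stand-in (trained embeddings are a common alternative). The function name `build_cooccurrence_vectors`, the `window` parameter, and the sample sentences are all hypothetical.

```python
def build_cooccurrence_vectors(sentences, window=2):
    """Return (vocab, matrix); matrix[i] is the co-occurrence vector of vocab[i].

    Assumption: co-occurrence counts stand in for whatever word vectors
    the source actually uses, which the truncated text does not specify.
    """
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for sent in sentences:
        for pos, tok in enumerate(sent):
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    # Count each neighbor that falls inside the context window.
                    matrix[index[tok]][index[sent[ctx_pos]]] += 1
    return vocab, matrix


# Hypothetical output of a word segmenter on two medical sentences.
segmented = [
    ["patient", "takes", "aspirin", "daily"],
    ["patient", "takes", "ibuprofen", "daily"],
]
vocab, matrix = build_cooccurrence_vectors(segmented, window=1)
```

In this toy input, "aspirin" and "ibuprofen" appear in identical contexts and therefore get identical rows, which is the property later similarity computations rely on.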
Embodiment 2
[0115] Embodiment 2 of the present invention discloses a device for semi-supervised domain word mining and classification which, as shown in Figure 2, includes:
[0116] An acquisition module 201, configured to perform word segmentation and syntactic analysis on the text data of the domain to be processed, and to obtain the word vector matrix of all words in the text data based on the word segmentation result;
[0117] A construction module 202, configured to start from a certain number of seed words manually constructed from the text data, expand the seed words based on their part-of-speech and syntactic composition patterns in the text data, and filter the seed words using word frequency, part of speech, and word vectors to obtain the seed vocabulary;
[0118] A generating module 203, configured to determine, for the seed vocabulary, the general similarity of any two words using word vectors, a knowledge base, statistical features, etc., and to generate a word similarity matrix with...
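The similarity computation attributed to generating module 203 can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes cosine similarity over word vectors, represents the knowledge base as a set of word pairs known to be related, and blends the two with a fixed weight. The names `cosine`, `similarity_matrix`, `related_pairs`, and `kb_weight` are hypothetical, and statistical features are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(words, vectors, related_pairs=(), kb_weight=0.3):
    """Blend word-vector similarity with a knowledge-base indicator.

    Assumption: the weighted average and the 0/1 knowledge-base signal are
    illustrative choices; the source does not state how signals are combined.
    """
    related = {frozenset(pair) for pair in related_pairs}
    n = len(words)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            vec_sim = cosine(vectors[words[i]], vectors[words[j]])
            kb_sim = 1.0 if frozenset((words[i], words[j])) in related else 0.0
            sim[i][j] = (1 - kb_weight) * vec_sim + kb_weight * kb_sim
    return sim


# Hypothetical seed vocabulary with toy two-dimensional word vectors.
words = ["aspirin", "ibuprofen", "fever"]
vectors = {"aspirin": [1.0, 0.0], "ibuprofen": [1.0, 0.0], "fever": [0.0, 1.0]}
sim = similarity_matrix(words, vectors, related_pairs=[("aspirin", "ibuprofen")])
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed to any downstream step that expects a pairwise similarity matrix.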

