Text data analysis method and device, server and storage medium
A text data analysis technology, applied in text database clustering/classification, unstructured text data retrieval, semantic analysis, and related fields. It addresses the problem that subject words and vocabulary that are not comprehensive enough reduce the similarity between text content features and subject words, and thus reduce the accuracy of text classification. The stated effect is improved classification accuracy.
Examples
Embodiment 1
[0028] Figure 1 is a flow chart of a text data analysis method provided by Embodiment 1 of the present invention. This embodiment is applicable to classifying text, and the method can be executed by a text data analysis device. The method specifically includes the following steps:
[0029] Step 110: expand the predetermined subject words and determine the subject word vectors.
[0030] In a specific embodiment of the present invention, the subject words are a set of subject categories for the texts to be classified, such as politics, finance and economics, and education. Because many words can express the meaning of a given subject word, the subject words need to be expanded. In this embodiment, each subject word can be matched against each word in the preset corpus through semantic analysis, and the corpus words matched to each subject word can be used as the extended vocabulary of each subject t...
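The expansion in Step 110 can be illustrated with a minimal sketch. The excerpt does not state how the semantic matching is scored, so the similarity table, the `similarity` function, the threshold value, and the English example words below are all hypothetical stand-ins, not the method of the invention.

```python
# Hypothetical similarity scores standing in for the semantic-analysis step.
# A real system would obtain these from a lexical resource, not a hand-made table.
TOY_SIMILARITY = {
    ("finance", "stock"): 0.9,
    ("finance", "bank"): 0.8,
    ("finance", "school"): 0.1,
    ("education", "school"): 0.9,
    ("education", "teacher"): 0.85,
    ("education", "stock"): 0.05,
}


def similarity(subject_word, word):
    """Look up a toy similarity score; pairs not in the table score 0.0."""
    return TOY_SIMILARITY.get((subject_word, word), 0.0)


def expand_subject_words(subject_words, corpus_vocabulary, threshold=0.5):
    """Map each subject word to itself plus the corpus words that match it.

    The threshold is an assumed parameter; the source does not specify one.
    """
    expanded = {}
    for subject in subject_words:
        matches = [w for w in corpus_vocabulary if similarity(subject, w) >= threshold]
        # The subject word itself is kept as the first entry of its expanded list.
        expanded[subject] = [subject] + matches
    return expanded


subjects = ["finance", "education"]
vocabulary = ["stock", "bank", "school", "teacher"]
expanded = expand_subject_words(subjects, vocabulary)
print(expanded)
```

Each entry of `expanded` plays the role of one subject word vector in this sketch: the original subject word followed by its extended vocabulary.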
Embodiment 2
[0044] On the basis of Embodiment 1 above, this embodiment provides a preferred implementation of the text data analysis method, which can determine the training text feature vector and the test text feature vector from relatively complete subject word vectors. Figure 2 is a flow chart of a text data analysis method provided in Embodiment 2 of the present invention. As shown in Figure 2, the method includes the following specific steps:
[0045] Step 201: match each subject word against each word in a preset corpus through semantic analysis.
[0046] In a specific embodiment of the present invention, the corpus is a basic resource of language knowledge stored on a computer. It holds language material that has appeared in actual language use, and this material must be processed before it becomes a useful resource. In this embodiment, the HowNet Chinese thesaurus can be used as the corpus for expanding the subject words. Through the m...
Embodiment 3
[0078] Figure 3 is a schematic structural diagram of a text data analysis device provided by Embodiment 3 of the present invention. This embodiment is applicable to classifying texts, and the device can implement the text data analysis method described in any embodiment of the present invention. Specifically, the device includes:
[0079] The subject word vector determination module 310 is used to expand the predetermined subject words and determine the subject word vectors;
[0080] The training text feature vector determination module 320 is used to determine the training text feature vector according to the subject word vector;
[0081] The test text feature vector determination module 330 is used to convert the text to be tested into a test text feature vector according to the subject word vector;
[0082] A classification module 340, configured to classify the text to be tested according to the training text feature vector and the test text feature v...
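The four modules above can be pictured as a small pipeline. The following is only a sketch under stated assumptions: the excerpt does not say how feature vectors are computed or which classifier is used, so the per-subject word counting, the cosine similarity, the nearest-training-vector rule, and all function names and sample texts are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def text_to_feature_vector(text, subject_word_vectors):
    """Assumed feature: per subject, count the tokens found in its expanded word list."""
    counts = Counter(text.lower().split())
    return [sum(counts[w] for w in words) for words in subject_word_vectors.values()]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(test_vector, labelled_training_vectors):
    """Assumed classifier: return the label of the most similar training vector."""
    best = max(labelled_training_vectors, key=lambda item: cosine(test_vector, item[1]))
    return best[0]


# Role of module 310: subject word vectors (here given directly, already expanded).
subject_word_vectors = {
    "finance": ["finance", "stock", "bank"],
    "education": ["education", "school", "teacher"],
}

# Role of module 320: training text feature vectors.
training_texts = [
    ("finance", "the bank raised its stock forecast"),
    ("education", "the teacher opened a new school"),
]
training_vectors = [
    (label, text_to_feature_vector(text, subject_word_vectors))
    for label, text in training_texts
]

# Role of module 330: test text feature vector.
test_vector = text_to_feature_vector(
    "stock prices fell after the bank report", subject_word_vectors
)

# Role of module 340: classification from the training and test vectors.
predicted = classify(test_vector, training_vectors)
print(predicted)
```

Both the training and the test vectors are built from the same subject word vectors, which mirrors the way modules 320 and 330 each depend on the output of module 310.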


