Text classification method based on multiword text representation strategy
A text classification and text representation technology, applied in the fields of special data processing applications, instruments, and electrical digital data processing, that addresses the problem of high computing cost and achieves the effect of reducing the amount of computation.
Examples
Embodiment 1
[0032] A text classification method based on a text representation strategy, comprising the steps of:
[0033] Step 1: Select a public text classification dataset, use information gain (IG) to reduce the text dimensionality, and test the robustness of different classification methods by removing a given percentage of uninformative single words;
[0034] Step 2: Extract multiwords from the texts in the dataset and store them in the corpus;
[0035] Step 3: Process the multiwords with different text representation strategies to form a complete feature set for each text, and finally evaluate the effectiveness of the multiword representation under each strategy.
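The IG-based pruning in Step 1 can be illustrated with a minimal sketch. It assumes each document is represented as a set of tokens with one class label, and it uses the common entropy-based definition of information gain for a term's presence or absence; the function names `entropy`, `information_gain` and `prune_vocabulary`, and the `removal_fraction` parameter, are hypothetical and do not come from the patent text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus the entropy remaining after
    splitting the documents by presence/absence of the term."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def prune_vocabulary(docs, labels, removal_fraction):
    """Rank terms by IG and drop the lowest-scoring fraction.
    removal_fraction is an assumed knob standing in for the
    'percentage removed' mentioned in Step 1."""
    vocab = sorted({w for d in docs for w in d})
    ranked = sorted(vocab,
                    key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    keep = len(ranked) - int(len(ranked) * removal_fraction)
    return ranked[:keep]


# Toy data, invented for illustration only.
docs = [{"wheat", "price"}, {"wheat", "farm"},
        {"oil", "price"}, {"oil", "barrel"}]
labels = ["grain", "grain", "crude", "crude"]
kept = prune_vocabulary(docs, labels, removal_fraction=0.2)
```

In this toy data, "price" occurs equally in both classes, so it carries no information about the class and is the first term to be removed. Repeating the run with increasing values of `removal_fraction` would give the kind of robustness comparison Step 1 describes.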
[0036] In Step 1 above, Reuters-21578 is selected as the public text classification dataset; it contains a total of 19403 valid texts, with an average of 5.4 sentences per text. For convenience, the texts of the four categories "grain", "crude oil", "trading" and "interest" are used as the target data set, ...
Embodiment 2
[0050] This embodiment is basically the same as Embodiment 1; the difference is as follows:
[0051] In Step 3, text classification adopts the combination strategy. In this strategy, only long multiwords are used to represent the general concept of the text, and two parameters are set to determine whether a multiword that can replace its combination of single words appears in the document. Given a pattern p, a text t and a fixed quantity k that is independent of the lengths of p and t, a k-mismatch of p is a |p|-substring of t that matches (|p|-k) characters of p; that is, it matches p with k mismatches. The main idea of the algorithm is to set k as a dynamic cutoff threshold according to the length of the pattern p, and to find the minimum range in t within which the words of the pattern p occur. Specifically, two parameters are set to determine whether a multiword occurs in a document. The first is the occurrence ratio (OR), which is the ratio of the current number of individual words of pattern p to...
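The k-mismatch definition above can be illustrated with a minimal brute-force sketch. It assumes the usual reading of the definition (a window counts as a match when it has at most k mismatches) and compares word tokens rather than characters, since the patterns here are multiwords. The names `k_mismatch_positions` and `dynamic_k`, and the `mismatch_fraction` rule for deriving k from the pattern length, are hypothetical; the text does not specify how the dynamic threshold is computed, and the OR parameter is not modeled.

```python
def k_mismatch_positions(pattern, text, k):
    """Return the start index of every |pattern|-length window of text
    that differs from pattern in at most k positions."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits


def dynamic_k(pattern, mismatch_fraction=0.25):
    """Assumed rule: allow a fixed fraction of the pattern length to
    mismatch, so longer patterns tolerate more mismatches."""
    return int(len(pattern) * mismatch_fraction)


# Toy example, invented for illustration only.
pattern = "crude oil price index".split()
text = "the crude oil export index rose".split()
k = dynamic_k(pattern)
positions = k_mismatch_positions(pattern, text, k)
```

Here the four-word pattern tolerates one mismatch, so the window "crude oil export index" starting at token 1 is accepted even though "export" replaces "price". The scan costs O(|p|·|t|) comparisons, which is acceptable for short multiword patterns; faster k-mismatch algorithms exist but are outside the scope of this sketch.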