Text classification method based on multiword text representation strategy

A text classification and text representation technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses the high computational cost of text classification and achieves the effect of reducing the amount of computation.

Status: Inactive; Publication Date: 2018-09-14
DONGHUA UNIV
0 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0007] A major difficulty in text classification is the high dimensionality of the feature space. Most features (i.e., terms) are irrelevant or redundant for the classification task, so it is highly desirable to reduce computational cost without sacrificing classification accuracy.

Examples

Embodiment 1

[0032] A text classification method based on a text representation strategy, comprising the steps of:

[0033] Step 1: Select a public text classification data set, use information gain (IG) to reduce the dimensionality of the text features, and test the robustness of different classification methods by removing a given percentage of non-informative single words;

[0034] Step 2: Extract multiwords from the texts in the data set and store them in the corpus;

[0035] Step 3: Process the multiwords with different text representation strategies to form the complete feature set of the text, and finally evaluate the effectiveness of the multiword representation under each strategy.

[0036] In step 1 above, Reuters-21578 is selected as the public text classification data set; it contains a total of 19403 valid texts, with an average of 5.4 sentences per text. For convenience, the texts of the four categories "grain", "crude oil", "trading" and "interest" are used as the target data set, ...
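
The information gain filtering named in step 1 can be sketched as follows. This is a minimal illustration, not the procedure of the patent: it assumes each document is reduced to a set of terms, scores each term by the information gain of the binary feature "term occurs in the document", and keeps a top fraction of the ranking. The function names, the `keep_ratio` parameter and the toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_terms(docs, labels, keep_ratio):
    """Rank the vocabulary by information gain and keep the top fraction.

    keep_ratio is a hypothetical parameter standing in for the percentage
    of terms retained after removing the non-informative ones.
    """
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


# Toy data (hypothetical, not taken from Reuters-21578).
docs = [
    {"wheat", "harvest", "the"},
    {"wheat", "export", "the"},
    {"oil", "barrel", "the"},
    {"oil", "export", "the"},
]
labels = ["grain", "grain", "crude", "crude"]

if __name__ == "__main__":
    for term in ("wheat", "export", "the"):
        print(term, round(information_gain(docs, labels, term), 3))
    print(select_terms(docs, labels, keep_ratio=0.5))
```

In this toy data a term such as "the", which occurs in every document, has zero information gain and is dropped first, while "wheat" and "oil" separate the two classes perfectly and are ranked highest.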

Embodiment 2

[0050] This embodiment is basically the same as Embodiment 1, except that:

[0051] In step 3, text classification adopts the combination strategy. In this strategy, only long multiwords are used to represent the general concept of the text, and two parameters are set to determine whether a multiword that can replace the combination of its single words appears in the document. Given a pattern p, a text t and a fixed quantity k that is independent of the lengths of p and t, a k-mismatch of p is a substring of t of length |p| that matches (|p|-k) characters of p; that is, it matches p except at k positions. The main idea of the algorithm is to set k as a dynamic cutoff threshold according to the length of the pattern p, and to find the minimum range in t within which the words of the pattern p occur. Specifically, two parameters are set to determine whether a multiword occurs in a document. The first is the occurrence ratio (OR), which is the ratio of the current number of individual phrases of pattern p to...
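
The k-mismatch idea in this paragraph can be illustrated with a naive sliding-window sketch. It assumes the common "at most k mismatches" reading of the definition and works on strings as well as on lists of word tokens. The rule in `dynamic_k`, which scales k with the pattern length through a `ratio` parameter, is a hypothetical stand-in: the source says only that k depends on the length of p, and the occurrence-ratio parameter is not reproduced here because its definition is cut off.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_positions(pattern, text, k):
    """Start indices of windows of `text` of length |pattern| that differ
    from `pattern` in at most k positions (naive O(|p| * |t|) scan)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(pattern, text[i:i + m]) <= k]


def dynamic_k(pattern, ratio=0.2):
    """Hypothetical threshold rule: allowed mismatches grow with |pattern|."""
    return int(len(pattern) * ratio)


if __name__ == "__main__":
    text = "xxabcdxxabzdxx"
    print(k_mismatch_positions("abcd", text, 0))  # exact matches only
    print(k_mismatch_positions("abcd", text, 1))  # one mismatch tolerated
    # The same functions accept word tokens instead of characters.
    words = "the price of crude oil rose".split()
    print(k_mismatch_positions(["crude", "oil"], words, 0))
```

A longer pattern receives a larger k under this rule, so a long multiword can still be detected when a few of its positions differ, while a short one must match exactly.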

Abstract

The invention relates to a text classification method based on a multiword text representation strategy. The method comprises the steps of: selecting and processing a public text classification data set; extracting multiwords from the texts in the data set and storing them in a corpus; and processing the multiwords with different text representation strategies to form the complete feature set of the text, and finally evaluating the validity of the multiword representation under the different strategies. The method uses regular expression matching to extract multiwords and repeated patterns, and reduces the amount of computation in text classification.
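
As a rough illustration of extracting repeated patterns with regular expressions, the sketch below tokenizes each text with a simple regex and keeps the word n-grams that recur across the corpus. The actual expressions used by the method are not given in this text, so the token pattern, the function name and the parameters `n` and `min_count` are assumptions.

```python
import re
from collections import Counter

# Assumed token pattern: runs of ASCII letters after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")


def repeated_multiwords(texts, n=2, min_count=2):
    """Return word n-grams that occur at least `min_count` times in `texts`."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    corpus = [
        "Crude oil prices rose.",
        "The crude oil market fell.",
        "Grain exports rose.",
    ]
    print(repeated_multiwords(corpus))
```

Only n-grams that repeat are kept as multiword candidates, which is one simple way to keep the candidate set small.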

Description

Technical field

[0001] The invention relates to the technical field of text classification, in particular to a text classification method based on a multiword text representation strategy.

Background technique

[0002] With the rapid growth of online information, text classification has become one of the key technologies for processing and organizing text data. Intelligent text classification uses supervised learning methods to assign predefined category labels to new text data after training on a set of labeled text data. Text classification requires text mining techniques, and one of the topics that supports text mining is text representation. Text representation is the process of transforming unstructured text into structured data in the form of a numerical vector that can be processed by data mining techniques, and it has a great impact on the generalization accuracy of the learning system.

[0003] Usually, the Bag-of-Words (BOW) model is used for text classification, wh...
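
To make the notion of a numerical text vector concrete, here is a minimal bag-of-words sketch. It shows the general technique only, assuming whitespace tokenization and raw term counts; the function names and example texts are hypothetical and are not taken from the patent.

```python
from collections import Counter


def build_vocabulary(texts):
    """Sorted list of the distinct lowercase words in the training texts."""
    return sorted({word for text in texts for word in text.lower().split()})


def bow_vector(text, vocab):
    """Count vector of `text` over `vocab`; out-of-vocabulary words are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


if __name__ == "__main__":
    training = ["oil prices rise", "grain prices fall"]
    vocab = build_vocabulary(training)
    print(vocab)
    print(bow_vector("oil oil prices", vocab))
```

The vector has one dimension per vocabulary word, which is why the feature space becomes very large for realistic corpora.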

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 周武能, 杜薇
Owner: DONGHUA UNIV