Text categorization method and device

A text classification technology, applied in the field of Internet information, that addresses the low accuracy of machine classification and improves the accuracy of junk text recognition.

Active Publication Date: 2014-01-15
BEIJING BAIDU NETCOM SCI & TECH CO LTD
4 Cites, 44 Cited by

AI Technical Summary

Problems solved by technology

1. On the one hand, word segmentation generally filters out punctuation marks and does not return them as segmentation results, so junk texts mixed with punctuation marks cannot be judged. On the other hand, punctuation marks and stop words do not reflect semantics and appear with similar frequency in normal text and spam text, so they cannot effectively support the posterior probability, which reduces the accuracy of machine classification.
2. Texts whose main components are URL links, QQ numbers, mobile phone numbers, and the like are classified poorly, because word segmentation cannot extract valid text content from them, so the accuracy rate is low.
3. Meaningless answers are judged poorly. For example, when users cheat through avatar ads, they post large numbers of comments such as "good experience" and "good effect, very good".
When a large number of such texts appear in the junk text training corpus, they also affect the classification of normal comments, which lowers accuracy.


Examples

Embodiment 1

[0074] Figure 1 is a flowchart of the text classification method provided in this embodiment. As shown in Figure 1, the method includes:

[0075] S101. Replace each character in the text to be processed, other than words and numbers, with a preset fixed character string.

[0076] First, the specific symbols in the text to be processed are escaped: English symbols such as "-_`~#$%^&*()+=|\", Chinese symbols such as "《》¥()——·?", the characters "\n\t\r\n", and spaces are replaced with the fixed character string.

[0077] The fixed character string may be, but is not limited to, a single character repeated to form a string longer than one character, for example the fixed string "$$$$" formed from four "$" characters. The fixed string "$$$$" replaces every character in the text to be processed other than words and numbers. For example, the pending text " > > " / " is replaced using the fixed string "$$$$", and bec...
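The replacement described in step S101 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `replace_symbols` is hypothetical, and the sketch assumes that "words and numbers" means any Unicode letter or digit (as reported by `str.isalnum`), so Chinese characters are kept.

```python
FIXED_STRING = "$$$$"  # example fixed string given in the description


def replace_symbols(text: str, fixed: str = FIXED_STRING) -> str:
    """Replace every character that is not a letter or digit with `fixed`.

    Assumption: str.isalnum() decides what counts as a word or number
    character; the patent text does not define this precisely.
    """
    return "".join(ch if ch.isalnum() else fixed for ch in text)


if __name__ == "__main__":
    # Each punctuation mark or space becomes one copy of the fixed string.
    print(replace_symbols("good!!"))
    print(replace_symbols("好 评1"))
```

Because each symbol expands into a multi-character string, texts that are heavy in punctuation grow much longer after replacement, which is what makes the later length ratio informative.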

Embodiment 2

[0095] Figure 2 is a flowchart of the text classification method provided in this embodiment. As shown in Figure 2, the method includes:

[0096] Step S201: Replace the characters other than words and numbers in the text to be processed with a preset fixed character string.

[0097] This step is the same as step S101 in the first embodiment, and will not be repeated here.

[0098] Step S202: Count the total length of the text after replacement and the length of the words it contains, and calculate the text ratio weight from the ratio of the word length to the total text length.

[0099] The ratio K of the word length to the total text length is calculated in the same way as in step S102 of the first embodiment, that is, K = L_CHAR / L_ORIG.
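The ratio K = L_CHAR / L_ORIG can be illustrated with a short Python sketch. Several points here are assumptions, not statements from the patent: L_ORIG is taken to be the length of the text after replacement (as the abstract suggests), L_CHAR counts only letters (`str.isalpha`), so digits do not count as words, and the function name `word_ratio` is hypothetical.

```python
def word_ratio(text: str, fixed: str = "$$$$") -> float:
    """Return K = L_CHAR / L_ORIG for one text.

    Assumptions: L_ORIG is the length after replacing every non-letter,
    non-digit character with `fixed`; L_CHAR counts letters only.
    """
    # Length of the text after symbol replacement.
    l_orig = sum(1 if ch.isalnum() else len(fixed) for ch in text)
    # Number of word characters (letters, including Chinese characters).
    l_char = sum(1 for ch in text if ch.isalpha())
    # An empty text has no words; return 0.0 rather than divide by zero.
    return l_char / l_orig if l_orig else 0.0


if __name__ == "__main__":
    print(word_ratio("good!!"))   # 4 letters over 12 characters after replacement
    print(word_ratio("!!!???"))   # no letters at all
```

Under these assumptions a text made only of words yields K = 1, while a text dominated by symbols or digits yields a K close to 0.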

[0100] The text ratio weight Score_char can be calculated from the ratio of the word length to the total text length using, but not limited to, the following formula:

[0101] Score_char = 2 ...

Embodiment 3

[0127] In this embodiment, the Bayes dictionary, Fisher dictionary, user name dictionary, and IP dictionary are built in advance by offline dictionary generation. The specific construction method includes:

[0128] Step S301: Obtain a sample corpus that includes normal text and junk text.

[0129] The sample corpus may draw on existing historical data of a certain scale, formed from the texts, comments, or replies accumulated on the network that were submitted under different user names or IP addresses.

[0130] The normal text and junk text may be obtained by applying an existing classification method or by manual labeling, which distinguishes the texts in the sample corpus that an administrator or other users have marked as junk from the unmarked normal texts.

[0131] Step S302: Perform word segmentation on the text in the sample corpus, count each term, calculate the probability that each term belongs to normal text and to junk text, ...
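The counting part of step S302 might look like the Python sketch below. It is only an illustration under stated assumptions: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), the per-term probabilities are simple occurrence fractions, and the function name `build_term_dictionary` and the output layout are hypothetical. The patent's actual Bayes and Fisher dictionary formulas are not shown in the excerpt above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_term_dictionary(
    samples: Iterable[Tuple[str, bool]],
) -> Dict[str, Dict[str, float]]:
    """Count terms in labeled samples and estimate per-term class fractions.

    `samples` yields (text, is_junk) pairs. Whitespace splitting is a
    placeholder for word segmentation (an assumption of this sketch).
    """
    normal_counts: Counter = Counter()
    junk_counts: Counter = Counter()
    for text, is_junk in samples:
        terms = text.lower().split()
        (junk_counts if is_junk else normal_counts).update(terms)

    dictionary: Dict[str, Dict[str, float]] = {}
    for term in set(normal_counts) | set(junk_counts):
        n, j = normal_counts[term], junk_counts[term]
        total = n + j  # always >= 1 because the term was seen at least once
        dictionary[term] = {"p_normal": n / total, "p_junk": j / total}
    return dictionary


if __name__ == "__main__":
    corpus = [
        ("cheap pills", True),
        ("cheap flights today", False),
        ("cheap pills now", True),
    ]
    for term, probs in sorted(build_term_dictionary(corpus).items()):
        print(term, probs)
```

Generating the dictionary offline, as the embodiment describes, means the online classifier only needs lookups at prediction time.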

Abstract

The invention provides a text categorization method and device. The method comprises the steps of replacing characters, except words and numbers, in texts to be processed with preset fixed strings, determining the total length of the texts after replacement and the length of the words contained in the texts and calculating the ratio of the length of the words to the total length of the texts, calculating cheating characteristic indexes of the texts to be processed according to the ratio of the length of the words to the total length of the texts, and determining that the texts to be processed with the cheating characteristic indexes exceeding a preset threshold are garbage texts. The text categorization method and device can effectively make up for the deficiency of existing machine learning methods and improve the accuracy of categorization.
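The overall flow summarized in the abstract (replace, measure the ratio, derive a cheating characteristic index, compare with a threshold) can be sketched end to end in Python. The index formula used here, `1 - K`, and the threshold value `0.5` are placeholders chosen for illustration only; they are not the patent's formula or values, and the function name `is_junk` is hypothetical. Letters are again assumed to be the only "word" characters.

```python
from typing import Callable


def is_junk(
    text: str,
    threshold: float = 0.5,  # placeholder threshold, not from the patent
    index_from_ratio: Callable[[float], float] = lambda k: 1.0 - k,  # placeholder
    fixed: str = "$$$$",
) -> bool:
    """Flag a text as junk when its cheating index exceeds `threshold`."""
    # Total length after replacing each non-letter, non-digit character.
    l_orig = sum(1 if ch.isalnum() else len(fixed) for ch in text)
    # Word length: letters only (an assumption of this sketch).
    l_char = sum(1 for ch in text if ch.isalpha())
    k = l_char / l_orig if l_orig else 0.0
    return index_from_ratio(k) > threshold


if __name__ == "__main__":
    for sample in ["hello world", "!!!???", "call 13800000000 now!!!"]:
        print(repr(sample), is_junk(sample))
```

Passing the index function as a parameter keeps the sketch usable with whatever scoring formula is actually chosen.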

Description

【Technical Field】

[0001] The present invention relates to the field of Internet information technology, and in particular to a text classification method and device.

【Background Art】

[0002] With the continuous development of the Internet, more and more users use it for information exchange and resource sharing, and the amount of network information keeps increasing. However, the openness of the Internet has also led to a great deal of harmful information appearing on it. Monitoring, filtering, and classifying Internet information has therefore become a common need.

[0003] Comments (also called messages, replies, etc.) are an important function of Internet community products and an important channel for building an interactive atmosphere around a product. Because comments are cheap to publish, reach a wide audience, and have a long-lasting effect, the comment function has been plagued by spam from the beginning, including various advertising links, promotional information, and por...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/353
Inventor: 程童
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD