Short text garbage identification and modeling method and device

A short text recognition method, applied in the Internet field, that addresses the low recognition accuracy of existing garbage recognition methods, which is caused by the sparse effective feature values of short texts.

Active Publication Date: 2013-10-02
MICRO DREAM TECHTRONIC NETWORK TECH CHINACO
2 Cites, 39 Cited by

AI Technical Summary

Problems solved by technology

[0006] In practical application, however, the inventor of the present invention found that, owing to the social nature of SNS websites, the short texts on them are generally brief, so the word set extracted from such a text contains few words. The effective feature values in the resulting word feature vector are therefore very sparse; sometimes the word feature vector of a short text has only 1 or 2 effective feature values. This greatly reduces the accuracy of the class-attribution judgment; that is, the spam identification methods of the current prior art have low recognition accuracy on short text content.
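The sparsity described here can be illustrated with a minimal Python sketch. The vocabulary, the sample post, and the whitespace tokenizer are all hypothetical choices made for illustration; none of them comes from the patent, and real SNS text (for example Chinese) would need a proper word segmenter.

```python
# Hypothetical feature-word vocabulary (an assumption for illustration only).
VOCABULARY = ["free", "click", "prize", "discount", "winner", "loan", "cheap", "offer"]


def word_feature_vector(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())  # naive whitespace segmentation (assumption)
    return [1 if term in words else 0 for term in vocabulary]


def effective_count(vector):
    """Count the non-zero (effective) feature values in the vector."""
    return sum(1 for value in vector if value != 0)


short_post = "click here now"
vector = word_feature_vector(short_post)
print(vector, effective_count(vector))
```

With a three-word post, at most three positions of the vector can be non-zero regardless of how large the vocabulary is, which is the situation the paragraph describes.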



Examples


Embodiment Construction

[0068] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings and preferred embodiments. However, it should be noted that many of the details listed in the specification are only for readers to have a thorough understanding of one or more aspects of the present invention, and these aspects of the present invention can be implemented even without these specific details.

[0069] As used herein, terms such as "module" and "system" are intended to include computer-related entities such as, but not limited to, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For example, both an applicatio...


Abstract

The invention discloses a short text garbage identification and modeling method and device. The identification method comprises the following steps: word segmentation is performed on the short text to be judged to obtain a word set, and the garbage features of the short text are analyzed to obtain analysis information; the analysis information and each word in the word set are compared with the feature elements in a predetermined feature element set, and a word feature vector for the short text is generated from the feature values of the words or analysis information that match feature elements in the set; whether the short text is a garbage text is then determined from its word feature vector and a classification model. The classification model is trained in advance, using a classification algorithm selected according to the number of samples in the training set. Because the word feature vector is extended with feature values derived from the analysis information, the accuracy of identifying garbage texts is improved.
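The steps in the abstract can be sketched as follows in Python. This is only an illustration under stated assumptions: the feature words, the three analysis features (`has_url`, `has_long_digit_run`, `mostly_uppercase`), the 0/1 feature values, and the fixed weights and threshold that stand in for a pre-trained classification model are all hypothetical, since the abstract does not specify them.

```python
import re

# Hypothetical feature element set: feature words plus analysis-information elements.
FEATURE_WORDS = ["free", "click", "prize", "discount"]
ANALYSIS_FEATURES = ["has_url", "has_long_digit_run", "mostly_uppercase"]
FEATURE_ELEMENTS = FEATURE_WORDS + ANALYSIS_FEATURES

# Hypothetical weights and threshold standing in for a model trained in advance.
WEIGHTS = [1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 0.5]
THRESHOLD = 1.5


def segment(text):
    """Naive word segmentation (assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def analyze(text):
    """Derive analysis information about the text as a whole (illustrative features)."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "has_url": bool(re.search(r"https?://|www\.", text)),
        "has_long_digit_run": bool(re.search(r"\d{7,}", text)),
        "mostly_uppercase": upper_ratio > 0.7,
    }


def feature_vector(text):
    """Match words and analysis information against the feature element set."""
    words = set(segment(text))
    info = analyze(text)
    vector = []
    for element in FEATURE_ELEMENTS:
        if element in info:
            vector.append(1.0 if info[element] else 0.0)
        else:
            vector.append(1.0 if element in words else 0.0)
    return vector


def is_garbage(text):
    """Score the vector with the stand-in linear model and apply the threshold."""
    score = sum(w * v for w, v in zip(WEIGHTS, feature_vector(text)))
    return score >= THRESHOLD


print(feature_vector("Click http://example.com to win"))
print(is_garbage("Click http://example.com to win"), is_garbage("click here now"))
```

In this sketch the post containing a link gains a second non-zero entry from the analysis information, so the extended vector carries more evidence than the word matches alone would provide.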

Description

Technical field

[0001] The invention relates to the field of the Internet, and in particular to a short text garbage identification and modeling method and device.

Background

[0002] With the rapid development of Internet technology, online information has grown explosively; and as the pace of life and work accelerates, people increasingly prefer to communicate in short texts. SNS (Social Network Service) websites such as Twitter and Sina Weibo, which use short texts to produce, organize and disseminate information, have won the favor of netizens.

[0003] At present, the main approach to automatic garbage identification of short text content on the Internet is based on a classification model: a given short text is classified as junk text or non-junk text. The method comprises a training phase and a classification phase.

[0004] In the training phase, modeling is carried out based on a large n...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 姜贵彬
Owner: MICRO DREAM TECHTRONIC NETWORK TECH CHINACO