Mass short message information filtering method based on semantic extension

A semantic expansion and information filtering technique, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the small scale of existing training sample sets, the running-speed bottleneck of learning algorithms on large sample sets, and the inability of existing filters to cope with a changing feature space, and it improves execution efficiency.

Active Publication Date: 2013-12-18
BEIJING INSTITUTE OF TECHNOLOGY
Cites: 4; Cited by: 24

AI Technical Summary

Problems solved by technology

[0003] However, most existing information filtering technologies determine the feature space based on word frequency. This approach suits long texts, whereas information from Weibo, SMS, news comments, and similar sources takes the form of short text. Because the content is short, few effective features are available and different texts share few common features. This feature sparsity directly reduces the effectiveness of information filtering.
Second, the training sample sets used by existing information filtering technologies are relatively small, whereas short text filtering requires a much larger training sample set to keep its distribution consistent with that of the actual data. The running speed of existing learning algorithms on such large sample sets becomes a major bottleneck.
Another important issue is how the filter responds to changing data. Existing information filtering technologies either ignore data changes or handle them with incremental learning strategies, but most of these strategies operate within a fixed feature space. In practice, changes in the data more often mean that the feature space itself has changed, and existing technologies are almost powerless against such data.
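
The sparsity and fixed-feature-space problems described above can be illustrated with a small sketch. It is not taken from the patent: the whitespace tokenizer, the sample messages, and the helper names `tf_vector`, `cosine`, and `project` are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Word-frequency (bag-of-words) vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def project(text, vocabulary):
    """Map a text onto a fixed feature space; words outside it are simply lost."""
    tokens = text.lower().split()
    kept = Counter(t for t in tokens if t in vocabulary)
    dropped = sum(1 for t in tokens if t not in vocabulary)
    return kept, dropped

# Two hypothetical short messages on a similar subject with no words in common.
msg_a = "cheap loans call now"
msg_b = "low interest credit apply today"
vocabulary = set(tf_vector(msg_a)) | set(tf_vector(msg_b))

print(cosine(tf_vector(msg_a), tf_vector(msg_b)))
print(project("instant cash offer call now", vocabulary))
```

Because the two messages share no terms, their word-frequency similarity is zero even though a reader would group them together. The last call shows a later message losing three of its five words when it is projected onto the feature space fixed at training time.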



Examples


Detailed Description of the Embodiments

[0022] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below through specific embodiments in conjunction with the accompanying drawings.

[0023] As shown in Figure 1, in this embodiment the massive short text information filtering method based on semantic extension includes the following steps:

[0024] Step 1. Select data samples closely related to the information filtering task from the historical data and manually label their categories (0 for bad information, 1 for normal information) to build a training sample set. To keep the distribution of this sample set basically consistent with that of the actual data, the sample set is relatively large. Each sample in the training sample set is then expanded with context information, that is, the information of the session to which the sample belongs is introduced. The threshold of the a...
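
The context expansion in Step 1 might look roughly like the sketch below. The record layout (keys `session_id`, `text`, `label`) and the function name `expand_with_session` are hypothetical, and the threshold mentioned in the truncated sentence above is not modeled because its definition is not visible here.

```python
from collections import defaultdict

def expand_with_session(samples):
    """Attach the text of the whole session to every sample in that session.

    `samples` is a list of dicts with hypothetical keys:
      'session_id' : identifier of the conversation the message belongs to
      'text'       : the short message itself
      'label'      : 0 for bad information, 1 for normal information
    """
    # Collect the messages of each session in their original order.
    by_session = defaultdict(list)
    for sample in samples:
        by_session[sample["session_id"]].append(sample["text"])

    # Copy each sample and add the concatenated session text as its expansion.
    expanded = []
    for sample in samples:
        context = " ".join(by_session[sample["session_id"]])
        expanded.append({**sample, "expanded_text": context})
    return expanded

samples = [
    {"session_id": "s1", "text": "hello", "label": 1},
    {"session_id": "s1", "text": "win a prize", "label": 0},
    {"session_id": "s2", "text": "see you", "label": 1},
]
for row in expand_with_session(samples):
    print(row["label"], row["expanded_text"])
```

Joining the whole session is only one possible choice; a real system could limit the expansion to neighboring messages or to a time window.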


Abstract

The invention provides a method for filtering massive short text information based on semantic extension, which can solve the feature sparsity problem of short texts. The method comprises the following steps: 1, an initial training sample set is built, and each sample in the set is expanded on the basis of its context information; 2, the expanded training sample set undergoes text preprocessing; 3, a topic feature dictionary is built from the preprocessed training sample set; 4, each text in the training sample set is represented in a latent topic space; 5, an SVM (support vector machine) filter is built; 6, a text to be filtered is expanded on the basis of its context information, preprocessed into a set of feature words, represented in the latent topic space, and then filtered by the filter; and 7, new samples are collected regularly, the word probability distribution of each topic is updated within the existing latent topic space, the new samples are represented as text in that space, and the SVM filter is rebuilt.
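
Steps 4 to 6 can be sketched in a few lines, under strong assumptions. The topic-word probabilities are taken as already learned by some topic model; the EM fold-in used for the representation, the sub-gradient training of a linear SVM, and all names (`fold_in`, `train_linear_svm`, `predict`, `TOPIC_WORD`) are illustrative choices, not the patent's stated procedure. Step 7, which updates the topic-word distributions, is not covered.

```python
from collections import Counter

# Hypothetical topic-word probabilities P(word | topic), assumed already learned.
TOPIC_WORD = [
    {"cheap": 0.5, "loan": 0.5},      # topic 0
    {"meeting": 0.5, "lunch": 0.5},   # topic 1
]

def fold_in(words, topic_word, iters=50, eps=1e-9):
    """Estimate a text's topic mixture by EM, keeping the topic-word table fixed."""
    counts = Counter(w for w in words if any(w in dist for dist in topic_word))
    k = len(topic_word)
    theta = [1.0 / k] * k
    total = sum(counts.values())
    if total == 0:            # no known feature words: stay uniform
        return theta
    for _ in range(iters):
        new = [0.0] * k
        for w, c in counts.items():
            weights = [theta[j] * topic_word[j].get(w, eps) for j in range(k)]
            z = sum(weights)
            for j in range(k):
                new[j] += c * weights[j] / z
        theta = [v / total for v in new]
    return theta

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Linear SVM trained by sub-gradient descent on the regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            s = 1.0 if label == 1 else -1.0   # map labels {0, 1} to {-1, +1}
            margin = s * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - eta * lam) for wi in w]   # shrink (regularization)
            if margin < 1.0:                            # sample violates the margin
                w = [wi + eta * s * xi for wi, xi in zip(w, x)]
                b += eta * s
    return w, b

def predict(model, x):
    """Return 1 (normal information) or 0 (bad information)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

train_docs = [["cheap", "loan"], ["loan", "loan", "cheap"], ["meeting", "lunch"], ["lunch"]]
train_labels = [0, 0, 1, 1]
X = [fold_in(doc, TOPIC_WORD) for doc in train_docs]
model = train_linear_svm(X, train_labels)
print(predict(model, fold_in(["cheap", "loan", "today"], TOPIC_WORD)))
```

The filter sees only the low-dimensional topic mixture, so two texts with no words in common can still land close together when their words belong to the same topic.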

Description

Technical Field

[0001] The invention belongs to the technical field of information filtering, and in particular relates to a method for filtering massive short text information based on semantic expansion.

Background

[0002] In recent years, new media represented by the Internet and mobile phones have played an increasingly important role in people's daily life, study, and work. People can follow social hotspots and take part in public affairs through Weibo, SMS, news comments, and similar channels. The powerful communication capability of new media and their influence on public opinion are extensively and profoundly affecting all aspects of human society. However, alongside the positive development of new media, there are also negative phenomena that cannot be ignored. Some people use new media to wantonly spread reactionary speech, vulgar or false information, and the like, and some companies or individuals also take the opportunity to distribute a large number of advertisements...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 刘振岩, 王伟平, 孟丹, 王勇, 康颖
Owner: BEIJING INSTITUTE OF TECHNOLOGY