Clustering method and system aiming at massive similar short texts

Short-text and text-processing technology, applied to clustering methods and systems for massive similar short texts

Inactive Publication Date: 2011-09-14
BEIJING UNIV OF POSTS & TELECOMM
Cites: 0; Cited by: 32

AI Technical Summary

Problems solved by technology

These text messages are all original messages. Although their wording differs, they are highly similar because they concern the same topic.

Examples

Embodiment Construction

[0039] To process massive network data, the above solution must be deployed in a distributed manner. Each distributed processing node obtains data from the short text data source, extracts the short text trunk, and then communicates with the HASH database server, looking up the trunk in the HASH database to determine whether the short text is a repeat. The count of such short texts is updated in the local TokyoCabinet HASH table, and the processing results are passed to subsequent processes for further processing. To improve processing speed, each processing node also uses two cache structures, BUFFER_DEQUE and DB_DEQUE, as a secondary cache for the repeated-text category information held on the HASH server.
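The flow in [0039] can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the patent's implementation: `extract_trunk` is a hypothetical placeholder for trunk extraction, `HashStore` is an in-memory stand-in for the HASH database server, a plain dictionary stands in for the local TokyoCabinet HASH table, and a single LRU cache stands in for the node-side caching (the actual roles of BUFFER_DEQUE and DB_DEQUE are not described in the text available here).

```python
import re
from collections import OrderedDict


def extract_trunk(text: str) -> str:
    """Hypothetical trunk extraction: lowercase, keep only letters and digits."""
    return re.sub(r"[\W_]+", "", text.lower())


class HashStore:
    """In-memory stand-in for the shared HASH database server (assumption)."""

    def __init__(self):
        self._trunks = set()
        self.lookups = 0  # number of round trips made to the "server"

    def check_and_add(self, trunk: str) -> bool:
        """Return True if the trunk was already stored, then record it."""
        self.lookups += 1
        seen = trunk in self._trunks
        self._trunks.add(trunk)
        return seen


class ProcessingNode:
    """One distributed node with a small LRU cache in front of the store."""

    def __init__(self, store: HashStore, cache_size: int = 1024):
        self.store = store
        self.cache_size = cache_size
        self.cache = OrderedDict()  # trunks this node knows are in the store
        self.local_counts = {}      # stand-in for the local hash table of counts

    def process(self, text: str) -> bool:
        """Return True if the short text repeats an earlier one."""
        trunk = extract_trunk(text)
        if trunk in self.cache:
            # Cache hit: no server round trip is needed.
            self.cache.move_to_end(trunk)
            repeated = True
        else:
            repeated = self.store.check_and_add(trunk)
            self.cache[trunk] = True
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
        self.local_counts[trunk] = self.local_counts.get(trunk, 0) + 1
        return repeated


store = HashStore()
node_a = ProcessingNode(store, cache_size=2)
node_b = ProcessingNode(store, cache_size=2)
first = node_a.process("Hello, world!")
second = node_a.process("hello world")      # answered from node_a's cache
third = node_b.process("HELLO WORLD!!")     # node_b must ask the store
```

In this sketch the cache only saves lookups for trunks the node has seen recently; a different node still has to consult the shared store once, which is why the two nodes above produce two store lookups for three texts.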

[0040] 1. Notes on the structure

[0041] 1) Why each processing node maintains a cache

[0042] In order to ensure high read performance of the hash server, it is very important to l...

Abstract

The invention relates to a clustering method and system for massive similar short texts, and belongs to research on repeated short text detection in the field of information technology. Because of the particular features of short texts, traditional repeated-text analysis methods give unsatisfactory results when applied to them. By adopting a repetition analysis method based on the main content of the short text, combined with related word groups, the invention can detect not only completely repeated texts but also texts with extremely high similarity. The disclosed method and system offer high processing speed and high efficiency and are well suited to processing massive data. With this method, redundant short texts can be removed, the processing scale of the system can be greatly reduced, and hot short texts can be found to a certain extent. The disclosed method and system are therefore helpful for finding social hotspots.

Description

1. Technical field

[0001] Information technology

2. Background technology

[0002] With informatization now a worldwide development trend, the Internet is characterized by extremely wide application, the largest scale of development, and close ties to people's daily lives. On the one hand, the Internet has created huge economic and social benefits, enabling people to receive instant and up-to-date news; at the same time, it poses serious challenges to acquisition, storage, and real-time analysis and processing capabilities, and makes it harder for people to search for information accurately and reliably. On the other hand, the Internet has also brought negative effects: pornographic, reactionary, and other harmful information is widely disseminated online, and improper activities proliferate, such as spam, the use of the Internet to distribute copyright-infringing movies, music, and software, and even defrau...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 白俊良, 陈光
Owner: BEIJING UNIV OF POSTS & TELECOMM