Clustering method and system aiming at massive similar short texts


Status: Inactive; Publication Date: 2011-09-14

BEIJING UNIV OF POSTS & TELECOMM



AI Technical Summary

Problems solved by technology

The text messages in question are all originally written messages. Although their wording differs, they are highly similar because their content concerns the same topic.


Embodiment Construction

[0039] To process massive volumes of network data, the above solution must be deployed in a distributed manner. Each distributed processing node obtains data from the short text data source, extracts the trunk (main content) of each short text, and then communicates with the HASH database server, looking up the trunk in the HASH database to determine whether the short text is a repeat. The count of such short texts is updated in the local TokyoCabinet HASH table, and the processing results are passed to subsequent stages for further processing. To improve processing speed, each processing node also uses two cache structures, BUFFER_DEQUE and DB_DEQUE, as a second-level cache for the repeated-text category information held on the HASH server.
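The per-node flow in [0039] can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the trunk extraction is a placeholder normalization (the excerpt does not give the actual rule), a plain `dict` stands in for both the HASH database server and the local TokyoCabinet table, and a single bounded deque loosely plays the role of a node-side cache. It is not the BUFFER_DEQUE / DB_DEQUE design, whose details are not given here.

```python
import hashlib
import re
from collections import deque


def trunk_key(text: str) -> str:
    """Hash of a placeholder 'trunk': lowercase, letters and digits only.
    This normalization is an assumption, not the patent's extraction rule."""
    trunk = re.sub(r"[\W_]+", "", text.lower())
    return hashlib.md5(trunk.encode("utf-8")).hexdigest()


class ProcessingNode:
    """Hypothetical processing node; all names are illustrative."""

    def __init__(self, hash_server: dict, cache_size: int = 1024):
        self.hash_server = hash_server   # stand-in for the HASH database server
        self.local_counts = {}           # stand-in for the local TokyoCabinet HASH table
        self.cache_order = deque()       # node-side cache, oldest key on the left
        self.cache_keys = set()          # same keys, for O(1) membership tests
        self.cache_size = cache_size     # assumed to be >= 1
        self.server_lookups = 0          # counts simulated round trips to the server

    def _cache(self, key: str) -> None:
        # Evict the oldest key once the cache is full, then store the new one.
        if len(self.cache_order) >= self.cache_size:
            self.cache_keys.discard(self.cache_order.popleft())
        self.cache_order.append(key)
        self.cache_keys.add(key)

    def process(self, text: str) -> bool:
        """Return True if the text repeats a trunk that was already seen."""
        key = trunk_key(text)
        if key in self.cache_keys:
            repeated = True              # answered locally, no server lookup
        else:
            self.server_lookups += 1
            repeated = key in self.hash_server
            self.hash_server[key] = True  # register the trunk on the server
            self._cache(key)
        # Update the per-trunk count kept on this node.
        self.local_counts[key] = self.local_counts.get(key, 0) + 1
        return repeated
```

Calling `process` twice with texts that differ only in case or punctuation reports the second one as a repeat and answers it from the node-side cache, so the simulated server is contacted only once.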

[0040] 1. Structures requiring explanation

[0041] 1) Why the processing node maintains a cache

[0042] In order to ensure high read performance of the hash server, it is very important to l...


Abstract

The invention relates to a clustering method and system for massive similar short texts, belonging to research on repeated short text detection in the field of information technology. Owing to the particular features of short texts, applying traditional repeated-text analysis methods to them gives unsatisfactory results. By adopting a repeat analysis method based on the main content of the short text and combining related word groups, the invention can detect not only completely repeated texts but also texts with extremely high similarity. The disclosed method and system offer high processing speed and high efficiency, and handle massive data well. With this method, redundant short texts can be removed, the processing scale of the system can be greatly reduced, and hot short texts can be found to a certain extent. The method and system are therefore helpful for finding social hotspots.
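As a rough illustration of how removing redundant short texts and finding hot ones can follow from the same grouping step, the Python sketch below groups texts by a normalized key, keeps one representative per group, and ranks groups by size. The `normalize` function is a placeholder assumption (lowercasing and dropping punctuation); it is not the main-content extraction or word-group combination described in the patent, and all function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Placeholder grouping key (an assumption): lowercase, letters and digits only."""
    return re.sub(r"[\W_]+", "", text.lower())


def group_texts(texts):
    """Return (representatives, counts): the first text seen for each key,
    and how many texts fell under that key."""
    representatives = {}
    counts = Counter()
    for text in texts:
        key = normalize(text)
        representatives.setdefault(key, text)  # keep the first occurrence only
        counts[key] += 1
    return representatives, counts


def hottest(texts, n=3):
    """Return up to n (representative text, count) pairs, most frequent first."""
    representatives, counts = group_texts(texts)
    return [(representatives[key], count) for key, count in counts.most_common(n)]
```

The size of `representatives` relative to the input shows how far the processing scale shrinks, while the ranking from `hottest` is a simple proxy for hot short texts.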

Description

1. Technical field

[0001] Information technology

2. Background technology

[0002] With informatization now a worldwide development trend, the Internet is characterized by extremely wide application, the largest scale of development, and close ties to people's daily lives. On the one hand, the Internet has created huge economic and social benefits, enabling people to receive instant and up-to-date news; at the same time, it poses serious challenges to acquisition, storage, and real-time analysis and processing capabilities, and makes it harder for people to search for information accurately and reliably. On the other hand, the Internet has also brought negative effects: pornographic, reactionary, and other harmful information is widely disseminated online. The proliferation of improper activities such as spam, the use of the Internet to spread copyright-infringing material such as movies, music, and software, and even defrau...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventor: 白俊良, 陈光

Owner: BEIJING UNIV OF POSTS & TELECOMM
