Real-time text clustering method based on Jaccard distance

A text clustering and distance technology, applied in the fields of natural language processing and big data, that addresses problems such as slow processing speed and low accuracy of real-time clustering, and improves operational efficiency, user experience, and text classification results.

Pending Publication Date: 2020-08-14
武汉烽火普天信息技术有限公司
Cites: 0; Cited by: 1

AI Technical Summary

Problems solved by technology

[0004] To overcome the above shortcomings of the prior art, the present invention proposes a real-time text clustering method based on Jaccard distance, which solves the technical problems of low real-time clustering accuracy and slow processing speed for existing massive text data.


Examples

Embodiment Construction

[0021] The following examples are presented to illustrate certain embodiments of the invention and should not be construed as limiting its scope. The disclosed content may be modified with respect to materials, methods, and reaction conditions, and all such modifications fall within the spirit and scope of the present invention.

[0022] As shown in Figure 1, a real-time text clustering method based on Jaccard distance specifically includes the following steps:

[0023] S1: Text similarity calculation: select text a and text b from the data to be clustered (news data, WeChat official account data, Weibo data, and post bar data) and calculate the Jaccard distance between text a and text b. Extract keyword sets Sa and Sb from text a and text b, the number of keywords being 35, and then calculate the size of the intersection of the two keyword sets, |A| = |Sa∩Sb|, and the size of their union, |B| = |Sa∪Sb|, where Jaccard distance (...
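
The distance formula in step S1 is cut off in this record, so the following is only a minimal sketch that assumes the standard definition, Jaccard distance = 1 - |Sa∩Sb| / |Sa∪Sb|. The function name `jaccard_distance` and the sample keywords are hypothetical, and keyword extraction itself is not shown because the record does not say how the 35 keywords are chosen.

```python
def jaccard_distance(keywords_a, keywords_b):
    """Jaccard distance between two keyword collections.

    Assumes the standard definition 1 - |A ∩ B| / |A ∪ B|; the patent's
    own formula is truncated in the source text.
    """
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty keyword sets count as identical.
        return 0.0
    intersection = set_a & set_b
    return 1.0 - len(intersection) / len(union)


# Hypothetical keyword lists standing in for Sa and Sb.
sa = ["flood", "rescue", "city"]
sb = ["flood", "rescue", "rain", "river"]
distance = jaccard_distance(sa, sb)  # 2 shared keywords out of 5 distinct ones
```

Under this definition, a distance of 0 means the two keyword sets are identical and a distance of 1 means they share no keywords.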

Abstract

The invention relates to a real-time text clustering method based on Jaccard distance. The method specifically comprises the following steps: S1, text similarity calculation: selecting two texts from the data to be clustered, extracting keywords from the two texts, calculating the intersection and union of the keyword sets of the two texts, and from these obtaining the Jaccard distance; S2, setting a hierarchical clustering threshold; S3, constructing a clustering model: sequentially reading newly loaded data, calculating the average distance between each piece of data and each class, comparing the average distance with the threshold to determine whether the data is merged into an existing class or forms a class of its own, and continuously iterating and updating; and S4, writing the clustering result of S3 into the HBase and ES databases by updating the clustering identifier, wherein data with the same clustering identifier in the ES database belongs to one class. The method enables real-time analysis of massive text data, clustering of similar texts, and effective deduplication, which improves the user experience and can also improve text classification results.
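
Steps S2 and S3 can be illustrated with a short sketch. It is not the patent's implementation: it assumes that the "average distance" to a class is the mean Jaccard distance to all current members, that an item joins the closest class when that mean is at or below the threshold, and that Jaccard distance takes its standard form. The names `assign_clusters` and `threshold`, the threshold value, and the sample data are all hypothetical.

```python
def jaccard_distance(set_a, set_b):
    """Standard Jaccard distance between two sets (assumed definition)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def assign_clusters(keyword_stream, threshold):
    """Assign each incoming keyword list to a cluster, one item at a time.

    Returns one cluster identifier per item, in arrival order.
    """
    clusters = []  # clusters[i] holds the keyword sets of cluster i
    labels = []
    for keywords in keyword_stream:
        item = set(keywords)
        best_id, best_avg = None, None
        for cluster_id, members in enumerate(clusters):
            # Mean distance from the new item to every member of this cluster.
            avg = sum(jaccard_distance(item, m) for m in members) / len(members)
            if best_avg is None or avg < best_avg:
                best_id, best_avg = cluster_id, avg
        if best_id is not None and best_avg <= threshold:
            clusters[best_id].append(item)  # close enough: join that cluster
            labels.append(best_id)
        else:
            clusters.append([item])  # too far from everything: new cluster
            labels.append(len(clusters) - 1)
    return labels


# Hypothetical keyword lists arriving in order; 0.6 is an illustrative threshold.
docs = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"], ["a", "b", "c", "d"]]
labels = assign_clusters(docs, threshold=0.6)
```

In this sketch the returned labels play the role of the clustering identifier that step S4 writes back to storage. Comparing each new item with every member of every cluster is the simplest reading of the abstract; a production system would likely cache per-cluster statistics to keep the per-item cost low.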

Description

Technical field

[0001] The invention relates to the technical fields of natural language processing and big data, and in particular to a real-time text clustering method based on Jaccard distance.

Background technique

[0002] In today's information-explosion society, massive amounts of data and information appear every day, and each topic is mentioned on different platforms or by many people at the same time, so readers encounter a great deal of repeated or similar data. This is a major obstacle to obtaining information efficiently and wastes a lot of time. Text clustering is therefore used to deduplicate massive network text data and to group similar data into categories that can be read and processed by category, which greatly improves work efficiency and saves time.

[0003] At present, the text similarity distance calculated mainly based on the bag of words model, TF-IDF, and WORD2VEC is used as the...

Application Information

IPC(8): G06F16/35, G06F40/289, G06K9/62
CPC: G06F16/35, G06F40/289, G06F18/22
Inventor: 金勇, 胡华, 孙涛
Owner: 武汉烽火普天信息技术有限公司