Mass text deduplication screening method and device and storage medium

A deduplication screening technique for massive text, applied in the Internet field. It addresses the mediocre accuracy and recall of existing approaches, which cause text similarity to be misjudged, and draws on the high precision of cosine similarity to achieve fast and accurate deduplication screening.

Inactive Publication Date: 2020-02-25
SUZHOU LANGDONG NET TEC CO LTD

AI Technical Summary

Problems solved by technology

[0003] Simhash comparison is relatively fast and can greatly improve overall performance in massive text deduplication tasks, but its accuracy and recall are mediocre, at about 80%; that is, the similarity of roughly 20% of texts is misjudged.




Embodiment Construction

[0029] The present invention will be described in detail below in conjunction with specific embodiments shown in the accompanying drawings. However, these embodiments do not limit the present invention, and any structural, method, or functional changes made by those skilled in the art according to these embodiments are included in the protection scope of the present invention.

[0030] As shown in Figure 1, the present invention provides a method for deduplication screening of massive texts. The method specifically includes:

[0031] Step S100: Obtain a similar text set according to the simhash value of the text to be screened.

[0032] Before the similar text set can be obtained, the simhash value of the text to be screened must be calculated.

[0033] The simhash algorithm is a locality-sensitive hashing algorithm that reduces the dimensionality of high-dimensional data, mapping it to a fixed-length fingerprint with a small number of bits (fingerprint ...
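Step S100 can be illustrated with a minimal sketch. The patent text shown here does not give the tokenizer, the hash function, the fingerprint width, or the rule for deciding that two fingerprints are "similar", so everything below is an assumption: a hypothetical regex tokenizer, MD5 as the per-token hash, a 64-bit fingerprint, and a Hamming-distance cutoff (`max_distance`) for collecting the similar text set. The names `simhash`, `hamming_distance`, and `find_similar` are illustrative only.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; the source does not specify one


def tokenize(text):
    # Hypothetical tokenizer: lowercase word tokens. A production system
    # would likely use a language-specific word segmenter instead.
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """Return a 64-bit simhash fingerprint of `text` as an int."""
    counts = {}
    for token in tokenize(text):
        counts[token] = counts.get(token, 0) + 1

    # Each token votes +weight or -weight on every bit position,
    # using its term frequency as the weight.
    votes = [0] * BITS
    for token, weight in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            votes[i] += weight if (h >> i) & 1 else -weight

    # A bit is set in the fingerprint when its vote total is positive.
    fingerprint = 0
    for i in range(BITS):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_similar(fingerprint, index, max_distance=3):
    """Return ids from `index` (id -> fingerprint) within `max_distance` bits.

    The cutoff of 3 is an assumed default, not a value from the source.
    """
    return [doc_id for doc_id, fp in index.items()
            if hamming_distance(fingerprint, fp) <= max_distance]
```

The linear scan in `find_similar` keeps the sketch short; at real scale the fingerprints would normally be bucketed so that only a small fraction of stored texts is compared.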


Abstract

The invention discloses a mass text deduplication screening method and device and a storage medium. The method comprises the steps of obtaining a similar text set according to the simhash value of a to-be-screened text; calculating the cosine similarity between the to-be-screened text and the texts in the similar text set; and judging whether the to-be-screened text needs to be filtered according to the cosine similarity. Compared with the prior art, the method integrates two text similarity algorithms: it combines the high speed of simhash with the high precision of cosine similarity to perform similarity comparison across massive texts. This achieves news deduplication and allows massive texts to be screened for duplicates rapidly and accurately.
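The second and third steps in the abstract (cosine similarity against the similar text set, then the filtering decision) might look roughly like the sketch below. It assumes plain term-frequency vectors, a hypothetical regex tokenizer, and a hypothetical threshold of 0.9; the abstract specifies none of these, and the names `term_vector`, `cosine_similarity`, and `should_filter` are illustrative. The similar text set is taken as an input, standing in for whatever the simhash step returned.

```python
import math
import re
from collections import Counter


def term_vector(text):
    # Assumed representation: raw term frequencies over lowercase word tokens.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def should_filter(candidate, similar_texts, threshold=0.9):
    """Return True if `candidate` duplicates any text in `similar_texts`.

    `threshold` is an assumed value, not one given in the source.
    """
    vec = term_vector(candidate)
    return any(cosine_similarity(vec, term_vector(t)) >= threshold
               for t in similar_texts)
```

Because the expensive cosine comparison runs only against the small candidate set rather than the whole corpus, this arrangement is consistent with the speed and precision trade-off the abstract describes.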

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method, device and storage medium for deduplication and screening of massive texts.

Background

[0002] In the Internet age, information is growing explosively. A piece of news will be reprinted, modified, and edited by various media. Text deduplication identifies similar and repeated information. Commonly used text deduplication algorithms include simhash (a type of locality-sensitive hashing), cosine similarity, etc.

[0003] Simhash comparison is relatively fast and can greatly improve overall performance in massive text deduplication tasks, but its accuracy and recall are mediocre, at about 80%; that is, the similarity of roughly 20% of texts is misjudged.

Contents of the invention

[0004] The purpose of the present invention is to provide a method, device and storage medium for deduplication ...


Application Information

IPC(8): G06F16/335, G06F16/31
CPC: G06F16/335, G06F16/325
Inventors: 陈汝龙, 戴敏, 龚朝辉
Owner: SUZHOU LANGDONG NET TEC CO LTD