Method, device and storage medium for deduplication screening of massive texts

A deduplication technology for massive text collections, applied in the Internet field. It addresses problems such as the mediocre accuracy and recall of existing methods and the misjudgment of text similarity, and it achieves fast, accurate deduplication screening by relying on the high precision of cosine similarity.
CN110837555A · Inactive · Publication Date: 2020-02-25 · SUZHOU LANGDONG NET TEC CO LTD

Patent Information

Authority / Receiving Office
CN · China
Current Assignee / Owner
SUZHOU LANGDONG NET TEC CO LTD
Publication Date
2020-02-25
Estimated Expiration
Not applicable · inactive patent

Abstract

The invention discloses a method, device and storage medium for deduplication screening of massive texts. The method comprises: obtaining a similar text set according to the simhash value of a text to be screened; calculating the cosine similarity between the text to be screened and the texts in the similar text set; and judging, according to the cosine similarity, whether the text to be screened needs to be filtered. Compared with the prior art, the method integrates two text similarity algorithms. It uses the high speed of simhash and the high precision of cosine similarity to compare similarity across massive texts, so that duplicate news is removed and massive texts can be screened for duplicates quickly and accurately.
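The two-stage flow in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the whitespace/word tokenizer, the MD5-based token hash, the 64-bit fingerprint width, the linear scan over the corpus, and the thresholds `max_hamming=3` and `min_cosine=0.9` are all assumptions that do not appear in the source.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple word tokens; the source does not specify a tokenizer.
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    # Sum signed, frequency-weighted token hashes per bit, then keep the sign.
    weights = [0] * bits
    for token, count in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def cosine(text_a, text_b):
    # Cosine similarity over term-frequency vectors.
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_filter(candidate, corpus, max_hamming=3, min_cosine=0.9):
    # Stage 1: cheap simhash comparison selects the similar text set.
    fp = simhash(candidate)
    similar = [doc for doc in corpus if hamming(fp, simhash(doc)) <= max_hamming]
    # Stage 2: the more precise cosine check decides whether to filter.
    return any(cosine(candidate, doc) >= min_cosine for doc in similar)
```

In a real deployment the fingerprints of stored texts would presumably be precomputed and indexed rather than recomputed in a linear scan; the scan here only keeps the sketch short.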

Description

technical field

[0001] The invention relates to the technical field of the Internet, and in particular to a method, device and storage medium for deduplication screening of massive texts.

Background art

[0002] In the Internet age, information grows explosively. A single piece of news is reprinted, modified, and edited by various media outlets. Text deduplication identifies similar and duplicate information. Commonly used text deduplication algorithms include simhash (a form of locality-sensitive hashing) and cosine similarity.
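For reference, the two measures named in this paragraph are usually defined as below. The symbols are assumptions introduced here for illustration and do not come from the source: a and b are n-dimensional term-weight vectors of two texts, and x and y are k-bit simhash fingerprints whose similarity is commonly judged by Hamming distance.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}},
\qquad
d_H(x, y) = \sum_{j=1}^{k} \left[ x_j \neq y_j \right]
```

A cosine value near 1 indicates nearly identical term distributions, while a small Hamming distance indicates fingerprints that differ in only a few bits.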

[0003] Simhash comparison is relatively fast and can greatly improve overall performance in massive text deduplication tasks, but its accuracy and recall are only average, at about 80%; that is, the similarity of about 20% of texts is misjudged.

Summary of the invention

[0004] The purpose of the present invention is to provide a method, device and storage medium for deduplication ...

Claims
