Mass text deduplication screening method and device and storage medium
Patent Information
- Authority / Receiving Office
- CN · China
- Current Assignee / Owner
- SUZHOU LANGDONG NET TEC CO LTD
- Publication Date
- 2020-02-25
- Estimated Expiration
- Not applicable · inactive patent
Abstract
Description
Technical field
[0001] The invention relates to the technical field of the Internet, and in particular to a method, device, and storage medium for deduplicating and screening massive texts.
Background art
[0002] In the Internet age, information grows explosively. A single piece of news is reprinted, modified, and edited by various media outlets. Text deduplication identifies similar and repeated information. Commonly used text deduplication algorithms include simhash (a type of locality-sensitive hashing) and cosine similarity.
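For illustration only, the two algorithms named in the preceding paragraph can be sketched as follows. This is a minimal sketch and not the method claimed by the invention. The whitespace tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and all function names are assumptions made for this example.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # assumed fingerprint width


def tokenize(text):
    # Assumption: plain whitespace tokenization; a production system would
    # normally use proper word segmentation and term weighting.
    return text.lower().split()


def simhash(text):
    """Return a 64-bit simhash fingerprint of the text."""
    weights = [0] * BITS
    for token, count in Counter(tokenize(text)).items():
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +count or -count on every bit position.
            if (h >> i) & 1:
                weights[i] += count
            else:
                weights[i] -= count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In such a sketch, two texts would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is a tuning parameter and is not specified in this paragraph. Comparing fingerprints costs one XOR and a bit count per pair, whereas cosine similarity requires building and comparing full term vectors.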
[0003] simhash compares texts relatively quickly and can greatly improve overall performance in massive text deduplication tasks, but its accuracy and recall are only average, at about 80%; that is, the similarity of about 20% of texts is judged incorrectly.
Summary of the invention
[0004] The purpose of the present invention is to provide a method, device and storage medium for deduplication ...