A method and device for retrieving similar articles
A hash-value-based technique for retrieving similar articles, applied in the Internet field, which addresses problems such as the inability to count the number of similar articles
Examples
Embodiment 1
[0088] As shown in Figure 1, this embodiment provides a method for retrieving similar articles, including:
[0089] Step S101: Obtain a target article.
[0090] In a specific implementation, the target article (that is, the given article referred to in the background art) can be any article (for example, a news item, a microblog post, or a post on a post bar).
[0091] Step S102: Calculate the simhash hash value of the target article.
[0092] In a specific implementation, simhash is an influential method that is widely used in industry for deduplicating similar articles at large scale. Specifically, simhash is a method for calculating a string hash value: it maps a string of any length to a 64-bit integer signature value. Its characteristic property is that the Hamming distance between the simhash hash values calculated from strings with similar characteristics (the number of bit positions at which the two codewords have different values is ...
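The mapping from text to a 64-bit signature and the Hamming distance between two signatures can be sketched as follows. This is a minimal illustration, not the computation specified by this embodiment: the tokenization (a plain list of tokens), the unweighted majority vote, the use of MD5 for per-token hashes, and the names `token_hash64`, `simhash`, and `hamming_distance` are all assumptions introduced here.

```python
import hashlib


def token_hash64(token):
    """Stable 64-bit hash of one token (first 8 bytes of MD5; an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens, bits=64):
    """Combine per-token hashes into one signature by a per-bit majority vote.

    Every token has weight 1 here; practical systems often weight features.
    """
    counts = [0] * bits
    for token in tokens:
        h = token_hash64(token)
        for i in range(bits):
            # A set bit votes +1 for position i, a clear bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, count in enumerate(counts):
        if count > 0:
            signature |= 1 << i
    return signature


def hamming_distance(a, b):
    """Number of bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")
```

Two articles would then be compared by calling `hamming_distance(simhash(tokens_a), simhash(tokens_b))` and checking the result against a threshold such as the value 3 used in the examples of this document.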
【Example 1】
[0125] First, search the simhash hash value table for simhash hash values whose Hamming distance to the simhash hash value of the target article is less than or equal to 3. If only one such simhash hash value is found, form the set S from this simhash hash value. Then, find the node in the parental forest data model that corresponds to the simhash hash value in the set S (for example, leaf node I). Finally, obtain the root node corresponding to leaf node I (that is, root node A). Since only one root node (root node A) is obtained from the parental forest data model based on the set S, the tree corresponding to that root node (tree 1) is used as the target tree.
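The lookup in this example can be sketched as below. The sketch rests on assumptions that are not stated in this document: the simhash hash value table is scanned linearly, the parental forest is stored as a dictionary mapping each node to its parent with a root mapping to itself, and nodes are keyed directly by their hash values. The names `near_hashes`, `find_root`, and `target_roots`, and the small bit patterns used for nodes A, I, B, and J, are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")


def near_hashes(table, target, max_dist=3):
    """Set S: every hash in the table within max_dist of the target hash."""
    return {h for h in table if hamming_distance(h, target) <= max_dist}


def find_root(parent, node):
    """Follow parent pointers upward until a node that is its own parent."""
    while parent[node] != node:
        node = parent[node]
    return node


def target_roots(table, parent, target, max_dist=3):
    """Roots of all trees that contain a hash from set S."""
    return {find_root(parent, h) for h in near_hashes(table, target, max_dist)}


# Hypothetical data: tree 1 has root A and leaf I, tree 2 has root B and leaf J.
A, I = 0b0000, 0b0001
B, J = 0xFF00, 0xFF01
parent = {A: A, I: A, B: B, J: B}
table = [I, J]          # hash values of stored articles (leaf nodes)
target = 0b0111         # hash value of the target article

S = near_hashes(table, target)
roots = target_roots(table, parent, target)
```

With this data, only leaf I lies within distance 3 of the target, so S holds one value and a single root (A) is returned, which corresponds to the single-tree case described above.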
[0126] 【Example 2】
[0127] First, search the simhash hash value table for simhash hash values whose Hamming distance to the simhash hash value of the target article is less than or equal to 3. If two simhash hash values are found, then combine the two simhash hash ...
Embodiment 2
[0152] Based on the same inventive concept, as shown in Figure 4, this embodiment provides a device for retrieving similar articles, including:
[0153] The first acquiring unit 401 is configured to acquire the target article;
[0154] The first calculation unit 402 is configured to calculate the simhash hash value of the target article;
[0155] The second obtaining unit 403 is configured to obtain the parental forest data model; the parental forest data model is composed of multiple trees, wherein each leaf node represents the simhash hash value of an article, each root node represents the simhash hash value of the entire tree, the Hamming distance between the simhash hash values of any two root nodes is greater than the preset value, and the Hamming distance between the simhash hash values of any two nodes on the same tree is less than or equal to the preset value;
[0156] The retrieval unit 404 is configured to retrieve articles similar to the target article in the parent fores...
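The two constraints that the parental forest data model places on Hamming distances can be expressed as a small consistency check. This is only an illustrative sketch: representing the forest as a dictionary from each root hash value to a list of leaf hash values, and the name `check_forest`, are assumptions made here rather than structures defined by this embodiment.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")


def check_forest(trees, preset):
    """Return True if the forest satisfies both distance constraints.

    trees: dict mapping a root hash value to the list of leaf hash values
           in that tree (an assumed representation).
    preset: the preset Hamming-distance value.
    """
    # Any two root nodes must be farther apart than the preset value.
    for a, b in combinations(list(trees), 2):
        if hamming_distance(a, b) <= preset:
            return False
    # Any two nodes on the same tree must be within the preset value.
    for root, leaves in trees.items():
        for a, b in combinations([root, *leaves], 2):
            if hamming_distance(a, b) > preset:
                return False
    return True
```

For example, `check_forest({0b0000: [0b0001, 0b0011], 0xFF00: [0xFF01]}, 3)` accepts a forest whose two roots differ in eight bits, while a tree containing both `0b0000` and `0b1111` is rejected for a preset value of 3.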