A method and device for retrieving similar articles

The article hash-value technique, applied in the Internet field, solves problems such as the inability to calculate the number of similar articles.

Active Publication Date: 2022-04-26
ADVANCED NEW TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] The embodiment of the present invention provides a method and device for retrieving similar articles, which solve the technical problem in the prior art that, given an article, the number of articles similar to it cannot be calculated based on simhash and those similar articles cannot be retrieved.

Examples

Embodiment 1

[0088] As shown in Figure 1, this embodiment provides a method for retrieving similar articles, including:

[0089] Step S101: Obtain a target article.

[0090] In a specific implementation, the target article (that is, the given article referred to in the background) can be any article (for example, a news item, a microblog post, or a forum post).

[0091] Step S102: Calculate the simhash hash value of the target article.

[0092] In a specific implementation, simhash is an influential method that is widely used in industry for deduplicating similar articles at large scale. Specifically, simhash is a method for calculating a string hash value: it maps a string of any length to a 64-bit integer signature value. Its characteristic is that, for simhash hash values calculated from strings with similar features, the Hamming distance (the number of bit positions at which the two codewords have different values) is ...
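The mapping described above can be illustrated with a minimal Python sketch. The details here are assumptions made for illustration and are not taken from the patent text: tokens are obtained by whitespace splitting, every token has weight 1, and the per-token 64-bit hash is derived from MD5. The function names `token_hash64`, `simhash64`, and `hamming_distance` are hypothetical.

```python
import hashlib


def token_hash64(token: str) -> int:
    """Hypothetical per-token hash: the first 8 bytes of MD5 as a 64-bit integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(text: str) -> int:
    """Map a string of any length to a 64-bit signature (unit token weights assumed)."""
    counts = [0] * 64
    for token in text.lower().split():
        h = token_hash64(token)
        for bit in range(64):
            # Each token votes +1 where its hash bit is set and -1 where it is clear.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    signature = 0
    for bit in range(64):
        if counts[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two signatures differ."""
    return bin(a ^ b).count("1")
```

Two articles would then be compared with `hamming_distance(simhash64(text_a), simhash64(text_b))`; a production system would more likely use word segmentation and term weights rather than whitespace splitting and unit weights.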

【Example 1】

[0125] First, search the simhash hash value table for simhash hash values whose Hamming distance to the simhash hash value of the target article is less than or equal to 3. If only one simhash hash value is found, that value forms the set S. Next, find the node in the parental forest data model that corresponds to the simhash hash value in the set S (for example, leaf node I). Finally, obtain the root node corresponding to leaf node I (namely, root node A). Because only one root node (root node A) is obtained from the parental forest data model based on the set S, the tree corresponding to that root node (tree 1) is used as the target tree.
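The lookup in Example 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the simhash hash value table is modeled as a plain list scanned linearly, and the forest is modeled as a hypothetical `parent` dictionary that maps each node's simhash value to its parent's value (`None` for a root). All function and parameter names are invented for the sketch.

```python
from typing import Dict, Iterable, Optional, Set


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two signatures differ."""
    return bin(a ^ b).count("1")


def find_candidates(target_hash: int, hash_table: Iterable[int],
                    max_distance: int = 3) -> Set[int]:
    """Build the set S: table entries within max_distance of the target (linear scan)."""
    return {h for h in hash_table if hamming_distance(h, target_hash) <= max_distance}


def find_root(node: int, parent: Dict[int, Optional[int]]) -> int:
    """Walk parent links (an assumed representation) up to the root of the node's tree."""
    while parent[node] is not None:
        node = parent[node]
    return node


def target_roots(target_hash: int, hash_table: Iterable[int],
                 parent: Dict[int, Optional[int]], max_distance: int = 3) -> Set[int]:
    """Root nodes reached from every member of S; one root means one target tree."""
    candidates = find_candidates(target_hash, hash_table, max_distance)
    return {find_root(h, parent) for h in candidates}
```

When `target_roots` returns a single root, as in Example 1, the tree under that root is the target tree. A linear scan is used only for clarity; a large table would need an index over the hash values.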

[0126] 【Example 2】

[0127] First, search the simhash hash value table for simhash hash values whose Hamming distance to the simhash hash value of the target article is less than or equal to 3. If two simhash hash values are found, then combine the two simhash hash ...

Embodiment 2

[0152] Based on the same inventive concept, as shown in Figure 4, this embodiment provides a device for retrieving similar articles, including:

[0153] The first acquiring unit 401 is configured to acquire the target article;

[0154] The first calculation unit 402 is used to calculate the simhash hash value of the target article;

[0155] The second obtaining unit 403 is used to obtain the parental forest data model. The parental forest data model is composed of multiple trees, wherein each leaf node represents the simhash hash value of an article and each root node represents the simhash hash value of the entire tree; the Hamming distance between the simhash hash values of any two root nodes is greater than a preset value, and the Hamming distance between the simhash hash values of any two nodes on the same tree is less than or equal to the preset value;

[0156] The retrieval unit 404 is configured to retrieve articles similar to the target article in the parent fores...
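The two distance constraints stated for the parental forest data model can be expressed as a small validation routine. The Python sketch below is an assumption-laden illustration: the forest is represented as a hypothetical dictionary mapping each root's simhash value to the list of the other node values in its tree, and `check_forest` is an invented name. It does not show how the patent builds the forest, only how the stated properties could be checked.

```python
from itertools import combinations
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two signatures differ."""
    return bin(a ^ b).count("1")


def check_forest(trees: Dict[int, List[int]], preset: int) -> bool:
    """Return True if the forest satisfies both stated distance constraints.

    `trees` maps a root simhash value to the other simhash values in that tree
    (an assumed, flattened representation of each tree).
    """
    # Constraint 1: any two root nodes are more than `preset` bits apart.
    for root_a, root_b in combinations(trees, 2):
        if hamming_distance(root_a, root_b) <= preset:
            return False
    # Constraint 2: any two nodes on the same tree are at most `preset` bits apart.
    for root, members in trees.items():
        nodes = [root] + list(members)
        for node_a, node_b in combinations(nodes, 2):
            if hamming_distance(node_a, node_b) > preset:
                return False
    return True
```

For instance, `check_forest({0x00FF: [0x00FE, 0x00FD], 0xFF00: [0xFF01]}, preset=3)` checks a forest of two trees against a preset value of 3.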

Abstract

The invention discloses a method for retrieving similar articles, comprising: obtaining a target article; calculating a simhash hash value of the target article; obtaining a parental forest data model; and retrieving, in the parental forest data model, articles similar to the target article based on the simhash hash value of the target article. Because the parental forest data model gathers similar articles in one tree and returns the same simhash hash value for them, similar articles can be conveniently deduplicated when using the simhash method, so that the number of similar articles can be calculated and the similar articles retrieved. This solves the technical problem in the prior art that, given an article, the number of articles similar to it cannot be calculated based on simhash and those similar articles cannot be retrieved.

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method and device for retrieving similar articles.

Background

[0002] A basic product function of a public opinion platform is to show articles related to a keyword specified by the user. Over a period of time, a certain event becomes a hot topic, and different media sites respond to it (for example: writing articles, reprinting articles, citing articles, borrowing from articles, and even plagiarizing articles). As a result, a large number of articles describing the event appear on media sites across the Internet.

[0003] Because of reprinting and mutual borrowing, many articles describing the event are similar. When browsing articles about this event on the public opinion platform, users have two requirements: 1) On the article list page, they want to see diverse...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/2458
Inventor: 梁忠平谢巍赵剑波杨棋张宏毅黄进龚健雷宁孙坤建
Owner: ADVANCED NEW TECH CO LTD