
Similar text retrieval method and device

A similar-text matching technique applied in the field of computer retrieval. It addresses problems such as slow comparison speed and the large simhash edit distance (deviation) between similar texts, and achieves the effects of preventing missed detection and reducing memory space consumption.

Inactive Publication Date: 2016-11-23
ALIBABA GRP HLDG LTD
4 Cites, 15 Cited by

AI Technical Summary

Problems solved by technology

Traditional text matching methods, such as edit distance calculation and the vector space model (VSM), measure the similarity between two texts well, but on massive text collections they suffer from slow comparison speed and high space requirements.
At present, the industry generally uses fingerprint algorithms such as simhash to convert the text similarity problem into an edit distance calculation over simhash codes, thereby solving the indexing problem for massive texts. This approach works well for similarity calculation on large texts, but for medium-length or short texts (within 100 characters) it shows a very high deviation: two texts may be highly similar while the edit distance between their simhash codes is very large. It therefore cannot be applied to scenarios that require high recall.
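
As a rough illustration of the fingerprint approach described above, the following sketch computes a 64-bit simhash over word tokens and compares two fingerprints bit by bit. It is a minimal sketch under stated assumptions, not the method of the patent: the function names `simhash` and `hamming_distance`, the use of MD5 as the token hash, and the whitespace tokenization are all illustrative choices, and the distance between fingerprints is assumed here to be the Hamming distance (count of differing bits).

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)

def simhash(tokens):
    """Return a 64-bit simhash fingerprint for an iterable of string tokens."""
    weights = [0] * BITS
    for tok in tokens:
        # Hash each token to 64 bits; MD5 is used only as a convenient hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# With only a handful of tokens, one changed word can flip many bits,
# which is the short-text deviation the passage above describes.
fp_a = simhash("cheap red running shoes".split())
fp_b = simhash("cheap red running shoe".split())
distance = hamming_distance(fp_a, fp_b)
```

Because each token contributes one vote per bit, a text with few tokens has small vote margins, so changing a single token can move the fingerprint a long way.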



Examples


Embodiment Construction

[0022] In the following description, many technical details are set out to help readers better understand the application. However, those skilled in the art will understand that the technical solution claimed in each claim of the present application can be realized even without these technical details, and with various changes and modifications based on the following embodiments.

[0023] Explanation of terms:

[0024] Text similarity: a measure of whether two texts are similar in semantics or in characters. In general, the closer the meanings of the words in the texts and the more words they have in common, the higher the similarity.
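
One simple character-level measure of this kind is the Jaccard similarity over character n-grams, sketched below. This is an illustrative assumption only: the source does not say which similarity measure is used, and the names `char_ngrams` and `jaccard_similarity` are hypothetical.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams contained in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(a, b, n=2):
    """Share of n-grams common to both texts, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two texts too short to have n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# "abcd" and "abce" share the bigrams "ab" and "bc" out of four distinct ones.
score = jaccard_similarity("abcd", "abce")
```

The more n-grams two texts have in common, the closer the score is to 1.0, which matches the "more repeated words, higher similarity" intuition at the character level.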

[0025] In order to make the purpose, technical solution and advantages of the present invention clearer, the following will further describe the implementation of the present invention in detail in conjunction with the accompanying drawings.

[0026] The first embodiment of the present invention relates to a retrieval method for similar texts, ...


Abstract

The invention relates to the field of computer retrieval and discloses a retrieval method and device for similar texts. N indexes are preset for the to-be-indexed texts, corresponding respectively to N different index algorithms; each index comprises a plurality of index values, and each index value corresponds to at least one to-be-indexed text. The method comprises the following steps: calculating, with the N different index algorithms, the N index values of a text to be retrieved that correspond to the N indexes; using each of the N index values to retrieve, from the corresponding index, the to-be-indexed texts having the same index value as candidate similar documents; and performing similarity matching between the text to be retrieved and each candidate similar document one by one to obtain the closest text. Because index values generated by multiple index algorithms are used to construct multiple indexes, the probability that the candidate similar documents obtained through these indexes contain a text similar to the text to be retrieved is greatly increased, the recall rate for retrieval of short texts is relatively high, and missed retrievals can be prevented.
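
The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three index algorithms (`index_prefix`, `index_suffix`, `index_first_word`) are hypothetical stand-ins, since the source does not name the algorithms it uses, and the final one-by-one similarity match is assumed to use `difflib.SequenceMatcher` from the Python standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical index algorithms; the source does not specify which are used.
def index_prefix(text):
    return text[:3]

def index_suffix(text):
    return text[-3:]

def index_first_word(text):
    words = text.split()
    return words[0] if words else ""

INDEX_ALGORITHMS = [index_prefix, index_suffix, index_first_word]  # N = 3

def build_indexes(texts):
    """Build N indexes, each mapping an index value to the ids of texts with that value."""
    indexes = [defaultdict(set) for _ in INDEX_ALGORITHMS]
    for doc_id, text in enumerate(texts):
        for index, algorithm in zip(indexes, INDEX_ALGORITHMS):
            index[algorithm(text)].add(doc_id)
    return indexes

def retrieve(query, texts, indexes):
    """Return the indexed text closest to `query`, or None if no index matches."""
    candidates = set()
    for index, algorithm in zip(indexes, INDEX_ALGORITHMS):
        # A text becomes a candidate if it shares the index value in ANY index.
        candidates |= index.get(algorithm(query), set())
    if not candidates:
        return None
    # Match the query against each candidate one by one and keep the best.
    return max(
        (texts[i] for i in sorted(candidates)),
        key=lambda t: SequenceMatcher(None, query, t).ratio(),
    )

texts = ["cheap red shoes", "cheap red shoe", "blue winter coat"]
indexes = build_indexes(texts)
best = retrieve("cheep red shoes", texts, indexes)
```

Taking the union of candidates across indexes is what raises recall: a similar text is missed only if it disagrees with the query under every one of the N index algorithms. The expensive pairwise comparison then runs on the candidate set only, rather than on the whole collection.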

Description

Technical field

[0001] The invention relates to the field of computer retrieval, in particular to a retrieval method and device for similar texts.

Background art

[0002] Text similarity refers to a measure of the degree of matching between two or more texts. The closer the semantics or word frequencies of the texts, the higher the text similarity between them. Traditional text matching methods, such as edit distance calculation and the vector space model (VSM), measure the similarity between two texts well, but on massive text collections they suffer from slow comparison speed and high space requirements.

[0003] At present, the industry generally uses fingerprint algorithms such as simhash to convert the text similarity problem into an edit distance calculation over simhash codes, thereby solving the indexing problem for massive texts. This approach works well for similarity calculation on large texts, but for medium-length or short text (wit...


Application Information

IPC(8): G06F17/30
Inventor 郑侃侃
Owner ALIBABA GRP HLDG LTD