Text similarity computing method and device

A text similarity calculation method and device, applied in the field of text similarity calculation. It addresses problems in the related art such as dependence on training samples, impracticality for large texts or big-data environments, and low calculation efficiency, and achieves the effect of improving calculation efficiency.

Active Publication Date: 2018-01-30
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0005] However, although the first scheme above obtains the Hamming distance between two texts efficiently, it discards the semantics of the content and measures distance only at the level of text strings; when the text is short, the comparison performs poorly.
Moreover, the result of the first scheme is a distance value rather than a similarity, which is inconvenient for subsequent business processing.
Although the second scheme above can represent text semantics well through machine learning, training its model is very time-consuming and relies heavily on training samples, so calculation errors may occur even for simple sentences.
In addition, computing the cosine between high-dimensional vectors is inefficient, which makes the scheme impractical for large texts or big-data environments.
[0006] In summary, the text similarity calculation schemes in the related art have low calculation efficiency and low accuracy.
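
To make the difference between a distance and a similarity concrete, the following minimal sketch computes the Hamming distance between two fingerprints and maps it onto the range [0, 1]. The 64-bit width and the linear mapping `1 - d / bits` are assumptions chosen for illustration; the text above does not specify how a distance should be converted into a similarity.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def hamming_similarity(a: int, b: int, bits: int = 64) -> float:
    """Map a Hamming distance onto [0, 1].

    Assumption: a linear mapping over a fixed fingerprint width.
    The source does not prescribe this conversion.
    """
    return 1.0 - hamming_distance(a, b) / bits
```

Under this assumed mapping, identical fingerprints score 1.0 and fingerprints that differ in every bit score 0.0, which is easier to threshold in downstream processing than a raw bit count.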



Examples


Embodiment 1

[0062] An embodiment of the present application provides a text similarity calculation method for calculating the similarity between two texts. The texts described in this embodiment may include public opinion texts such as news, microblogs, and forum articles. The two texts being compared can be public opinion texts of the same type, for example two news texts, two microblogs, or two forum articles; they can also be of different types, for example a news text and a microblog, a microblog and a forum article, or a news text and a forum article. However, the present application is not limited to this.

[0063] In this embodiment, data of at least two objects can be extracted from each text. The data of an object includes, for example, a hash value and/or word vector data. Taking the SimHash algorithm to extract the hash value and the word2vec model...

Example 1

[0129] This example illustrates the calculation of the similarity between text A and text B. From text A, the following are extracted: the title SimHash value a1, the content SimHash value a2, the keyword SimHash value a3, the core sentence SimHash values a4, a5, and a6 (three core sentences are used as an example here), the total SimHash value a7 of the three core sentences, the VSM of the title, and the VSM of the keywords. From text B, the following are extracted: the title SimHash value b1, the content SimHash value b2, the keyword SimHash value b3, the core sentence SimHash values b4, b5, and b6 (again taking three core sentences as an example), the total SimHash value b7 of the three core sentences, the VSM of the title, and the VSM of the keywords. Here, the common objects between text A and text B are: title, content, keywords, and core sentences.
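
The determination of common objects described above can be sketched as follows. This is an illustrative sketch only: the `TextFeatures` container, the object names, and the fingerprint values in the usage note are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextFeatures:
    """Hypothetical container: one SimHash fingerprint per extracted object."""
    simhash: Dict[str, int] = field(default_factory=dict)


def common_objects(a: TextFeatures, b: TextFeatures) -> List[str]:
    """Return the objects extracted from both texts, in the order used by text a."""
    return [name for name in a.simhash if name in b.simhash]
```

For instance, if text A carries fingerprints for `title`, `content`, `keywords`, and `core_sentences` while text B lacks `keywords`, `common_objects` returns `["title", "content", "core_sentences"]`, and only those objects would be compared.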

[0130] As shown in Figure 6, the similarity calculation process between text A and text...

Example 2

[0181] This example illustrates the calculation of the similarity between text A and text B. From text A, the following are extracted: the title SimHash value a1, the content SimHash value a2, the core sentence SimHash values a4, a5, and a6 (three core sentences are used as an example here), the total SimHash value a7 of the three core sentences, and the VSM of the title. From text B, the following are extracted: the title SimHash value b1, the content SimHash value b2, the core sentence SimHash values b4, b5, and b6 (again taking three core sentences as an example), the total SimHash value b7 of the three core sentences, and the VSM of the title. Here, the common objects between text A and text B are: title, content, and core sentences.

[0182] As shown in Figure 9, the similarity calculation process between text A and text B includes the following steps:

[0183] Step 201: respectively calculate the conten...


Abstract

The invention discloses a text similarity calculation method for calculating the similarity between two texts. Data of at least two objects can be extracted from each text, where an object is a feature that reflects the semantics of the text. The method comprises the following steps: determining the shared objects of the two texts, the number of shared objects being at least two; calculating the Hamming distance of each shared object between the two texts; and, when the Hamming distance of each of at least two shared objects satisfies a first preset condition, determining the similarity between the two texts according to at least one of the following: the word vector similarity of predetermined objects among the at least two shared objects, the Hamming distance, and the spliced character string similarity. With this scheme, the efficiency and accuracy of text similarity calculation can be improved.
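
The steps in the abstract can be sketched roughly as below. Several details are assumptions, since the abstract does not fix them: the first preset condition is modeled as "Hamming distance at most `max_distance`", the final similarity uses only the Hamming distances (one of the three listed options) averaged after a linear normalization, and `None` is returned when the gate is not passed. The function name and parameters are hypothetical.

```python
from typing import Dict, Optional


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def text_similarity(fp_a: Dict[str, int], fp_b: Dict[str, int],
                    max_distance: int = 3, bits: int = 64) -> Optional[float]:
    # Step 1: determine the shared objects; the method needs at least two.
    shared = [name for name in fp_a if name in fp_b]
    if len(shared) < 2:
        return None
    # Step 2: Hamming distance of each shared object between the two texts.
    distances = {name: hamming_distance(fp_a[name], fp_b[name]) for name in shared}
    # Step 3 (assumed form of the first preset condition): at least two
    # shared objects must be within max_distance bits of each other.
    passing = [name for name, d in distances.items() if d <= max_distance]
    if len(passing) < 2:
        return None  # assumption: behavior in this case is not given in the abstract
    # Step 4 (assumed scoring): mean of linearly normalized Hamming distances.
    return sum(1.0 - distances[name] / bits for name in passing) / len(passing)
```

A fuller version would also combine word vector similarity and spliced string similarity at step 4, as the abstract allows; they are left out here because the excerpt does not describe how they are weighted.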

Description

Technical field

[0001] The invention relates to the field of data processing, in particular to a text similarity calculation method and device.

Background technique

[0002] At present, similarity calculation between texts is applied in many areas. In the related art, the following two schemes may be used for text comparison.

[0003] The first scheme is: after segmenting a long text into words, perform a hash calculation for each word and use word frequency as the weight to obtain a vector, then binarize the vector to obtain the hash value of the text. The Hamming distance is then determined from the hash values of the texts. This scheme is widely used for web page deduplication, for example by Google and Baidu.

[0004] The second scheme is: use topic models such as the document topic generation model (LDA, Latent Dirichlet Allocation) or probabilistic latent semantic analysis (PLSA, Probabilistic Latent Semantic Analysis), and use machine learning to map text into to...
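
The first scheme in the background (hash each word, weight by frequency, binarize) can be sketched as follows. This is a minimal sketch under stated assumptions: the input is already segmented into words, the per-word hash is the first 8 bytes of SHA-256, and the fingerprint is 64 bits wide. None of these choices is specified in the source, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def word_hash(word: str, bits: int = 64) -> int:
    """Assumed per-word hash: leading bytes of SHA-256 (bits must be a multiple of 8)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(words: Iterable[str], bits: int = 64) -> int:
    """Fingerprint of a segmented text, following the hash / weight / binarize steps."""
    weights = Counter(words)          # word frequency serves as the weight
    acc = [0] * bits                  # one signed accumulator per bit position
    for word, weight in weights.items():
        h = word_hash(word, bits)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):   # binarize: positive sums become 1 bits
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Two texts can then be compared by counting the differing bits of their fingerprints, for example `bin(simhash(a) ^ simhash(b)).count("1")`.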


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 刘力华
Owner: ALIBABA GRP HLDG LTD