Text similarity distinguishing method and apparatus

A text similarity discrimination method, applied in the Internet field, that addresses the high misjudgment rate and heavy workload of existing approaches.

Active Publication Date: 2017-08-22
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0003] However, the inventors of the present invention have found that in the prior art, there are at least the following problems in judging the repetition of a large number of texts: the misjudgment rate of the results after word segmentation and sentence segmentation is high and ...

Examples

Embodiment 1

[0101] The present invention provides a method for judging text similarity. As shown in Figure 1, the method includes at least the following steps:

[0102] S101: Acquire the text to be tested.

[0103] Text refers to the written form of language. From a literary perspective, it is usually a sentence or a combination of multiple sentences carrying a complete and systematic meaning (message). A text can be a sentence, a paragraph, or a chapter (discourse). In the broad sense, "text" is any utterance fixed in writing. In the narrow sense, "text" is a literary entity composed of language and characters, that is, a "work", which constitutes an independent, self-contained system relative to the author and the world.

[0104] Text is mainly used to record and store textual information, rather than images, sounds, or formatted data. Common text document extensions include .txt, .doc, .docx, and .wps.

[0105] The text to be tested in this application can contain one or...

Embodiment 2

[0182] As shown in Figure 5, the present invention provides another method for judging text similarity, including:

[0183] S501: Write data to the full database; the full database is used to store the mapping relationship between at least one sentence of the first text and the first text name; wherein each sentence in the full database corresponds to a unique first text name.

[0184] The writing of data to the full database includes:

[0185] Obtain at least one first text;

[0186] Parse the first text, and extract sentences in the first text;

[0187] Query the sentence in the first text in the full database;

[0188] If found, delete the relevant record of the sentence from the full database;

[0189] If not found, the mapping relationship between the sentence and the name of the first text corresponding to the sentence is stored in the full database.
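The write procedure in [0185] to [0189] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the full database is modeled as an in-memory dict, and the names `split_sentences` and `write_full_database`, the punctuation-based sentence splitter, and the de-duplication of sentences within a single text are all hypothetical choices that the visible text does not specify.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption): break on ., !, ? and CJK equivalents."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def write_full_database(full_db, text_name, text):
    """Apply steps [0185]-[0189] for one first text.

    full_db is a plain dict mapping sentence -> first text name
    (a stand-in for the full database).
    """
    # De-duplicate sentences within the same text while keeping order
    # (assumption: a sentence repeated inside one text counts once).
    for sentence in dict.fromkeys(split_sentences(text)):
        if sentence in full_db:
            # [0188] Found: the sentence is not unique, so drop its record.
            del full_db[sentence]
        else:
            # [0189] Not found: store sentence -> first text name.
            full_db[sentence] = text_name
    return full_db


db = {}
write_full_database(db, "A", "Hello world. Unique to A.")
write_full_database(db, "B", "Hello world. Unique to B.")
```

After both calls, `db` holds only the sentences that occur in exactly one of the two texts. Read literally, this sketch keeps no memory of deleted sentences, so a sentence appearing in a third text would be stored again; the truncated portions of the source may handle that case differently.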

[0190] S502: Write data to the single database of each first text.

[0191] The writing data into the single database of each fi...

Embodiment 3

[0225] According to an embodiment of the present invention, a device for implementing the above text similarity judgment method is also provided. Figure 6 is a schematic diagram of a text similarity judging device according to an embodiment of the present invention. As shown in Figure 6, the device includes:

[0226] The to-be-tested text obtaining module 10 is used to obtain the text to be tested.

[0227] The to-be-tested text sentence extraction module 20 is used to parse the text to be tested and extract at least part of its sentences.

[0228] The query module 30 is used to query the at least part of the sentences of the text to be tested in a pre-established full database. The full database stores the mapping relationship between at least one sentence of the first text and the name of the first text, and each sentence in the full database corresponds to a unique first text name.

[0229] The similarity discrimination module 40 is used...

Abstract

The invention discloses a text similarity distinguishing method and apparatus. The text similarity distinguishing method comprises the steps of obtaining a to-be-tested text; analyzing the to-be-tested text and extracting sentences of at least one part of the to-be-tested text; searching the sentences of at least one part of the to-be-tested text from a pre-established full-quantity database; and generating similarity between the to-be-tested text and a first text according to a searching result. The full-quantity database provided by the invention stores a mapping relation between the sentence of at least one first text and a first text name; and each sentence in the full-quantity database is corresponding to a unique first text name. By ensuring the one-to-one correspondence relation between the sentences stored in the full-quantity database and the first texts, a unique matching result can be obtained in searching sentences from the full-quantity database. Sentences corresponding to more than one first text at the same time are eliminated from the full-quantity database, so that sentence hit rate and searching speed of the target first text are improved.
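The lookup and scoring steps summarized in the abstract might look like the Python sketch below. The abstract only states that similarity is generated "according to a searching result", so the scoring rule used here (the fraction of test sentences that hit each first text) is an assumption made for illustration, as are the function name `similarity_by_text` and the dict-based full database.

```python
from collections import Counter


def similarity_by_text(full_db, sentences):
    """Return {first text name: fraction of test sentences that hit that text}.

    full_db maps sentence -> first text name; because each stored sentence
    belongs to exactly one first text, every hit yields a single match.
    The hit-ratio scoring is a hypothetical choice, not taken from the source.
    """
    if not sentences:
        return {}
    hits = Counter(full_db[s] for s in sentences if s in full_db)
    return {name: count / len(sentences) for name, count in hits.items()}


full_db = {"a": "T1", "b": "T1", "c": "T2"}
scores = similarity_by_text(full_db, ["a", "b", "c", "d"])
```

Because the mapping is one-to-one from sentence to first text name, each lookup is a single dictionary access, which reflects the unique-match property the abstract describes.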

Description

Technical field

[0001] The present invention relates to the field of Internet technology, in particular to a method and device for judging text similarity.

Background technique

[0002] At present, text similarity discrimination mainly relies on hash-based similarity calculation. This approach reduces the dimensionality of high-dimensional data probabilistically and is mainly used for large-scale data compression and for real-time or fast computing scenarios. Hash-based similarity calculation is often used with high-dimensional, large-volume data: a problem that cannot be stored or computed on the original information is transformed into a storable, computable problem in a mapped space. In judging the repetition of massive texts, approximate text query has many applications, such as Google web page deduplication and Google News collaborative filtering, all of which use hash methods to calculate approximate sim...

Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F16/3344, G06F40/279
Inventor 戴礼松许泽伟蔡晓鹏张渝姜江曾刘彬
Owner TENCENT TECH (SHENZHEN) CO LTD