Repeated text judgment method and apparatus

A method for judging repeated text, applied in the field of information processing, that addresses three problems of existing approaches: overly strict similarity judgment conditions, the low calculation efficiency of the cosine similarity algorithm, and a lack of algorithmic flexibility.

Inactive Publication Date: 2017-03-22
LETV HLDG BEIJING CO LTD +1

AI Technical Summary

Problems solved by technology

[0004] However, both of the above algorithms have drawbacks.
Specifically, text conversion and vector angle calculation in the cosine similarity algorithm require a large amount of calculation, which makes the calculation efficiency of the cosine simila



Examples


Embodiment 1

[0089] Referring to Figure 1, a flow chart of a method for judging repeated text according to Embodiment 1 of the present invention is shown. The method may specifically include the following steps:

[0090] Step 101. Judge whether the summary information respectively corresponding to the first text and the second text is repeated.

[0091] In the embodiment of the present invention, the first text and the second text may be news, articles, papers, etc. Correspondingly, the summary information may be news titles, article titles, article keywords, paper titles, paper summaries, paper keywords, etc., or a combination of these. The embodiment of the present invention can also be applied to webpage text, in which case the summary information can be one item, or a combination of items, such as the network address corresponding to the webpage text and the keywords of the webpage text.

[0092] When judging the first text and the second text, you can first judge...

Embodiment 2

[0106] Referring to Figure 2, a flow chart of a method for judging repeated text according to Embodiment 2 of the present invention is shown. The method may specifically include the following steps:

[0107] Step 201. Determine whether the similarity between the first summary information of the first text and the second summary information of the second text is greater than or equal to a preset similarity threshold.

[0108] In the embodiment of the present invention, the first text and the second text may be news, articles, papers, etc. Correspondingly, the summary information may be news titles, article titles, article keywords, paper titles, paper summaries, paper keywords, etc. It may also be a combination of the above information.

[0109] When judging whether two texts are repeated in the embodiment of the present invention, the summary information of the two texts is first checked for repetition. Specifically, the similarity between the two pieces of summary information can be calcula...
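The threshold test in Step 201 can be sketched as follows. The source text is truncated before it names a similarity measure, so `difflib.SequenceMatcher`, the function name `summaries_repeated`, and the 0.9 default threshold are all assumptions chosen for illustration, not the patent's method.

```python
from difflib import SequenceMatcher


def summaries_repeated(first_summary: str, second_summary: str,
                       threshold: float = 0.9) -> bool:
    """Return True when the two summaries are similar enough to count as repeated.

    Assumption: similarity is the SequenceMatcher ratio in [0, 1]; the patent's
    own measure is not visible in the truncated text.
    """
    similarity = SequenceMatcher(None, first_summary, second_summary).ratio()
    # Step 201: compare against the preset similarity threshold (>=).
    return similarity >= threshold
```

Because only short summary strings such as titles or keywords are compared at this stage, the check stays cheap regardless of how long the full texts are.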

Embodiment 3

[0144] Referring to Figure 3, a structural block diagram of an apparatus for determining repeated text according to Embodiment 1 of the present invention is shown. The apparatus may specifically include:

[0145] A summary information judging module 301, configured to judge whether the summary information respectively corresponding to the first text and the second text is repeated.

[0146] A feature content extraction module 302, configured to extract the feature content of the first text and the second text respectively if the summary information is not repeated.

[0147] A feature content identification module 303, configured to identify whether the feature content respectively corresponding to the first text and the second text is repeated.

[0148] A repeated text determining module 304, configured to determine that the first text and the second text are repeated if the feature content is repeated.
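The four modules can be read as a short pipeline, sketched below. This is a minimal illustration under stated assumptions: the visible text does not say how feature content is extracted or compared, so the "longest sentences" heuristic, the exact-match comparison, the 0.9 threshold, and every function name here are hypothetical. Treating a summary match as a duplicate is this sketch's reading of the abstract.

```python
import re
from difflib import SequenceMatcher
from typing import List


def summary_repeated(summary_a: str, summary_b: str, threshold: float = 0.9) -> bool:
    # Module 301 (assumed measure: SequenceMatcher ratio against a threshold).
    return SequenceMatcher(None, summary_a, summary_b).ratio() >= threshold


def extract_feature_content(text: str, top_n: int = 3) -> List[str]:
    # Module 302 (hypothetical heuristic): keep the top_n longest sentences,
    # lower-cased and stripped, as the text's feature content.
    sentences = [s.strip().lower() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:top_n]


def feature_repeated(features_a: List[str], features_b: List[str]) -> bool:
    # Module 303 (hypothetical rule): non-empty and identical feature sets.
    return bool(features_a) and set(features_a) == set(features_b)


def texts_repeated(summary_a: str, text_a: str, summary_b: str, text_b: str) -> bool:
    # Module 304: combine the checks into the final decision.
    if summary_repeated(summary_a, summary_b):
        return True  # assumption: repeated summaries already settle the question
    return feature_repeated(extract_feature_content(text_a),
                            extract_feature_content(text_b))
```

The full texts are touched only when the summaries differ, and even then only a few extracted sentences are compared, which is consistent with the stated goal of keeping the calculation amount small.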

[0149] According to the embodiment of the pre...


Abstract

The invention discloses a method and an apparatus for judging repeated text. The method comprises: judging whether the summary information corresponding to a first text and a second text is repeated; if the summary information is not repeated, extracting the feature content of the first text and of the second text; identifying whether the feature content corresponding to the first text and the second text is repeated; and, if the feature content is repeated, determining that the first text and the second text are repeated. According to the embodiments of the invention, the judgment of repeated text can be completed for texts with the same summary information, while for texts with different summary information the feature content is extracted and compared to complete the judgment. The judgment process therefore requires relatively little calculation, is relatively efficient, and the algorithm is flexible to use.

Description

Technical field

[0001] The invention relates to information processing technology, in particular to a method for judging repeated text and a device for judging repeated text.

Background

[0002] In text processing, text deduplication is often used to remove repeated information. At present, to determine whether two texts are duplicates, the similarity of the two texts is usually calculated first and then compared with a threshold. If the calculated similarity is greater than the similarity threshold, the two texts are determined to be duplicates.

[0003] Commonly used text similarity algorithms include the cosine similarity algorithm and the text hash algorithm. The cosine similarity algorithm converts the texts into vectors and calculates the cosine of the angle between the vectors; the smaller the angle, the higher the similarity of the two texts. The text hash algorithm maps the text to the corresponding ha...
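For reference, the cosine similarity baseline described in the background can be sketched as follows. This is an illustrative sketch, not code from the patent: whitespace tokenization and raw term counts are assumptions (Chinese text would need a word segmenter), and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    # Convert each text into a sparse term-frequency vector
    # (assumption: tokens are lower-cased, whitespace-separated words).
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the terms the two vectors share.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Both full texts must be tokenized and vectorized before a single number comes out, which is the cost the invention aims to avoid by checking summary information first.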


Application Information

IPC(8): G06F17/22
CPC: G06F40/194
Inventor 康潮明
Owner LETV HLDG BEIJING CO LTD