Text repeatability determination method and device, electronic equipment and storage medium

A judging method and judging device, applied in the Internet field, that address problems such as frequent misjudgment, inaccurate determination of text repeatability, and high consumption of storage and computing resources.

Active Publication Date: 2020-07-24
KE COM (BEIJING) TECHNOLOGY CO LTD

AI Technical Summary

Problems solved by technology

The shingle vector space of the k-shingle algorithm is huge and grows further as k increases, so the storage and computing resources consumed are high. In addition, the k-shingle algorithm compares texts at a fine, word-level granularity. Because the number of words is limited, repeated words in two texts do not mean that their sentences are repeated, yet under the k-shingle algorithm two texts are judged as repeated as long as they share repeated words. This easily leads to misjudgment, so the repetition determination of the text is not accurate.
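The two problems described above can be illustrated with a minimal sketch. It assumes word-level shingles compared with Jaccard similarity, which is one common way to use k-shingles; the function names and the sample sentences are hypothetical and do not come from the patent.

```python
def word_shingles(text: str, k: int) -> set[tuple[str, ...]]:
    """Return the set of word-level k-shingles (runs of k consecutive words)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shingle_space_size(vocabulary_size: int, k: int) -> int:
    """Upper bound on distinct k-shingles: the space grows as vocabulary_size ** k."""
    return vocabulary_size ** k


# Two sentences that reuse the same words in a different order.
first = "the house is near the park"
second = "the park is near the house"

# With k = 1 the shingle sets are identical, so the texts look fully repeated
# even though the sentences differ.
similarity_k1 = jaccard(word_shingles(first, 1), word_shingles(second, 1))

# A larger k separates the sentences better, but the shingle space explodes.
space_k3 = shingle_space_size(5000, 3)
```

Here `similarity_k1` shows the misjudgment caused by shared words, and `space_k3` shows how quickly the vector space grows for an assumed vocabulary of 5000 words.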

Embodiment Construction

[0088] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that unless specifically stated otherwise, the relative arrangement, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0089] Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, or modules, etc.; they do not carry any specific technical meaning, nor do they imply a necessary logical order between them.

[0090] It should also be understood that in the embodiments of the present disclosure, "plurality" may refer to two or more than two, and "at least one" may refer to one, two or more than two.

[0091] It should also be understood that any component, data, or structure mentioned in the embodiments of ...

Abstract

An embodiment of the invention discloses a method and device for determining text repeatability, electronic equipment, and a storage medium. The method comprises: identifying the length of a target text; if the length of the target text is greater than a first preset value, segmenting the target text into a plurality of text segments whose lengths are each not greater than the first preset value; obtaining a signature value for each of the text segments using the simhash signature algorithm; and determining whether the target text repeats a historical text based on the signature values of the text segments and the inverted index information of historical texts in the distributed full-text search engine ElasticSearch. ElasticSearch stores inverted index information for at least one historical text; this information comprises the signature value of the historical text and, when the length of the historical text is greater than the first preset value, the signature value of each text segment into which the historical text was segmented. The embodiment thereby determines repetition of text content at the sentence level and can effectively identify cases where only part of the content of two texts is repeated.
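The flow in the abstract can be sketched as follows. This is an illustrative sketch under several assumptions, not the patented implementation: `MAX_LEN` stands in for the first preset value, segments are cut at fixed character offsets, simhash features are character bigrams hashed with MD5, an in-memory dictionary stands in for the ElasticSearch inverted index, and a text counts as repeated when any signature matches exactly. All names are hypothetical.

```python
import hashlib
from collections import defaultdict

MAX_LEN = 20  # hypothetical stand-in for the "first preset value"


def split_segments(text: str, max_len: int = MAX_LEN) -> list[str]:
    """Cut text into consecutive segments no longer than max_len characters."""
    if len(text) <= max_len:
        return [text]
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash over character bigrams, each with weight 1."""
    features = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def signatures(text: str) -> set[int]:
    """Signature of the whole text plus, for long texts, one per segment."""
    sigs = {simhash(text)}
    if len(text) > MAX_LEN:
        sigs |= {simhash(segment) for segment in split_segments(text)}
    return sigs


class SignatureIndex:
    """In-memory inverted index from signature value to document ids."""

    def __init__(self) -> None:
        self._postings: dict[int, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for sig in signatures(text):
            self._postings[sig].add(doc_id)

    def find_repeats(self, text: str) -> set[str]:
        """Return ids of historical texts sharing at least one signature."""
        hits: set[str] = set()
        for sig in signatures(text):
            hits |= self._postings.get(sig, set())
        return hits


index = SignatureIndex()
index.add("doc1", "abcdefghijklmnopqrst" + "0123456789ABCDEFGHIJ")

# Only the first 20 characters are shared with doc1, yet the segment
# signature still matches, so the partial repetition is found.
partial_hits = index.find_repeats("abcdefghijklmnopqrst" + "zzzzzzzzzzzzzzzzzzzz")
```

Because this sketch cuts segments at fixed offsets and matches signatures exactly, it only finds repeated passages that line up with segment boundaries. Segmenting on sentence boundaries and allowing a small Hamming distance between signatures are common refinements, mentioned here as general practice rather than as details confirmed by the source.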

Description

Technical field

[0001] The present disclosure relates to Internet technology, in particular to a method and device for judging text repeatability, electronic equipment, and a storage medium.

Background

[0002] Text similarity calculation has a wide range of applications in the fields of information retrieval, data mining, machine translation, and document copy detection. For example, when it is used to control public opinion on a Weibo website, the sentences that need to be restricted can be entered into a database in advance. When a user posts a Weibo, the content entered by the user is first compared with the sentences in the database; if it matches a sentence in the database, the user is not allowed to post it.

[0003] In the process of implementing the present disclosure, the inventor found that existing text similarity algorithms cannot handle the determination of partially repeated text well. For example, for the commonly used K-shingle alg...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/31, G06F16/33, G06F40/289, G06F40/12
CPC: G06F16/319, G06F16/3344
Inventor: 李东升, 崔鸣, 江霜艳, 王文彬
Owner: KE COM (BEIJING) TECHNOLOGY CO LTD