Text repeatability judgment method and device, electronic equipment and storage medium

A text repetition judgment method, applied in the Internet field, that addresses the inability of existing approaches to account for sentence-level repetition, inaccurate judgment of text repetition, and the large shingle vector space.

Active Publication Date: 2020-02-04
KE COM (BEIJING) TECHNOLOGY CO LTD

AI Technical Summary

Problems solved by technology

The shingle vector space of the k-shingle algorithm is huge and grows further as the value of k increases, so the storage and computing resources consumed are high. In addition, the comparison granularity of the k-shingle algorithm is finer; due to the limited numb
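The growth of the shingle space can be illustrated with a minimal sketch. It assumes character-level shingles and Jaccard similarity, which are the usual textbook formulation of k-shingling and are not confirmed by this page; the function names are hypothetical.

```python
from __future__ import annotations


def k_shingles(text: str, k: int) -> set[str]:
    """Return the set of all contiguous k-character substrings of text."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shingle_space_size(alphabet_size: int, k: int) -> int:
    """Upper bound on the number of distinct k-shingles over an alphabet."""
    return alphabet_size ** k
```

Under these assumptions the bound is exponential in k: a 26-letter alphabet allows at most 26**3 = 17576 distinct 3-shingles but 26**5 = 11881376 distinct 5-shingles, which is the storage pressure the paragraph above describes.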

Embodiment Construction

[0088] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangements of components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

[0089] Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices or modules, etc., and do not imply any necessary logical sequence among them.

[0090] It should also be understood that in the embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two, or more.

[0091] It should also be understood that any component, data or structure mentioned in the embodiments of the present disclosure can generally be understood as one or more unless there i...

Abstract

The embodiments of the invention disclose a text repeatability judgment method and device, electronic equipment and a storage medium. The method comprises: identifying the length of a target text; if the length of the target text is greater than a first preset value, segmenting the target text into a plurality of text segments, each no longer than the first preset value; obtaining a signature value for each of the text segments using the simhash signature algorithm; and judging whether the target text repeats a historical text based on the signature values of the text segments and the inverted index information of historical texts in the distributed full-text search engine ElasticSearch. Inverted index information of at least one historical text is stored in ElasticSearch. The inverted index information of a historical text comprises the signature value of the historical text and, when the length of the historical text is greater than the first preset value, the signature value of each text segment into which the historical text is segmented. The method thereby judges repetition of text content at the sentence level and can effectively identify cases in which only part of the content of two texts is repeated.
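The steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: MAX_LEN stands in for the "first preset value", segmentation is done on sentence punctuation with a hard wrap, simhash uses unweighted character bigrams hashed with MD5, and a plain in-memory dictionary (the hypothetical SignatureIndex class) stands in for the ElasticSearch inverted index with exact-match lookup of signatures.

```python
import hashlib
import re
from collections import defaultdict

MAX_LEN = 20      # hypothetical stand-in for the "first preset value"
HASH_BITS = 64    # assumed signature width


def split_text(text, max_len=MAX_LEN):
    """Split text into segments no longer than max_len.

    Assumption: split on sentence punctuation first, then hard-wrap
    any sentence that is still longer than max_len.
    """
    if len(text) <= max_len:
        return [text]
    sentences = [s.strip() for s in re.split(r"[.!?;。！？；\n]+", text) if s.strip()]
    segments = []
    for s in sentences:
        segments.extend(s[i:i + max_len] for i in range(0, len(s), max_len))
    return segments


def simhash(text, bits=HASH_BITS):
    """Simhash signature over character bigrams, each with weight 1."""
    features = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit of the signature is set when the weighted vote for it is positive.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


class SignatureIndex:
    """In-memory stand-in for the ElasticSearch inverted index (hypothetical)."""

    def __init__(self):
        # signature value -> ids of historical texts containing it
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index the whole-text signature and every segment signature."""
        self._postings[simhash(text)].add(doc_id)
        for segment in split_text(text):
            self._postings[simhash(segment)].add(doc_id)

    def repeated_segments(self, text):
        """Map each historical doc id to the target segments it also contains."""
        hits = defaultdict(list)
        for segment in split_text(text):
            for doc_id in self._postings.get(simhash(segment), ()):
                hits[doc_id].append(segment)
        return dict(hits)


index = SignatureIndex()
index.add("h1", "The flat is near the subway. It has two bedrooms. Rent is paid monthly.")
hits = index.repeated_segments("Brand new listing today. It has two bedrooms. Call me any time.")
```

Because lookup is by segment signature rather than by whole-text signature, two texts that share only one sentence still produce a hit for that sentence. A real deployment would need a decision rule on top of the hits (for example a minimum number or share of repeated segments); the page does not state which rule the invention uses.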

Description

Technical field

[0001] The present disclosure relates to Internet technologies, and in particular to a method and device for determining text repetition, electronic equipment and a storage medium.

Background

[0002] Text similarity calculation is widely used in information retrieval, data mining, machine translation, document duplication detection and other fields. For example, when it is used to control public opinion on a microblog website, the sentences that need to be restricted can be entered into a database in advance. When a user posts a microblog, the content entered by the user is first compared with the sentences in the database, and if it matches a sentence in the database, the user is not allowed to post it.

[0003] In the process of realizing the present disclosure, the inventors found that existing text similarity algorithms cannot adequately handle the judgment of partial repetition between texts. For example, for the com...

Application Information

IPC(8): G06F16/31, G06F16/33, G06F40/289, G06F40/12
CPC: G06F16/319, G06F16/3344
Inventor: 李东升, 崔鸣, 江霜艳, 王文彬
Owner: KE COM (BEIJING) TECHNOLOGY CO LTD