Document repeatability identification method and device, electronic equipment and storage medium

A document repeatability identification technology, applied in electrical digital data processing, instruments, calculations, etc., that addresses the problem of duplicate or similar documents impairing accurate information transmission and work efficiency.

Pending Publication Date: 2021-06-08
CHINA CONSTRUCTION BANK

AI Technical Summary

Problems solved by technology

If incoming documents are not filtered, a large number of duplicate or similar documents accumulate.



Examples


Embodiment 1

[0027] Figure 1 is a flow chart of a document repeatability identification method provided by Embodiment 1 of the present application. This embodiment is applicable to identifying duplicate documents, and in particular duplicate financial documents. The method can be executed by a document repeatability identification device, which is implemented in software and/or hardware and can be integrated into an electronic device that carries a document repeatability identification function, such as a server.

[0028] As shown in Figure 1, the method may specifically include:

[0029] S110. Extract at least two target words of the target document, and construct a target word sequence of the target document according to the at least two target words.

[0030] Here, the target document is the document to be checked for duplication; a target word is a word in the target document that can express the main mea...
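Step S110 can be sketched as follows. This is only an illustration, assuming the document has already been tokenized and part-of-speech tagged by some upstream tool. The tag names (`n`, `v`, `q`) and the helper `build_target_word_sequence` are hypothetical and do not come from the application; the choice of nouns, verbs and quantity words follows the abstract.

```python
from typing import Iterable, List, Tuple

# Hypothetical tag set: n = noun, v = verb, q = quantity word.
TARGET_TAGS = {"n", "v", "q"}


def build_target_word_sequence(
    tagged_tokens: Iterable[Tuple[str, str]],
) -> List[str]:
    """Keep words whose tag is a target tag, preserving document order."""
    sequence = [word for word, tag in tagged_tokens if tag in TARGET_TAGS]
    if len(sequence) < 2:
        # S110 requires at least two target words.
        raise ValueError("fewer than two target words found")
    return sequence


# Toy input; a real system would obtain (word, tag) pairs from a tagger.
tagged = [
    ("bank", "n"),
    ("the", "u"),
    ("raises", "v"),
    ("rates", "n"),
    ("quickly", "d"),
]
sequence = build_target_word_sequence(tagged)
print(sequence)
```

Preserving document order matters here because later steps traverse the sequence with a sliding window.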

Embodiment 2

[0047] Figure 2 is a flow chart of a document repeatability identification method provided in Embodiment 2 of the present application. Building on the above embodiment, it further refines the step "determine the feature vector of the target document according to the one-hot encoding vector of each word in the target word sequence" and provides an optional implementation.

[0048] As shown in Figure 2, the method may specifically include:

[0049] S210. Extract at least two target words of the target document, and construct a target word sequence of the target document according to the at least two target words.

[0050] S220. Determine the one-hot encoding vector of each word in the target word sequence.

[0051] S230. Based on the set sliding window, traverse the target word sequence to obtain at least two word segments.

[0052] In this embodiment, the size of the sliding window is set by those skilled in the art according to the actual situation.
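Steps S220 and S230 can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary is built from the target word sequence itself, the window advances one word at a time, and the function names `one_hot_vectors` and `sliding_segments` are hypothetical. The application leaves the window size to the practitioner.

```python
from typing import Dict, List


def one_hot_vectors(sequence: List[str]) -> Dict[str, List[int]]:
    """Map each distinct word to a one-hot vector over the sorted vocabulary."""
    vocab = sorted(set(sequence))
    return {
        word: [1 if i == j else 0 for j in range(len(vocab))]
        for i, word in enumerate(vocab)
    }


def sliding_segments(sequence: List[str], window: int) -> List[List[str]]:
    """Traverse the sequence with a fixed-size window (assumed stride of 1)."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and len(sequence)")
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


sequence = ["bank", "raises", "rates", "bank"]
encoding = one_hot_vectors(sequence)
segments = sliding_segments(sequence, window=2)
print(encoding["bank"])
print(segments)
```

With a sequence of length n and a window of size w, this traversal yields n - w + 1 word segments, so the window must be smaller than the sequence to obtain at least two segments.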

[0053] In this embodiment, b...

Embodiment 3

[0065] Figure 3 is a flow chart of a document repeatability identification method provided in Embodiment 3 of the present application. Building on the above embodiments, a global signature and a local signature are added to further optimize the method.

[0066] As shown in Figure 3, the method may specifically include:

[0067] S310. Determine the global signature and local signature of the target document.

[0068] In this embodiment, the global signature is used to characterize the overall features of the target document; the local signature is used to characterize the salient local features of the target document.

[0069] Optionally, determining the global signature of the target document may be to perform a hash operation on text information in the target document to obtain the global signature of the target document. Specifically, the text information in the target document can be hashed using a Secure Hash Algorithm (SHA) t...
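The global signature step can be sketched with Python's standard `hashlib` module. The excerpt does not show which SHA variant is used, so SHA-256 is an assumption here, as are the UTF-8 encoding and the function name `global_signature`.

```python
import hashlib


def global_signature(text: str) -> str:
    """Hash the full document text; SHA-256 is an assumed variant."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


doc_a = "The central bank raised rates today."
doc_b = "The central bank raised rates today."
doc_c = "The central bank raised rates today!"

# Identical text yields an identical signature; any change alters it.
print(global_signature(doc_a) == global_signature(doc_b))
print(global_signature(doc_a) == global_signature(doc_c))
```

Because a cryptographic hash changes completely on any edit, a signature of this kind can only detect exact duplicates, which is consistent with pairing it with a separate local signature and a feature-vector comparison.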


Abstract

The invention relates to the technical field of artificial intelligence, and discloses a document repeatability identification method and device, electronic equipment and a storage medium. The method includes: extracting at least two target words of a target document, and constructing a target word sequence of the target document from the at least two target words, wherein the target words include at least nouns, verbs and quantity words in the target document; determining a one-hot encoding vector of each word in the target word sequence; determining a feature vector of the target document according to the one-hot encoding vector of each word in the target word sequence; and determining the repeatability of the target document according to the distance between the feature vectors of other documents and the feature vector of the target document. The technical scheme improves the accuracy of duplicate document detection while balancing the time complexity and space complexity of duplicate checking, and provides a new approach to identifying duplicate documents.
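The final comparison step can be sketched as follows. The excerpt does not state how the one-hot vectors are combined into a feature vector or which distance is used, so this sketch assumes the simplest choices: the feature vector is the sum of the one-hot vectors over a shared vocabulary, and the distance is cosine distance. The threshold value and all function names are hypothetical.

```python
import math
from typing import List


def feature_vector(sequence: List[str], vocab: List[str]) -> List[int]:
    """Sum the one-hot vectors of the words (assumed aggregation)."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for word in sequence:
        vec[index[word]] += 1
    return vec


def cosine_distance(a: List[int], b: List[int]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_duplicate(a: List[int], b: List[int], threshold: float = 0.1) -> bool:
    """Flag as duplicate when the distance is below a hypothetical threshold."""
    return cosine_distance(a, b) < threshold


seq_a = ["bank", "raises", "rates"]
seq_b = ["bank", "raises", "rates"]
seq_c = ["fund", "cuts", "fees"]
vocab = sorted(set(seq_a + seq_b + seq_c))

vec_a, vec_b, vec_c = (feature_vector(s, vocab) for s in (seq_a, seq_b, seq_c))
print(is_duplicate(vec_a, vec_b))
print(is_duplicate(vec_a, vec_c))
```

Summing one-hot vectors discards word order; Embodiment 2 instead works on sliding-window word segments, which would retain some local order.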

Description

Technical field

[0001] The embodiments of the present application relate to the technical field of artificial intelligence, in particular to the technical field of natural language processing, and specifically to a method, device, electronic device, and storage medium for identifying document repeatability.

Background

[0002] With the development of Internet technology, documents of all kinds from all walks of life can be obtained from the Internet. For example, financial institutions receive a large number of financial documents from the Internet every day, including market bulletins, financial news, research reports, policy interpretations, announcements, etc. Many documents from different data sources are identical or similar. Without filtering, a large number of duplicate or similar documents flood in, which greatly impairs the accurate transmission of information and reduces work efficiency. Therefore, it is particularly important to ...


Application Information

IPC(8): G06F40/284, G06F40/44
CPC: G06F40/284, G06F40/44
Inventor: 林妙真, 陈文森, 李木海, 林倩, 林宜领, 朱富荣
Owner CHINA CONSTRUCTION BANK