
A method and device for determining document similarity based on document mixing features

A document-similarity technique based on mixed features, applied in the field of computer search. It addresses the failure of existing methods to consider document attribute features, keyword features, and the similarity relationships among them. The stated effects are that the method is easy to understand, provides a solid technical foundation, and improves similarity judgment.

Active Publication Date: 2021-06-01
北京明朝万达科技股份有限公司

AI Technical Summary

Problems solved by technology

[0009] (2) The document attribute characteristics in the document are not considered
[0010] (3) The document similarity relationship between keyword features, regular features and document attributes is not considered



Examples


Specific Embodiment

[0073] An enterprise wants to judge the similarity of documents that contain user salary information. The salary information in each document includes the user's name, ID card number, bank card number, mobile phone number, and so on. The process also involves establishing matching rules:

[0074]-[0075] Matching-rule features:
Feature 1: ID card
Feature 2: UnionPay card number
Feature 3: mobile phone number
...

[0076] 1. Determine the regular expressions for the ID card number, bank card number, mobile phone number, and so on. Determine the post-processing script for each match: for the ID card, check the province, the date of birth, and whether the last digit is correct; for the UnionPay bank card number, check the card BIN at the start of the number and apply Luhn verification. Finally, form the sequence xyz of the above three features.

[0077] 2. Extract keywords related to salary information in the document, such as position level, department information, performance, allowance, etc.

[0078] 3. Extrac...
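
Step 1 of the list above (regular expressions plus post-processing checks) might look like the following minimal sketch. The three patterns, the function names, and the "62" prefix test for the UnionPay BIN are assumptions made for illustration; the source does not give its actual expressions or scripts. The ID check digit follows the standard 18-digit weighting scheme, and the province check is omitted.

```python
import re
from datetime import datetime

# Hypothetical patterns: the source does not give its actual expressions.
ID_CARD_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
BANK_CARD_RE = re.compile(r"(?<!\d)\d{16,19}(?!\d)")
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

# Weights and check characters of the 18-digit resident ID number.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_card_ok(number: str) -> bool:
    """Post-process an ID match: birth date and last (check) digit."""
    if len(number) != 18 or not number[:17].isdigit():
        return False
    try:
        datetime.strptime(number[6:14], "%Y%m%d")  # digits 7-14 hold the birth date
    except ValueError:
        return False
    total = sum(int(d) * w for d, w in zip(number[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == number[17].upper()


def luhn_ok(number: str) -> bool:
    """Luhn verification: double every second digit counted from the right."""
    if not number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def extract_features(text: str) -> dict:
    """Return validated matches per feature. Overlaps between features are not resolved."""
    ids = [m for m in ID_CARD_RE.findall(text) if id_card_ok(m)]
    # Assumption: a UnionPay BIN is approximated by the "62" prefix.
    cards = [m for m in BANK_CARD_RE.findall(text)
             if m.startswith("62") and luhn_ok(m)]
    mobiles = MOBILE_RE.findall(text)
    return {"id_card": ids, "bank_card": cards, "mobile": mobiles}
```

Calling `extract_features` on a line of a salary sheet yields one list per feature, which corresponds to the three features x, y, z named in step 1.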



Abstract

The invention discloses a method and device for judging document similarity based on mixed document features. The method includes the following steps: performing regular expression matching on input files or data streams; if the matching fails, ending the process; if the matching succeeds, reprocessing the multiple feature strings output by the regular expression matching; managing the multiple reprocessing results in linked lists to form multiple feature linked lists; performing linked list traversal and feature merging on the multiple feature linked lists; and outputting the similarity judgment result. This solution greatly improves the ability to identify table data in structured documents and to judge the similarity of Excel table-type documents. It is faster, easier to understand, and suited to actual business needs, and it provides a solid technical foundation for data management and control.
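
The flow in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: plain Python lists stand in for the feature linked lists, "feature merging" is read as walking the lists in step and combining the entries at each position into one record, and the Jaccard overlap used for the final score is a stand-in, since the abstract does not say how the merged features are scored. All function names are hypothetical.

```python
from itertools import zip_longest


def merge_feature_lists(feature_lists):
    """Traverse the per-feature lists in step and merge each position into one record."""
    return [tuple(rec) for rec in zip_longest(*feature_lists, fillvalue=None)]


def jaccard(a, b):
    """Share of distinct records common to both collections (assumed scoring rule)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def similarity(doc_a_features, doc_b_features):
    """Return a score in [0, 1], or None when either document had no matches."""
    if not any(doc_a_features) or not any(doc_b_features):
        return None  # matching failed, so the process ends without a score
    return jaccard(merge_feature_lists(doc_a_features),
                   merge_feature_lists(doc_b_features))
```

Each argument is a list of feature lists, for example one list of ID numbers, one of card numbers, and one of phone numbers per document. Merging by position suits table-like data, where the i-th entry of every feature list comes from the same row.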

Description

Technical field

[0001] The invention relates to the field of computer search, in particular to a method and device for judging document similarity based on mixed document features.

Background art

[0002] Document similarity judgment is widely used in applications such as Internet search, public opinion reporting, and enterprise classification. Many methods therefore exist for text similarity recognition, both for structured table-type documents and for unstructured character-type documents.

[0003] However, documents containing tables are a common format in the daily business of enterprises, and they often carry much of an enterprise's business information or sensitive data. For example, in a financial report, apart from the descriptive text, the tables may contain the more sensitive information, such as the company's various financial indicators. This kind of unstructured document containing many tables is different from both structured doc...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/194, G06F40/30
CPC: G06F40/194, G06F40/30
Inventor: 魏效征, 王志海, 喻波, 安鹏
Owner: 北京明朝万达科技股份有限公司