Looking for breakthrough ideas for innovation challenges? Try Patsnap Eureka!

Document similarity measurement method and system based on keyword sequence structure

A technology of keyword sequence and document similarity, applied in unstructured text data retrieval, text database query, special data processing applications, etc. question

Active Publication Date: 2019-08-27
ZHENJIANG COLLEGE
View PDF5 Cites 1 Cited by
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0003] Purpose of the invention: In order to overcome the deficiencies in the prior art, the present invention provides a method for measuring document similarity based on keyword sequence structure, which can solve the semantics of document words and sentences The problem of the deviation of angle measurement similarity; also can avoid existing method when measuring similarity from keyword angle, to the insufficient problem that keyword is extracted in document full-text distribution structural feature, the present invention also provides a kind of based on keyword sequence structure document similarity measurement system

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
View more

Image

Smart Image Click on the blue labels to locate them in the text.
Viewing Examples
Smart Image
  • Document similarity measurement method and system based on keyword sequence structure
  • Document similarity measurement method and system based on keyword sequence structure
  • Document similarity measurement method and system based on keyword sequence structure

Examples

Experimental program
Comparison scheme
Effect test

Embodiment 1

[0037] The present invention provides a method for measuring document similarity based on keyword position structure distribution, the method comprising:

[0038] S1 stores two documents W 1 with W 2 , the document W 1 with W 2 Both have multiple natural segments, the two stored documents W 1 with W 2 Separate word segmentation and stop word processing.

[0039] S2 sets the keyword sequence, in the document W 1 with W 2 Search for the set of positions where all keywords in the keyword sequence appear in the keyword sequence;

[0040] keyword sequence S in W 1 An occurrence in means that m keywords in sequence S appear in document W 1 appears once in sequence. In the document W 1 Search for a certain occurrence of the keyword sequence S in , which can be recorded as: get the occurrence positions of m keywords Ponit={p 1 ,p 2 ,...,p m}, all occurrences form the set of occurrences of S in the document, where p i for keywords i In the document W 1 A certain occurr...

Embodiment 2

[0054] The present invention also provides a document similarity measurement system based on keyword sequence structure, comprising:

[0055] Document preprocessing module 1, used to store two documents W 1 with W 2 , the document W 1 with W 2 Both have multiple natural segments, the two stored documents W 1 with W 2 Separate word segmentation and stop word processing;

[0056] Appearance location statistics module 2, used to set the keyword sequence, and in the document W 1 with W 2 Search for the set of positions where all keywords in the keyword sequence appear in the keyword sequence;

[0057] keyword sequence S in W 1 An occurrence in means that m keywords in sequence S appear in document W 1 appears once in sequence. In document W 1 Find a certain occurrence of the keyword sequence S in , and obtain the occurrence positions of m keywords Ponit={p 1 ,p 2 ,...,p m}, all occurrences form the set of occurrences of S in the document, where p i for keywords i I...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

PUM

No PUM Login to View More

Abstract

The invention discloses a document similarity measurement method based on a keyword sequence structure, and the method comprises the steps of storing a document W1 and a document W2, setting a keywordsequence, and searching a position set where all keywords in the keyword sequence appear in the document W1 and the document W2; generating the feature sets of the keyword sequences in the documentsW1 and W2 according to the positions where the keywords appear so as to obtain the structural feature values of the keyword sequences in the documents W1 and W2; and calculating the similarity of thedocuments W1 and W2 with respect to the keyword sequence according to the structural characteristic values of the keyword sequence in W1 and W2. The method is beneficial to avoiding the deviation of the semantic angle measurement similarity of the words and sentences of the document, can avoid the defect that the causal relationship before and after a group of keywords is ignored in document distribution structure feature extraction when the similarity is measured from the angle of the keywords in an existing method, and is stronger in practicability and higher in accuracy.

Description

technical field [0001] The invention relates to the technical field of document similarity measurement, in particular to a method and system for document similarity measurement based on a keyword sequence structure. Background technique [0002] The similarity analysis and calculation between documents has a wide range of applications in information retrieval, data mining, machine translation, document duplication detection and other fields. A brief introduction to common document similarity calculation methods is as follows: cosine similarity, which converts documents into vector models based on keywords, and measures them by calculating the cosine similarity of documents; simple shared lexical method, which calculates the total number of characters of words shared by two documents Divide by the longest document character count to evaluate document similarity. Edit distance, also known as Levenshtein distance, is measured by the minimum number of editing operations require...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

Application Information

Patent Timeline
no application Login to View More
IPC IPC(8): G06F17/27G06F17/22G06F16/33G06F16/335
CPCG06F16/3344G06F16/335G06F40/194G06F40/289G06F40/30
Inventor 陆介平倪巍伟杨春立李爱东
Owner ZHENJIANG COLLEGE
Who we serve
  • R&D Engineer
  • R&D Manager
  • IP Professional
Why Patsnap Eureka
  • Industry Leading Data Capabilities
  • Powerful AI technology
  • Patent DNA Extraction
Social media
Patsnap Eureka Blog
Learn More
PatSnap group products