
Document similarity measurement method and system based on keyword position structure distribution

A similarity measurement and keyword technology, applied in unstructured text data retrieval, text database queries, special data processing applications, and related fields.

Active Publication Date: 2019-08-27
ZHENJIANG COLLEGE

AI Technical Summary

Problems solved by technology

[0003] Purpose of the invention: To overcome the deficiencies of the prior art, the present invention provides a method for measuring document similarity based on keyword position structure distribution. The method avoids the deviation that arises when similarity is measured from the semantic perspective of a document's words and sentences. It also avoids the insufficient extraction of the keywords' full-text distribution structure features that occurs when existing methods measure similarity from the perspective of keywords. The present invention also provides a document similarity measurement system based on keyword position structure distribution.



Examples


Embodiment 1

[0057] The present invention provides a method for measuring document similarity based on keyword position structure distribution, the method comprising:

[0058] S1: Store two documents, W1 and W2, each of which has multiple natural paragraphs. Perform word segmentation and stop-word removal on W1 and W2 separately, preserving the paragraph segmentation marks.
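
A minimal sketch of what step S1 might look like is given below. It is an illustration only, not the patent's implementation: the blank-line paragraph separator, the regex tokenizer (a stand-in for a real word segmenter), the `STOP_WORDS` list, and the function name `preprocess` are all assumptions introduced here.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Split text into paragraphs and tokenize each one.

    Returns a list of paragraphs, each a list of lowercase tokens with
    stop words removed. Keeping one inner list per paragraph preserves
    the paragraph boundaries (the "segmentation marks").
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Assumption: a simple regex tokenizer stands in for word segmentation.
        tokens = re.findall(r"\w+", paragraph.lower())
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


doc = "The cat sat in the garden.\n\nA dog and the cat play."
print(preprocess(doc))
# [['cat', 'sat', 'garden'], ['dog', 'cat', 'play']]
```

For languages written without spaces between words, such as Chinese, the regex tokenizer would need to be replaced by a dedicated word segmenter.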

[0059] S2: Set an arbitrary target keyword set. In documents W1 and W2, find the paragraph number and position information of every occurrence of each keyword, and mark each occurrence with a triplet.

[0060] Given a target keyword set S = {s_1, s_2, ..., s_i, ..., s_n}, where n > 1 is an integer and s_i is a keyword with 1 ≤ i ≤ n: for each keyword s_i in S, find all paragraphs and positions in document W1 where s_i occurs. For each occurrence, extract its paragraph and position information and mark it with a triplet of the form (x, y, s_i), where x is the keyword s...
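
The triplet marking in step S2 can be sketched as follows. The text above is truncated before it defines x and y, so this sketch assumes that x is the 1-based paragraph number and y is the 1-based token position inside that paragraph. The function name `find_triplets` and the list-of-token-lists input format are also assumptions made for illustration.

```python
def find_triplets(paragraphs, keywords):
    """Return (x, y, s) triplets for every keyword occurrence.

    paragraphs: list of paragraphs, each a list of tokens.
    keywords:   the target keyword set S.
    Assumption: x = 1-based paragraph number, y = 1-based token
    position within that paragraph, s = the matched keyword.
    """
    keyword_set = set(keywords)
    triplets = []
    for x, tokens in enumerate(paragraphs, start=1):
        for y, token in enumerate(tokens, start=1):
            if token in keyword_set:
                triplets.append((x, y, token))
    return triplets


paragraphs = [["cat", "sat", "garden"], ["dog", "cat", "play"]]
print(find_triplets(paragraphs, ["cat", "dog"]))
# [(1, 1, 'cat'), (2, 1, 'dog'), (2, 2, 'cat')]
```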

Embodiment 2

[0087] The present invention also provides a document similarity measurement system based on keyword position structure distribution, including:

[0088] A document preprocessing module 1, used to store two documents, W1 and W2, each of which has multiple natural paragraphs, and to perform word segmentation and stop-word removal on W1 and W2 separately while retaining the paragraph segmentation marks;

[0089] A keyword search module 2, used to set an arbitrary target keyword set, to find in documents W1 and W2 the paragraph number and position information of every occurrence of each keyword, and to mark each occurrence with a triplet;

[0090] The keyword search module also includes a position calculation unit 21 for calculating the position information of the keyword s_i within a natural paragraph. Specifically: let the total number of words in the natural paragraph containing the keyword s_i be sum; the number of words preceding the keyword s_i in that natural paragraph is recorded as p...


Abstract

The invention discloses a document similarity measurement method and system based on keyword position structure distribution. The method comprises the following steps: storing two documents, W1 and W2, each of which has multiple natural paragraphs; setting an arbitrary target keyword set, finding the paragraph number and position information of every occurrence of each keyword in W1 and W2, and marking each occurrence with a triplet; generating the position distribution sequences of the keywords in W1 and W2 from the paragraph numbers and position information; and calculating the similarity of these position distribution sequences to obtain the weighted similarity of the two documents. The method avoids the deviation that arises when similarity is measured from the semantic perspective of a document's words and sentences, and it overcomes the insufficient extraction of the keywords' full-text distribution structure features that occurs when existing methods measure similarity from the perspective of keywords. As a result, it is more practical and more accurate.
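
The last two steps of the abstract (position distribution sequences, then a weighted similarity) can be illustrated with the rough sketch below. The source text available here does not give the patent's actual sequence definition, similarity formula, or weighting scheme, so everything in this block is a swapped-in assumption: the position distribution is approximated by a fixed-size histogram of relative positions, sequences are compared by histogram intersection, and the per-keyword weights are supplied by the caller. The function names and the `bins` parameter are hypothetical.

```python
def position_sequence(triplets, keyword, paragraph_lengths, bins=4):
    """Histogram of where `keyword` occurs, by relative position in [0, 1).

    triplets: (x, y, s) with 1-based paragraph number x and 1-based
    token position y (assumed layout).
    paragraph_lengths: token count of each paragraph, in order.
    """
    total = sum(paragraph_lengths)
    # Document-wide offset of the first token of each paragraph.
    offsets = [sum(paragraph_lengths[:i]) for i in range(len(paragraph_lengths))]
    hist = [0] * bins
    for x, y, s in triplets:
        if s == keyword:
            rel = (offsets[x - 1] + y - 1) / total
            hist[int(rel * bins)] += 1
    return hist


def sequence_similarity(h1, h2):
    """Histogram intersection scaled to [0, 1] (an assumed measure)."""
    denom = max(sum(h1), sum(h2))
    if denom == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(h1, h2)) / denom


def weighted_similarity(t1, lens1, t2, lens2, weights):
    """Weighted average of per-keyword sequence similarities."""
    total_weight = sum(weights.values())
    score = 0.0
    for keyword, w in weights.items():
        h1 = position_sequence(t1, keyword, lens1)
        h2 = position_sequence(t2, keyword, lens2)
        score += w * sequence_similarity(h1, h2)
    return score / total_weight


t1 = [(1, 1, "cat"), (2, 1, "dog"), (2, 2, "cat")]
t2 = [(1, 3, "cat")]
weights = {"cat": 1.0, "dog": 1.0}  # hypothetical equal weights
print(weighted_similarity(t1, [3, 3], t1, [3, 3], weights))  # 1.0
print(weighted_similarity(t1, [3, 3], t2, [3, 3], weights))  # 0.0
```

The point of the sketch is only the shape of the pipeline: two documents that use the same keywords in the same relative places score high, while documents that share keywords but place them differently score low.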

Description

Technical field

[0001] The present invention relates to the technical field of document similarity measurement, in particular to a document similarity measurement method and system based on keyword position structure distribution.

Background

[0002] Similarity analysis and calculation between documents has a wide range of applications in information retrieval, data mining, machine translation, duplicate document detection, and other fields. Common document similarity calculation methods are briefly introduced as follows. Cosine similarity converts documents into keyword-based vector models and measures similarity by calculating the cosine of the angle between the document vectors. The simple shared-word method evaluates similarity by dividing the total number of characters in the words shared by two documents by the character count of the longer document. Edit distance, also known as Levenshtein distance, is measured by the minimum number of editing ope...
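
The three baseline measures named in the background can be sketched as follows. This is a hedged illustration, not code from the patent: the cosine version uses raw term counts as the vector model, the shared-word version counts each distinct shared word once (the source does not say how repeats are handled), and the edit distance is the standard Levenshtein dynamic program with insertions, deletions, and substitutions. All function names are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def shared_word_similarity(tokens_a, tokens_b):
    """Characters in shared words / character count of the longer document.

    Assumption: each distinct shared word is counted once.
    """
    shared_chars = sum(len(t) for t in set(tokens_a) & set(tokens_b))
    longest = max(sum(len(t) for t in tokens_a),
                  sum(len(t) for t in tokens_b))
    return shared_chars / longest if longest else 0.0


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(cosine_similarity(["cat", "dog"], ["bird", "fish"]))  # 0.0
```

None of these measures looks at where in the document a word occurs, which is the gap the keyword position approach is meant to address.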


Application Information

IPC(8): G06F16/33, G06F17/27
CPC: G06F16/3344, G06F40/258, G06F40/289
Inventor: 陆介平, 倪巍伟, 杨春立, 李爱东
Owner: ZHENJIANG COLLEGE