Document similarity measurement method and system based on keyword position structure distribution

A similarity measurement and keyword technology applied in unstructured text data retrieval, text database queries, special data processing applications, etc.

Active Publication Date: 2019-08-27

ZHENJIANG COLLEGE

2 Cites, 1 Cited by


Problems solved by technology

[0003] Purpose of the invention: To overcome the deficiencies of the prior art, the present invention provides a method for measuring document similarity based on keyword position structure distribution. The method avoids the deviation that arises when similarity is measured from the semantic perspective of a document's words and sentences. It also addresses the insufficient extraction of the keywords' full-text distribution structure features when existing methods measure similarity from the keyword perspective. The present invention also provides a document similarity measurement system based on keyword position structure distribution.



Examples


Embodiment 1

[0057] The present invention provides a method for measuring document similarity based on keyword position structure distribution, the method comprising:

[0058] S1: Store two documents W1 and W2, each of which has multiple natural segments. Word segmentation and stop-word processing are performed on each of the two stored documents separately, and the segmentation marks are preserved.
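Step S1 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: natural segments are taken to be separated by blank lines, tokens are found with a simple regular expression (a real system for Chinese text would use a dedicated word segmenter), and `STOP_WORDS` and `preprocess` are hypothetical names with a toy stop-word list.

```python
import re

# Hypothetical stop-word list; the patent excerpt does not name one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Split a document into natural segments, tokenize, drop stop words.

    Returns one token list per natural segment, so the boundaries
    between segments (the segmentation marks) are preserved.
    """
    # Assumption: natural segments are separated by one or more blank lines.
    segments = [s for s in re.split(r"\n\s*\n", text) if s.strip()]
    result = []
    for segment in segments:
        # Assumption: lowercase regex tokenization stands in for word segmentation.
        tokens = re.findall(r"\w+", segment.lower())
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


w1 = preprocess("The cat sat in the garden.\n\nA dog and the cat play.")
# w1 == [["cat", "sat", "garden"], ["dog", "cat", "play"]]
```

Keeping one list per segment, rather than flattening the document, is what lets a later step report a paragraph number for each keyword occurrence.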

[0059] S2: Set any target keyword set. In documents W1 and W2, find all the paragraph numbers and position information where each keyword appears, and mark them with triplets.

[0060] Given a target keyword set S = {s_1, s_2, ..., s_i, ..., s_n}, where n > 1 is an integer and s_i is a keyword, 1 ≤ i ≤ n: for each keyword s_i in S, find all the paragraphs and positions in document W1 where s_i occurs. For each occurrence, extract its paragraph and position information and mark it with a triplet of the form (x, y, s_i), where x is the keyword s...
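The triplet marking in S2 might look like the sketch below. The definition of x and y is cut off in this excerpt, so the code assumes, based on [0059], that x is the 1-based number of the natural segment and y is the 1-based word position inside that segment. `find_triplets` is a hypothetical helper name, and the input is assumed to be a list of token lists, one per natural segment.

```python
def find_triplets(segments, keywords):
    """Return (x, y, s) triplets for every keyword occurrence.

    Assumption: x is the 1-based natural-segment number and y is the
    1-based word position inside that segment. The excerpt truncates
    the actual definition, so this is illustrative only.
    """
    keyword_set = set(keywords)
    triplets = []
    for x, tokens in enumerate(segments, start=1):
        for y, token in enumerate(tokens, start=1):
            if token in keyword_set:
                triplets.append((x, y, token))
    return triplets


doc = [["cat", "sat", "garden"], ["dog", "cat", "play"]]
triplets = find_triplets(doc, ["cat", "dog"])
# triplets == [(1, 1, "cat"), (2, 1, "dog"), (2, 2, "cat")]
```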

Embodiment 2

[0087] The present invention also provides a document similarity measurement system based on keyword position structure distribution, including:

[0088] A document preprocessing module 1, used to store two documents W1 and W2, each of which has multiple natural segments; word segmentation and stop-word processing are performed on each of the two stored documents separately, and the segmentation marks are retained;

[0089] A keyword search module 2, used to set any target keyword set, find in documents W1 and W2 all the paragraph numbers and position information where each keyword appears, and mark them with triplets;

[0090] The keyword search module also includes a position calculation unit 21 for calculating the position information of keyword s_i within its natural segment. Specifically: the total number of words in the natural paragraph containing keyword s_i is recorded as sum; the number of words preceding keyword s_i in that natural paragraph is recorded as p...
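The two quantities named for the position calculation unit can be computed as in the sketch below. The excerpt is cut off before it says how sum and p are combined into a position value, so the code only returns the pair and does not guess at the formula. `position_counts` is a hypothetical name; `total` stands in for the text's "sum" to avoid shadowing Python's built-in.

```python
def position_counts(tokens, keyword):
    """Return (p, total) for each occurrence of keyword in one natural segment.

    total is the segment's word count ("sum" in the text) and p is the
    number of words before the occurrence. How the two are combined is
    not shown in this excerpt, so no combined value is computed here.
    """
    total = len(tokens)
    return [(p, total) for p, token in enumerate(tokens) if token == keyword]


counts = position_counts(["dog", "cat", "play", "cat"], "cat")
# counts == [(1, 4), (3, 4)]
```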


Abstract

The invention discloses a document similarity measurement method and system based on keyword position structure distribution. The method comprises the following steps: storing two documents W1 and W2, each of which has a plurality of natural segments; setting any target keyword set, searching documents W1 and W2 for all paragraph numbers and position information of each keyword, and marking them with triplets; generating the position distribution sequences of the keywords in W1 and W2 from the paragraph numbers and position information; and calculating the similarity of those position distribution sequences, thereby obtaining the weighted similarity of the two documents. The method avoids the deviation that arises when similarity is measured from the semantic perspective of a document's words and sentences, and it overcomes the insufficient extraction of the keywords' full-text distribution structure features when existing methods measure similarity from the keyword perspective, giving higher practicability and accuracy.

Description

Technical field

[0001] The present invention relates to the technical field of document similarity measurement, in particular to a document similarity measurement method and system based on keyword position structure distribution.

Background

[0002] Similarity analysis and calculation between documents is widely applied in information retrieval, data mining, machine translation, duplicate document detection and other fields. Common document similarity calculation methods include: cosine similarity, which converts documents into keyword-based vectors and measures similarity as the cosine of the angle between them; the simple shared-word method, which divides the total number of characters in the words shared by two documents by the character count of the longer document; and edit distance, also known as Levenshtein distance, which is measured by the minimum number of editing ope...
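For orientation, the three background measures named above can be sketched as follows. These are textbook formulations, not code from the patent, and several details are assumptions: documents are given as token lists, cosine similarity uses raw term counts, and the shared-word ratio counts each shared word once and takes a document's character count as the sum of its token lengths. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_word_ratio(tokens_a, tokens_b):
    """Characters in shared words divided by the longer document's characters.

    Assumption: each shared word is counted once.
    """
    shared_chars = sum(len(t) for t in set(tokens_a) & set(tokens_b))
    longest = max(sum(len(t) for t in tokens_a), sum(len(t) for t in tokens_b))
    return shared_chars / longest if longest else 0.0


def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

None of these measures looks at where in a document a keyword occurs, which is the gap the position-distribution approach described in this patent is aimed at.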


Application Information


IPC(8): G06F16/33, G06F17/27

CPC: G06F16/3344, G06F40/258, G06F40/289

Inventor: 陆介平, 倪巍伟, 杨春立, 李爱东

Owner: ZHENJIANG COLLEGE
