Improved file similarity measure method based on file structure

A technique of similarity measurement and document structure, applied in the fields of instrumentation, computing, electrical digital data processing, etc.

Active Publication Date: 2008-08-20

PEKING UNIV

4 Cites, 1 Cited by

Problems solved by technology

[0013] However, the above document-structure-based method has a shortcoming: the optimal matching model it adopts only allows one subtopic of one document to correspond to one subtopic of the other document, that is, only one-to-one matching between document subtopics is allowed.

Examples

Embodiment 1

[0064] As shown in Figure 4, each document is composed of several subtopics around a central theme, and each subtopic appears in the document as a text block, that is, a group of word strings or sentences reflecting that subtopic. There are many ways to obtain document subtopics, such as the text block segmentation method and the sentence clustering method. In preferred embodiment 1 of the present invention, the text block segmentation method (TextTiling) is used to analyze the document structure. The process, shown in Figure 1, comprises the following steps:

[0065] 1. Read in the two documents X and Y to be compared. Apply the text block segmentation method (TextTiling) to each document to obtain the subtopic sequences X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m}. The specific steps are:

[0066] ① Segment document X, dividing every 20 words into one word string, and the size of the word strin...
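
As a rough illustration of step ① only, the sketch below splits a document into consecutive word strings of 20 words. Whitespace tokenization and the function name `split_into_word_strings` are assumptions made for this example; the remaining TextTiling steps are truncated in this text and are not reproduced here.

```python
from typing import List


def split_into_word_strings(text: str, size: int = 20) -> List[List[str]]:
    """Split text into consecutive word strings of `size` words each.

    Assumption: words are separated by whitespace. The last word string
    may be shorter than `size` when the word count is not a multiple of it.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# A synthetic 45-word document: w0 w1 ... w44
doc = " ".join(f"w{i}" for i in range(45))
blocks = split_into_word_strings(doc)
print([len(b) for b in blocks])  # [20, 20, 5]
```

Each word string then serves as a basic unit from which neighbouring blocks can be compared to find subtopic boundaries.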

Embodiment 2

[0100] The second preferred embodiment of the present invention uses clustering technology to analyze the document structure, including the following steps:

[0101] 1. Read in the two documents X and Y that need to be compared, and use the clustering method to obtain the document subtopic sequence for the two documents X and Y respectively. The specific algorithm steps are:

[0102] ① Segment the read document and divide the document into n sentences;

[0103] ② Calculate the cosine similarity value between any two sentences;

[0104] ③ Use a data clustering method to cluster the sentences; the text block composed of all the sentences in each cluster is a subtopic. In this embodiment, agglomerative clustering is used to cluster the sentences, and the steps are:

[0105] a. Initially, each sentence is placed in its own cluster, giving k clusters in total;

[0106] b. The two clusters with the largest similarity value among the existing k clusters c 1 a...
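
The clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the patent's exact procedure: the text is truncated after step b, so the cluster similarity (average link) and the stopping rule (a similarity `threshold`) are hypothetical choices, as are the term-frequency sentence vectors and the function names.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Agglomerative clustering; returns clusters of sentence indices.

    Assumptions (not from the source): average-link cluster similarity,
    and merging stops once the best similarity drops below `threshold`.
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    sim = [[cosine(u, v) for v in vecs] for u in vecs]  # step 2: pairwise cosine
    clusters = [[i] for i in range(len(sentences))]     # step a: one cluster each

    def link(c1: List[int], c2: List[int]) -> float:
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # step b: pick the two most similar clusters (p < q by construction)
        p, q = max(combinations(range(len(clusters)), 2),
                   key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        if link(clusters[p], clusters[q]) < threshold:
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


sentences = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "the cat lay on the mat",
    "stock prices rose sharply today",
]
print(cluster_sentences(sentences))  # [[0, 2], [1, 3]]
```

Each resulting cluster of sentence indices corresponds to one subtopic text block.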

Abstract

To overcome the defects of the prior art, the method first obtains the subtopic structures of documents A and B through document structure analysis. It then builds a weighted bipartite graph G over the subtopics and solves the earth mover's distance EMD(A, B) by linear programming (LP). Finally, it derives the similarity from 1 - EMD(A, B). The invention improves accuracy and is more robust.
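
The abstract does not give the graph weights or the LP formulation, so the following is only a sketch under stated assumptions: subtopic weights of each document sum to 1, the edge distance between two subtopics is 1 minus their similarity, and the LP is solved with SciPy's `scipy.optimize.linprog`. The function name `emd_similarity` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def emd_similarity(w_a, w_b, sim):
    """Return 1 - EMD(A, B) for two weighted sets of subtopics.

    w_a, w_b : subtopic weights of A and B (assumed to each sum to 1)
    sim      : sim[i][j] in [0, 1], similarity of subtopic i of A and j of B
    Assumption: the edge distance in the bipartite graph is 1 - sim[i][j].
    """
    w_a, w_b = np.asarray(w_a, float), np.asarray(w_b, float)
    dist = 1.0 - np.asarray(sim, float)
    n, m = dist.shape
    # Flow variables f[i, j] are flattened row-major into a vector of length n*m.
    a_eq, b_eq = [], []
    for i in range(n):  # all weight of A's subtopic i is shipped out
        row = np.zeros((n, m))
        row[i, :] = 1
        a_eq.append(row.ravel())
        b_eq.append(w_a[i])
    for j in range(m):  # B's subtopic j receives exactly its weight
        col = np.zeros((n, m))
        col[:, j] = 1
        a_eq.append(col.ravel())
        b_eq.append(w_b[j])
    res = linprog(dist.ravel(), A_eq=np.array(a_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    # Total flow is 1 under the normalization assumption, so EMD equals the cost.
    return 1.0 - res.fun


# One subtopic in A is matched against two subtopics in B (one-to-many).
print(round(emd_similarity([1.0], [0.5, 0.5], [[0.8, 0.6]]), 3))  # 0.7
```

In this example the single subtopic of A splits its weight across both subtopics of B, which a one-to-one matching model cannot express.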

Description

Technical field

[0001] The invention belongs to the technical field of computer language processing and information retrieval, and in particular relates to an improved document similarity measurement method based on document structure.

Background technique

[0002] Document similarity measurement is a core issue in the field of text information processing. Many text applications, including document clustering, document retrieval, and document filtering, rely on accurate measurement of document similarity. Many document similarity measurement methods have been proposed and applied, such as the cosine measure, the Jaccard measure, and the Dice measure (reference: W.B. Frakes and R. Baeza-Yates: Information Retrieval: Data Structures and Algorithms, 1992), and methods based on information theory (reference: J.A. Aslam and M. Frost: An Information-theoretic Measure for Document Similarity. In Proceedings of SIGIR 2003), among which the cosine measure is the most widely used. ...
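
For orientation, the three classical measures named in the background can be sketched on term sets as below. This is an illustrative simplification: the set-based (binary) variants and whitespace tokenization are assumptions of this example, whereas practical systems usually apply the cosine measure to weighted term vectors.

```python
import math


def term_set(text: str) -> set:
    """Lowercased set of whitespace-separated terms (an assumed tokenization)."""
    return set(text.lower().split())


def cosine(a: set, b: set) -> float:
    """Binary cosine measure: |A ∩ B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard measure: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a: set, b: set) -> float:
    """Dice measure: 2 |A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


x = term_set("document similarity is a core issue")
y = term_set("document similarity drives document retrieval")
print(round(cosine(x, y), 3), jaccard(x, y), dice(x, y))  # 0.408 0.25 0.4
```

All three treat the document as a flat bag of terms, which is the limitation that structure-based methods aim to address.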

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F17/30, G06F17/27

Inventor: 万小军, 彭宇新, 杨建武, 吴於茜, 陈晓鸥

Owner: PEKING UNIV
