Improved file similarity measure method based on file structure

A technique of similarity measurement and document structure, applied in the fields of instrumentation, computing, electrical digital data processing, etc.

Active Publication Date: 2008-08-20
PEKING UNIV

AI Technical Summary

Problems solved by technology

[0013] However, the above document-structure-based method has a shortcoming: the optimal matching model it adopts allows a subtopic of one document to correspond to only one subtopic of the other document. In other words, only one-to-one matching between the subtopics of two documents is allowed.

Examples

Embodiment 1

[0064] As shown in Figure 4, each document is composed of several subtopics around a central theme, and each subtopic appears in the document as a text block, that is, a group of word strings or sentences reflecting that subtopic. There are many ways to obtain document subtopics, such as text block segmentation and sentence clustering. In preferred Embodiment 1 of the present invention, the text block segmentation method (TextTiling) is used to analyze the document structure. The process, shown in Figure 1, comprises the following steps:

[0065] 1. Read in the two documents X and Y to be compared, and use the text block segmentation method (TextTiling) to obtain their subtopic sequences X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m}. The specific steps are:

[0066] ① Segment the read document X, divide every 20 words into a word string, and the size of the word strin...
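The step list above is truncated, so the following is only a minimal sketch of the general TextTiling scheme, not the procedure of this embodiment. Only the 20-word string size comes from the text. The block size `k`, the depth-score cutoff (mean minus half a standard deviation), and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def token_sequences(text, w=20):
    """Split the text into consecutive word strings of w words each."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [words[i:i + w] for i in range(0, len(words), w)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def gap_scores(seqs, k=2):
    """Similarity of the k word strings before and after each gap (k is assumed)."""
    scores = []
    for g in range(1, len(seqs)):
        left = Counter(t for s in seqs[max(0, g - k):g] for t in s)
        right = Counter(t for s in seqs[g:g + k] for t in s)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each similarity valley relative to the nearest peaks on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def text_tiling(text, w=20, k=2):
    """Return a list of text blocks (candidate subtopics) for one document."""
    seqs = token_sequences(text, w)
    if not seqs:
        return []
    depths = depth_scores(gap_scores(seqs, k))
    if not depths:
        return [" ".join(seqs[0])]
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2  # assumed cutoff, not taken from the source
    blocks, current = [], list(seqs[0])
    for g, depth in enumerate(depths):
        if depth > 0 and depth > cutoff:
            blocks.append(current)  # place a boundary at this gap
            current = []
        current.extend(seqs[g + 1])
    blocks.append(current)
    return [" ".join(block) for block in blocks]
```

Calling `text_tiling(document_text)` on each of X and Y would yield the two subtopic sequences. A real implementation would typically also remove stop words and stem tokens before comparing blocks.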

Embodiment 2

[0100] The second preferred embodiment of the present invention uses clustering technology to analyze the document structure, including the following steps:

[0101] 1. Read in the two documents X and Y that need to be compared, and use the clustering method to obtain the document subtopic sequence for the two documents X and Y respectively. The specific algorithm steps are:

[0102] ① Segment the read document and divide the document into n sentences;

[0103] ② Calculate the cosine similarity value between any two sentences;

[0104] ③ Use a data clustering method to cluster the sentences; the text block composed of all the sentences in each cluster is a subtopic. In this embodiment, agglomerative clustering is used to cluster the sentences, and the steps are:

[0105] a. Initially, each sentence forms its own cluster, giving k clusters in total;

[0106] b. The two clusters with the largest similarity value among the existing k clusters c 1 a...
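The sentence-clustering steps can be sketched as follows. Because the step list is truncated, the linkage rule (average pairwise cosine similarity) and the stopping threshold are assumptions, as are the function names and the simple regular-expression sentence splitter. This is an illustration of agglomerative clustering in general, not the embodiment's exact procedure.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_sentences(text, threshold=0.2):
    """Group sentences into subtopics by agglomerative clustering.

    threshold is a hypothetical stopping value: merging stops once the
    most similar pair of clusters falls below it.
    """
    # Step 1: split the document into sentences (naive splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z0-9']+", s.lower())) for s in sentences]

    # Step a: every sentence starts as its own cluster.
    clusters = [[i] for i in range(len(sentences))]

    def similarity(c1, c2):
        # Average-link similarity between two clusters (assumed linkage).
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Step b: repeatedly merge the two most similar clusters.
    while len(clusters) > 1:
        best, (a, b) = max(
            (similarity(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]

    # Each cluster's sentences, in document order, form one subtopic.
    return [[sentences[i] for i in c] for c in clusters]
```

The pairwise similarities are recomputed on every merge to keep the sketch short; caching them in a matrix would be the usual optimization for longer documents.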

Abstract

To overcome the defects of the prior art, the method first obtains the subtopic structures of documents A and B through document structure analysis. It then builds a weighted bipartite graph G and solves for the Earth Mover's Distance EMD(A, B) using linear programming (LP). Finally, it obtains the similarity of the two documents as 1 - EMD(A, B). The invention improves decision accuracy and is more robust.
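A minimal sketch of the 1 - EMD(A, B) computation follows. The abstract names linear programming as the solver; to stay dependency-free, this sketch instead solves the same transportation problem with a successive-shortest-path minimum-cost flow, which reaches the same optimum. The subtopic weights (share of the document's words) and the edge distance (1 minus cosine similarity) are assumptions, since the abstract does not specify them, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def emd(weights_a, weights_b, dist, eps=1e-12):
    """Earth Mover's Distance between two weighted subtopic sets.

    Nodes 0..n-1 are subtopics of A, n..n+m-1 are subtopics of B,
    followed by a source and a sink. Weights are normalized to sum to 1.
    """
    n, m = len(weights_a), len(weights_b)
    sum_a, sum_b = sum(weights_a), sum(weights_b)
    src, snk, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]
    edges = []  # [target, residual capacity, cost]; edge e ^ 1 is its reverse

    def add_edge(u, v, cap, cost):
        graph[u].append(len(edges))
        edges.append([v, cap, cost])
        graph[v].append(len(edges))
        edges.append([u, 0.0, -cost])

    for i in range(n):
        add_edge(src, i, weights_a[i] / sum_a, 0.0)
        for j in range(m):
            add_edge(i, n + j, 1.0, dist[i][j])  # many-to-many matching allowed
    for j in range(m):
        add_edge(n + j, snk, weights_b[j] / sum_b, 0.0)

    total_flow = total_cost = 0.0
    while total_flow < 1.0 - 1e-9:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [math.inf] * size
        via = [-1] * size
        best[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == math.inf:
                    continue
                for e in graph[u]:
                    v, cap, cost = edges[e]
                    if cap > eps and best[u] + cost < best[v] - 1e-12:
                        best[v], via[v], changed = best[u] + cost, e, True
            if not changed:
                break
        if via[snk] == -1:
            break
        # Find the bottleneck along the path, then push that much flow.
        push, v = math.inf, snk
        while v != src:
            push = min(push, edges[via[v]][1])
            v = edges[via[v] ^ 1][0]
        v = snk
        while v != src:
            edges[via[v]][1] -= push
            edges[via[v] ^ 1][1] += push
            v = edges[via[v] ^ 1][0]
        total_flow += push
        total_cost += push * best[snk]
    return total_cost / total_flow if total_flow else 0.0


def document_similarity(subtopics_a, subtopics_b):
    """Similarity = 1 - EMD, with assumed weights and distances."""
    vec_a = [Counter(s.lower().split()) for s in subtopics_a]
    vec_b = [Counter(s.lower().split()) for s in subtopics_b]
    w_a = [sum(v.values()) for v in vec_a]  # weight: word count of the subtopic
    w_b = [sum(v.values()) for v in vec_b]
    dist = [[max(0.0, 1.0 - cosine(x, y)) for y in vec_b] for x in vec_a]
    return 1.0 - emd(w_a, w_b, dist)
```

Unlike one-to-one optimal matching, the flow from a single subtopic of A may be split across several subtopics of B. For example, `emd([1.0], [0.5, 0.5], [[0.2, 0.6]])` sends half of the single A subtopic to each B subtopic.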

Description

Technical field

[0001] The invention belongs to the technical field of computer language processing and information retrieval, and in particular relates to an improved document similarity measurement method based on document structure.

Background

[0002] Document similarity measurement is a core problem in the field of text information processing. Many text applications, including document clustering, document retrieval, and document filtering, rely on accurate measurement of document similarity. Many document similarity measures have been proposed and applied, such as the cosine measure, the Jaccard measure, and the Dice measure (reference: W. B. Frakes and R. Baeza-Yates: Information Retrieval: Data Structures and Algorithms, 1992), and methods based on information theory (reference: J. A. Aslam and M. Frost: An Information-theoretic Measure for Document Similarity. In Proceedings of SIGIR 2003), among which the cosine measure is the most widely used. ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 万小军, 彭宇新, 杨建武, 吴於茜, 陈晓鸥
Owner: PEKING UNIV