Improved file similarity measure method based on file structure
A similarity measurement technique based on document structure, applied in fields such as instrumentation, computing, and electrical digital data processing.
Examples
Embodiment 1
[0064] As shown in Figure 4, each document is composed of several subtopics around a central theme, and each subtopic appears in the document as a text block, that is, a group of word strings or sentences reflecting that subtopic. There are many ways to obtain document subtopics, such as text block segmentation and sentence clustering. In preferred Embodiment 1 of the present invention, the text block segmentation method (TextTiling) is used to analyze the document structure. The process, shown in Figure 1, consists of the following steps:
[0065] 1. Read in the two documents X and Y to be compared, and use the text block segmentation method (TextTiling) to obtain their subtopic sequences X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m}. The specific steps are:
[0066] ① Segment the read document X into word strings of 20 words each, and the size of the word strin...
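The fixed-size word-string segmentation in step ① can be sketched as below. This is only a minimal illustration, not the patent's implementation: the function name `split_into_word_strings`, whitespace tokenization, and keeping a shorter final string are all assumptions, and the later TextTiling steps are not covered.

```python
def split_into_word_strings(text, size=20):
    """Split text into consecutive word strings of `size` words each.

    Assumptions (not from the source): tokens are whitespace-separated,
    and a final string shorter than `size` is kept rather than dropped.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical 45-word document used only for illustration.
document_x = " ".join(f"w{i}" for i in range(45))
word_strings = split_into_word_strings(document_x)
print([len(ws) for ws in word_strings])  # [20, 20, 5]
```

With a 45-word input, the sketch yields two full 20-word strings and one 5-word remainder.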
Embodiment 2
[0100] The second preferred embodiment of the present invention uses clustering technology to analyze the document structure, including the following steps:
[0101] 1. Read in the two documents X and Y to be compared, and use the clustering method to obtain the subtopic sequence of each document. The specific algorithm steps are:
[0102] ① Split the read document into n sentences;
[0103] ② Calculate the cosine similarity value between any two sentences;
[0104] ③ Cluster the sentences with a data clustering method; the text block formed by all the sentences in one cluster is a subtopic. In this embodiment, the agglomerative clustering method is used to cluster sentences, and the steps are:
[0105] a. Initially, each sentence forms its own cluster, giving k clusters in total;
[0106] b. The two clusters with the largest similarity value among the existing k clusters c_1 a...
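Steps ① to ③ can be sketched as follows. This is a hedged illustration under several assumptions that the text above does not state: sentences are split on `.`, `!`, and `?`; sentences are represented as raw term-frequency vectors; the two most similar clusters are merged at each iteration; cluster similarity is the average pairwise sentence similarity (average linkage); and merging stops once the best similarity falls below a hypothetical `threshold` parameter. All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter (assumption: ., ! or ? ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_sentences(sentences, threshold=0.2):
    """Agglomerative clustering; returns clusters as lists of sentence indices.

    Assumptions: average linkage, and a similarity `threshold` as the
    stopping rule. Neither is specified in the source text.
    """
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Step 2: cosine similarity between every pair of sentences.
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    # Step a: every sentence starts as its own cluster.
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Step b: find the pair of clusters with the largest similarity.
        best, pair = -1.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = linkage(clusters[p], clusters[q])
                if s > best:
                    best, pair = s, (p, q)
        if best < threshold:
            break
        p, q = pair
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


text = "Cats chase mice. Cats chase birds. Stocks fell sharply. Stocks fell again."
sentences = split_sentences(text)
print(cluster_sentences(sentences))  # [[0, 1], [2, 3]]
```

In this toy input, the two sentences about cats share two of three terms, as do the two about stocks, while the cross-topic pairs share none, so the sketch returns two subtopic clusters.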