XML (Extensible Markup Language) structural similarity measuring method based on frequently associated tag sequences

The method belongs to the field of measuring the structural similarity of XML documents. It addresses inaccurate structural similarity values and the resulting loss of accuracy, and aims to produce a more accurate similarity measure.

Active Publication Date: 2012-06-27
SHANGHAI DIAN TECH INC

AI Technical Summary

Problems solved by technology

[0009] Because the existing methods do not fully use the information contained in an XML document, the structural similarity they compute between documents is not accurate enough, which causes a loss of accuracy when the similarity is applied to XML document clustering or classification.

Embodiment Construction

[0044] For a given set of XML documents C, the procedure of the present invention for calculating the similarity between any two documents is shown in Figure 2 and includes the following steps:

[0045] 1. Preprocess the document set to obtain the tag sequence database TSDB. The processing flow is shown in Figure 3. During parsing, a path that occurs more than once in the same XML document appears only once in TSDB. In the figure, d_TS denotes the set of tag sequences contained in document d, and d.id denotes the identifier of document d.
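
The preprocessing step can be illustrated with a minimal sketch. It is not the patented implementation: the function names `tag_sequences` and `build_tsdb`, the sample documents, and the choice of Python's standard `xml.etree.ElementTree` parser are all assumptions made for illustration. Each document is reduced to its set of root-to-leaf tag paths, so a repeated path is stored only once per document.

```python
import xml.etree.ElementTree as ET

def tag_sequences(xml_text):
    """Return the set of root-to-leaf tag paths (d_TS) of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()                      # a set stores each distinct path once
    stack = [(root, (root.tag,))]      # iterative depth-first traversal
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:               # leaf element: the path is complete
            paths.add(path)
        for child in children:
            stack.append((child, path + (child.tag,)))
    return paths

def build_tsdb(documents):
    """Map each document id (d.id) to its set of tag sequences (d_TS)."""
    return {doc_id: tag_sequences(text) for doc_id, text in documents.items()}

# Hypothetical sample documents; d1 repeats the path book/author/name.
docs = {
    "d1": "<book><title/><author><name/></author><author><name/></author></book>",
    "d2": "<book><title/><price/></book>",
}
tsdb = build_tsdb(docs)
```

In this sketch a tag sequence is a tuple of tag names, so `("book", "author", "name")` has length 3, and `d1` contributes that path to the database only once even though it occurs twice in the document.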

[0046] A tag sequence is an ordered list composed of multiple tags from the tag set. The order of the tags is the order along a path from the root node to a leaf node in the tag tree corresponding to the XML document. A tag sequence α can be formally expressed as α = <a_1, a_2, ..., a_n>, where each a_i is a tag in the tag set; the number of tags it contains is called the length of the tag sequence, and the tag sequenc...

Abstract

The invention discloses an XML (Extensible Markup Language) structural similarity measuring method based on frequently associated tag sequences. The method comprises the following steps: parsing an XML document set C to obtain a tag sequence database (TSDB); mining the set of all frequent tag sequences (FTS) from the TSDB; selecting the set of maximal frequent tag sequences (MFTS) from the FTS; transforming the TSDB to obtain a new database TSDB'; mining the closed frequently associated tag sequence sets from TSDB'; and representing each document in TSDB' by the closed frequently associated tag sequence sets it contains, then calculating the structural similarity between any two documents in the document set C. The method can improve the accuracy of clustering results.
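
The overall idea of comparing documents by shared frequent features can be sketched in a deliberately simplified form. This excerpt does not give the patent's definitions of support, of maximal or closed sequence sets, or of its similarity formula, so the sketch below substitutes a plainly different technique: it counts document support for whole tag sequences only and compares documents with the Jaccard coefficient. The names `frequent_sequences`, `jaccard`, `similarity`, the `min_support` parameter, and the sample data are all hypothetical.

```python
from collections import Counter

def frequent_sequences(tsdb, min_support):
    """Tag sequences that occur in at least min_support documents."""
    # Each document holds a set, so a sequence counts once per document.
    support = Counter(seq for seqs in tsdb.values() for seq in seqs)
    return {seq for seq, count in support.items() if count >= min_support}

def jaccard(a, b):
    """Jaccard similarity of two feature sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(tsdb, d1, d2, min_support=2):
    """Assumed measure: Jaccard over the frequent sequences of each document."""
    fts = frequent_sequences(tsdb, min_support)
    return jaccard(tsdb[d1] & fts, tsdb[d2] & fts)

# Hypothetical tag sequence database: document id -> set of tag sequences.
tsdb = {
    "d1": {("book", "title"), ("book", "author", "name")},
    "d2": {("book", "title"), ("book", "author", "name"), ("book", "price")},
    "d3": {("book", "title"), ("book", "price")},
}
sim_12 = similarity(tsdb, "d1", "d2")
sim_13 = similarity(tsdb, "d1", "d3", min_support=3)
```

Raising `min_support` shrinks the feature set: with a threshold of 3 only `("book", "title")` remains frequent in this sample, so `d1` and `d3` become indistinguishable. This illustrates why the choice of features matters for the resulting similarity values.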

Description

Technical field

[0001] The invention belongs to the technical field of data management and relates to a method for measuring the structural similarity of XML documents, in particular to a method that measures the structural similarity of XML documents by using frequently associated tag sequences mined from XML document collections as features.

Background

[0002] As the de facto standard for data representation and data exchange on the Internet, XML has been widely used. As the number of XML documents continues to grow, how to effectively store, filter, retrieve and manage XML data is becoming more and more important in the fields of databases and information retrieval. Many operations on XML need to measure the similarity between XML documents. The similarity measurement of XML documents has become a basic problem of many XML processing technologies and has been applied in many fields, such as semi-structured data integration, XML document classification / c...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventors: 张利军, 李战怀, 陈群, 李霞
Owner: SHANGHAI DIAN TECH INC