
A web data similarity detection method based on two-level filtering of structure and content

A detection method using two-level (secondary) filtering, applied to similarity detection over Web data structure and content. It addresses two problems of existing methods: they do not make full use of the distribution-area characteristics of Web data, and they have difficulty finding approximate content blocks efficiently.

Inactive Publication Date: 2017-07-11
WUHAN UNIV
View PDF1 Cites 0 Cited by

AI Technical Summary

Problems solved by technology

For documents and data in the web space, similarity mining with existing general-purpose similarity detection methods has difficulty finding similar content blocks efficiently, because the structural characteristics of web data and the distribution-area characteristics of string content are not fully utilized.



Examples


Embodiment Construction

[0097] To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.

[0098] Referring to Figure 1, the technical solution adopted by the present invention is a Web data similarity detection method based on two-level filtering of structure and content. Building on traditional general-purpose similarity detection methods, it exploits the structural and content-distribution characteristics of Web data and applies two-level filtering to the document set under detection. The invention holds that documents containing similar web data should first of all be similar in structure, and if the structures of the two documents are very diff...
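The idea that structurally dissimilar documents can be discarded before any content comparison can be sketched as follows. This is only an illustration under stated assumptions: the text available here does not say how tag trees are compared, so the sketch assumes a Jaccard similarity over the set of root-to-node tag paths, and the names `tag_paths`, `structure_similarity`, `structure_filter` and the `threshold` value are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of one document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag: record its path but do not descend into it.
        self.paths.add("/".join(self.stack + [tag]))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structure_similarity(html_a, html_b):
    # Assumed measure: Jaccard similarity of the two tag-path sets.
    return jaccard(tag_paths(html_a), tag_paths(html_b))


def structure_filter(docs, threshold=0.5):
    """Return index pairs (i, j), i < j, that survive the structure filter."""
    paths = [tag_paths(d) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(paths[i], paths[j]) >= threshold]


doc_a = "<html><body><div><p>A</p><p>B</p></div></body></html>"
doc_b = "<html><body><div><p>C</p></div></body></html>"
doc_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"
print(structure_filter([doc_a, doc_b, doc_c]))
```

Only pairs that pass this cheap structural check would go on to key content extraction, which is where the efficiency gain of a first-level filter would come from.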


Abstract

The invention discloses a Web data similarity detection method based on two-level filtering of structure and content. Building on traditional general-purpose similarity detection methods, it exploits the distribution characteristics of the structure and content of Web data and applies two-level filtering to the document set under detection. The first level is structural similarity filtering: each Web document is modeled as a Tag tree so that structurally dissimilar documents can be removed; key content is then extracted from the remaining documents, expressed as tuple vectors, and the key information is concatenated to generate a set of character strings. The second level builds a Trie tree over the character strings produced by the first level, and similar strings are joined to obtain the final result. Multiple experiments show that the method can markedly improve the efficiency of data similarity detection in the web domain.
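The second-level filter can be illustrated with a small Trie over the strings produced by the first level. The abstract does not state how similar strings are identified within the Trie, so this sketch assumes a Levenshtein edit-distance threshold and uses the well-known technique of carrying one dynamic-programming row per Trie node, pruning any branch whose row minimum already exceeds the threshold. `TrieNode`, `build_trie`, `search`, `similar_pairs` and `max_dist` are hypothetical names, not taken from the patent.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.ids = []  # indices of the strings that end at this node


def build_trie(strings):
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root, query, max_dist):
    """Return {string index: edit distance} for entries within max_dist of query."""
    found = {}
    first_row = list(range(len(query) + 1))  # distance from the empty prefix

    def walk(node, ch, prev_row):
        # Extend the Levenshtein table by one row for the prefix ending in ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost))  # match or substitution
        if row[-1] <= max_dist:
            for i in node.ids:
                found[i] = row[-1]
        # Prune: no extension of this prefix can get back under the threshold.
        if min(row) <= max_dist:
            for c, child in node.children.items():
                walk(child, c, row)

    if first_row[-1] <= max_dist:  # empty strings stored at the root
        for i in root.ids:
            found[i] = first_row[-1]
    for c, child in root.children.items():
        walk(child, c, first_row)
    return found


def similar_pairs(strings, max_dist=2):
    """Return sorted index pairs (i, j), i < j, within max_dist edits of each other."""
    root = build_trie(strings)
    pairs = set()
    for i, s in enumerate(strings):
        for j in search(root, s, max_dist):
            if i < j:
                pairs.add((i, j))
    return sorted(pairs)


strings = ["web data similarity", "web data similarty",
           "trie tree filter", "web date similarity"]
print(similar_pairs(strings, max_dist=1))
```

Because strings with a common prefix share Trie nodes, the rows for that prefix are computed once per query rather than once per stored string. The recursion depth equals the longest stored string, so very long strings would call for an iterative traversal.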

Description

Technical field

[0001] The invention belongs to the field of similarity detection of Web data structure and content, and in particular relates to a Web data similarity detection method based on two-level filtering of structure and content.

Background

[0002] Web data is usually stored and exchanged in the form of XML or HTML, with the data organized through various tags. Because web data spreads quickly and the HTML page format is flexible, widely circulated pages are quite likely to be reposted or altered, so data on the same topic may be organized differently from page to page while remaining highly similar. Detecting and eliminating similar data is therefore valuable in data classification and clustering, data mining, focused search engines, pattern recognition and article detection.

[0003] Traditional Web data similarity detection methods mainly focus on the two fiel...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/93
Inventor: 李石君, 吴岳廷, 范珊珊, 张健, 李宇轩
Owner: WUHAN UNIV