Automatic wrapper generation method for complex pages

An automatic wrapper generation technique, applied in fields such as instruments, program control devices, and special data processing applications. It addresses the high skill requirements of existing wrappers, their unsuitability for large-scale web data integration, and wrapper failures after page layout changes, and it achieves high extraction accuracy.

Inactive Publication Date: 2009-08-26
束兰
Cites: 0 | Cited by: 21

AI Technical Summary

Problems solved by technology

Current wrappers have two main disadvantages:
(1) Developing and using a wrapper requires considerable skill and manual effort, and much time is spent studying the structure of the web pages to be extracted. This approach does not scale to large-scale web data integration.
(2) Because a wrapper is tightly coupled to a specific data source, an existing wrapper may stop working if the page designer changes the layout of the original page.



Examples


Embodiment 1

[0039] Embodiment 1: Figure 1 shows the basic flow of the automatic wrapper generation system. The system consists of three parts: a data-rich area (DS) identification sub-module, a data record (DR) identification sub-module, and a wrapper generator sub-module.

[0040] Data-rich area (DS) identification sub-module: from the data point of view, a DS is the collection of data records on a Web page. A category list page contains not only the data record set area but also areas such as advertisement bars and navigation bars. By comparing the HTML tag trees of two pages (here, list pages) generated from the same template, the data-rich area that the user is interested in can be located quickly. Because the list page is generated from a predefined template, DRs usually appear on the page in a repeating form. Observation also shows that paging navigation often appears near the data-rich area. We ...
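The tag-tree comparison described in [0040] can be illustrated with a minimal sketch. It is not the patented procedure: it assumes that the data-rich area is the smallest subtree covering every node whose text or child structure differs between the two pages, and it ignores the paging-navigation cue. The names `Node`, `TreeBuilder`, `diff_paths`, and `data_rich_region` are hypothetical and introduced only for this illustration.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

class TreeBuilder(HTMLParser):
    """Builds a bare-bones tag tree (hypothetical helper, not from the patent)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag; ignore stray closing tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def diff_paths(a, b, path=()):
    """Child-index paths of nodes whose text or child structure differs."""
    if [c.tag for c in a.children] != [c.tag for c in b.children]:
        return [path]  # structure differs below this node; stop descending
    out = [path] if a.text != b.text else []
    for i, (ca, cb) in enumerate(zip(a.children, b.children)):
        out += diff_paths(ca, cb, path + (i,))
    return out

def data_rich_region(page_a, page_b):
    """Assumption: DS = smallest subtree of page A covering all differences."""
    root_a, root_b = parse(page_a), parse(page_b)
    paths = diff_paths(root_a, root_b)
    if not paths:
        return None
    prefix = paths[0]
    for p in paths[1:]:  # longest common prefix of all differing paths
        k = 0
        while k < min(len(prefix), len(p)) and prefix[k] == p[k]:
            k += 1
        prefix = prefix[:k]
    node = root_a
    for i in prefix:
        node = node.children[i]
    return node
```

Template parts that are identical on both pages, such as a fixed navigation bar, never appear among the differing paths and therefore stay outside the returned subtree. Content that changes between pages for other reasons, such as rotating advertisements, would widen the region, which is one reason a practical system needs additional cues.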


Abstract

The invention discloses a method for automatically generating a wrapper for complex pages. The method comprises the following steps: (1) acquiring two HTML page documents based on the same template and generating their HTML tag trees; (2) acquiring the minimum region (DS) that contains the data record set; (3) acquiring the initial data records (DR) from the minimum region; (4) recording the layout combination relation of the DRs according to the initial data records, determining the aggregation relation of the extraction items according to the similarity of their feature items, semantically annotating the entities in the same aggregation block using domain knowledge, and recombining new data records according to the semantic relations among the entities; (5) generating the extraction rule of each aggregation block according to the position in the HTML tag tree of the data records generated in step (4), and then constructing the wrapper. By analyzing the structural relations of the HTML tag tree, the invention can extract the true data record rules from complex pages and thereby automatically construct a wrapper with high extraction accuracy.
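Steps (3) and (5) of the abstract can be sketched in a heavily simplified form. The sketch assumes well-formed XHTML so that `xml.etree.ElementTree` can parse it, treats the most frequent child structure inside the region as the record layout, and skips the feature-similarity aggregation and domain-knowledge annotation of step (4) entirely. The rule format and the names `signature`, `find_records`, `build_rule`, and `apply_rule` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def signature(elem):
    """Structural signature: the element's tag plus its children's signatures."""
    return (elem.tag, tuple(signature(c) for c in elem))

def find_records(region):
    """Assumption: the most frequent child signature is the data record layout."""
    counts = Counter(signature(c) for c in region)
    if not counts:
        return []
    best, _ = counts.most_common(1)[0]
    return [c for c in region if signature(c) == best]

def build_rule(region_path, records):
    """Hypothetical extraction rule: record path plus expected field tags."""
    first = records[0]
    return {"records": f"{region_path}/{first.tag}",
            "fields": [c.tag for c in first]}

def apply_rule(rule, page):
    """Apply the rule to another page generated from the same template."""
    root = ET.fromstring(page)
    rows = []
    for rec in root.findall(rule["records"]):
        if [c.tag for c in rec] == rule["fields"]:  # skip non-record children
            rows.append([(c.text or "").strip() for c in rec])
    return rows

# Example pages invented for illustration only.
TRAIN = ("<html><body><ul>"
         "<li><b>Book A</b><i>10</i></li>"
         "<li><b>Book B</b><i>12</i></li>"
         "<li><a>Ad</a></li>"
         "</ul></body></html>")
OTHER = ("<html><body><ul>"
         "<li><b>Book C</b><i>15</i></li>"
         "<li><a>Ad</a></li>"
         "<li><b>Book D</b><i>9</i></li>"
         "</ul></body></html>")

region = ET.fromstring(TRAIN).find("body/ul")
rule = build_rule("body/ul", find_records(region))
rows = apply_rule(rule, OTHER)
```

Here the rule is derived from one page and then applied to a second page from the same template; the advertisement item is skipped because its child tags do not match the learned field layout. Matching on exact structure is brittle for records with optional fields, which is the kind of case the aggregation and annotation in step (4) are meant to handle.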

Description

Technical field

[0001] The invention relates to a method for identifying information in Web pages, and in particular to a method for automatically generating a wrapper that extracts deep web data from complex pages.

Background

[0002] Most Web pages on the Internet are presented as HTML, and the characteristics of HTML allow any organization or individual to publish information of varied content and rich form on the Web as they see fit. The semi-structured or even unstructured state of Web data makes Web pages suitable only for human browsing and makes it difficult for applications to directly parse and use the massive amount of valuable information on the Web. On the other hand, with the rapid development of the Internet and e-commerce, the "information explosion" has become an obstacle to obtaining information effectively. Therefore, it becomes more realistic and urgent to use computer to extract Web in...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F9/44
Inventors: 崔志明, 方巍, 赵朋朋
Owner: 束兰