Automatic extraction method oriented to data of deep web pages

A technology for extracting data from deep web pages, classified under electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as low efficiency, low accuracy, and the difficulty of generating and maintaining wrappers.

Status: Inactive
Publication Date: 2012-09-12
CHONGQING UNIV
Cites: 2 | Cited by: 5

AI Technical Summary

Problems solved by technology

[0002] At present, many automated or semi-automated tools have been developed in the field of deep web page data extraction. One class of methods obtains web page data extraction wrappers by learning from manually marked sample pages. Such methods require a lot of manual participation, which leads to a low degree of automation and makes wrappers difficult to generate and maintain. Another class is vision-based data extraction. It overcomes, to a certain extent, the dependence of existing methods on the original HTML page, but accurate visual information is also very difficult to obtain because pages are semi-structured or unstructured. At the same time, there is an automatic ...




Embodiment Construction

[0040] The present invention is further described below with reference to the accompanying drawing and an embodiment:

[0041] As shown in Figure 1, a method for automatically extracting deep web page data is carried out according to the following steps:

[0042] S1. Obtain two deep web pages from the same site, denoted page one and page two; use the HTML Tidy conversion tool to convert the HTML documents of page one and page two into XHTML documents;

[0043] S2. Perform noise removal on page one and page two;

[0044] S3. Perform repeated pattern elimination on page one and page two;

[0045] S4. Generate a web page data extraction wrapper;

[0046] S5. Perform noise removal on the page from which data is to be extracted;

[0047] S6. The web page data extraction wrapper first marks the page denoised in step S5, and then extracts data from the marked page;
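The overall flow of steps S2 and S4 to S6 can be illustrated with a minimal Python sketch. It is not the patent's algorithm: it assumes the pages are already well-formed XHTML (the output of step S1), treats `script`, `style`, and `iframe` elements as the only noise, and uses a deliberately simple wrapper, namely the list of element paths whose text differs between the two sample pages. The names `remove_noise`, `text_slots`, `generate_wrapper`, and `extract`, and the sample pages, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed noise set for illustration only; the source does not list noise tags.
NOISE_TAGS = {"script", "style", "iframe"}


def remove_noise(root):
    """Drop every element whose tag is in NOISE_TAGS (stand-in for S2 and S5)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in NOISE_TAGS:
                parent.remove(child)
    return root


def text_slots(elem, path=""):
    """Yield (path, text) pairs; a path is built from tag names and child indexes."""
    yield path or "/", (elem.text or "").strip()
    for index, child in enumerate(elem):
        yield from text_slots(child, f"{path}/{child.tag}[{index}]")


def generate_wrapper(page_one, page_two):
    """Simplified S4: paths present in both pages whose text differs are data slots."""
    one = dict(text_slots(remove_noise(ET.fromstring(page_one))))
    two = dict(text_slots(remove_noise(ET.fromstring(page_two))))
    return [path for path in one if path in two and one[path] != two[path]]


def extract(wrapper, page):
    """Simplified S5 and S6: denoise the target page, then read the wrapper's slots."""
    slots = dict(text_slots(remove_noise(ET.fromstring(page))))
    return [slots[path] for path in wrapper if path in slots]


# Hypothetical sample pages from "the same site": same template, different data.
PAGE_ONE = "<html><body><script>x()</script><h1>Books</h1><p>Dune</p><p>1965</p></body></html>"
PAGE_TWO = "<html><body><script>y()</script><h1>Books</h1><p>Emma</p><p>1815</p></body></html>"
NEW_PAGE = "<html><body><h1>Books</h1><p>Ulysses</p><p>1922</p></body></html>"

wrapper = generate_wrapper(PAGE_ONE, PAGE_TWO)
records = extract(wrapper, NEW_PAGE)
```

In this sketch the heading text "Books" is identical on both sample pages, so it is treated as template and left out of the wrapper, while the two `p` elements differ and become data slots. Step S3 is omitted here, and the marking stage of S6 is reduced to a path lookup.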

[0048] The repeated pattern elimination process described in step S3 is carried ...
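Because the description of step S3 is cut off in this text, the patent's actual procedure is not available here. As a generic illustration only of what eliminating repeated patterns in a page tree can mean, the Python sketch below collapses runs of adjacent sibling elements that share the same tag structure and keeps the first of each run. The functions `shape` and `collapse_repeats` and the sample list are hypothetical and should not be read as the claimed method.

```python
import xml.etree.ElementTree as ET


def shape(elem):
    """Structural signature: the element's tag plus the signatures of its children."""
    return (elem.tag, tuple(shape(child) for child in elem))


def collapse_repeats(elem):
    """Keep only the first of each run of adjacent siblings with the same shape."""
    previous = None
    for child in list(elem):
        collapse_repeats(child)      # collapse inside the child before comparing
        current = shape(child)
        if current == previous:
            elem.remove(child)       # same structure as the sibling just before it
        else:
            previous = current
    return elem


# Hypothetical result list with three structurally identical records.
listing = ET.fromstring(
    "<ul><li><b>a</b></li><li><b>b</b></li><li><b>c</b></li></ul>"
)
collapse_repeats(listing)
remaining = [item[0].text for item in listing]
```

Collapsing children before comparing siblings means two lists with different numbers of records reduce to the same structure, which is one plausible reason a later page-to-page matching step would become simpler.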


Abstract

The invention discloses an automatic extraction method for deep web page data and belongs to the field of computer data mining. The method first obtains two deep web pages from the same website, denoted the first page and the second page, and converts their HTML (hypertext markup language) documents into XHTML (extensible hypertext markup language) documents. It then removes noise from the first page and the second page and eliminates their repeated patterns to generate a web page data extraction wrapper. At extraction time, noise is first removed from the page containing the data to be extracted; the web page data extraction wrapper then marks the denoised page, and finally data is extracted from the marked page. The method improves the efficiency of the repeated pattern elimination algorithm and of the matching algorithm and reduces extraction complexity. The matching and extraction algorithms, which are designed around the characteristics of the repeated pattern elimination algorithm, are simple and fast, and data extraction accuracy is improved.

Description

Technical field

[0001] The invention belongs to the field of computer data mining, and in particular relates to a method for automatically extracting data from deep web pages.

Background technique

[0002] At present, many automated or semi-automated tools have been developed in the field of deep web page data extraction. One class of methods obtains web page data extraction wrappers by learning from manually marked sample pages. Such methods require a lot of manual participation, which leads to a low degree of automation and makes wrappers difficult to generate and maintain. Another class is vision-based data extraction. It overcomes, to a certain extent, the dependence of existing methods on the original HTML page, but accurate visual information is also very difficult to obtain because pages are semi-structured or unstructured. At the same time, there is an automatic data extraction method, which can automatically summarize t...


Application Information

IPC(8): G06F17/30
Inventors: 冯永, 王慧娟, 钟将, 周尚波, 李季
Owner: CHONGQING UNIV