Web data extraction method based on visual customization of extraction template

A data extraction and template technology, applied in the fields of electrical digital data processing, special data processing applications, and instruments. It addresses the complexity and difficulty of web data extraction tasks, and offers user-friendly interaction, broad applicability, and improved extraction efficiency.

Active Publication Date: 2012-02-22
SHANDONG UNIV
Cites: 2; Cited by: 36

AI Technical Summary

Problems solved by technology

In some web data extraction tasks, attributes may be missing, or a single attribute in a record may have multiple values. When the semi-structured data in a web page also has a non-unique attribute order or spelling errors, the extraction task becomes more complex and difficult.




Embodiment Construction

[0053] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0054] As shown in Figure 1, the web data extraction method based on visual customization of the extraction template includes the following steps:

[0055] A. Template page preprocessing;

[0056] B. Visual customization of the extraction template;

[0057] C. Set the page batch extraction frequency;

[0058] D. Page batch extraction.

[0059] Step A, template page preprocessing, converts and displays the source code of the template page: the program parses the HTML source code to build the DOM tree of the template page in memory, converts it into XML format, and displays it in the user interface.
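The conversion described in step A can be illustrated with a minimal sketch that uses only the Python standard library. The class `DomBuilder`, the function `html_to_xml`, and the wrapping `document` root element are hypothetical names chosen for this illustration; the patent does not specify an implementation, and this sketch does no real error recovery for malformed HTML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Builds an in-memory element tree from HTML source (simplified)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the node, do not open it.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()  # assumption: surrounding whitespace is not needed
        if not text:
            return
        parent = self.stack[-1]
        if len(parent):  # text after a child element is stored as its tail
            last = parent[-1]
            last.tail = (last.tail or "") + text
        else:
            parent.text = (parent.text or "") + text


def html_to_xml(html_source: str) -> str:
    """Parse HTML into a tree and serialize that tree as an XML string."""
    builder = DomBuilder()
    builder.feed(html_source)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


xml_text = html_to_xml("<html><body><p>Price: <b>12</b><br>end</p></body></html>")
```

The resulting XML string can then be rendered in whatever tree or text widget the user interface provides; that display step is outside the scope of this sketch.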

[0060] The visual customization of the extraction template in step B refers to providing a drag-and-drop selection function on the user interface, and the user sets the corresponding re...
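One way to picture the result of step B is as a set of rules that tie a selected attribute label and data value on the template page to an attribute of the domain model. The sketch below is an assumption for illustration only: `FieldRule`, `path_to`, the positional path format, and the sample page are hypothetical, and the user's drag-and-drop selection is simulated by picking elements directly from the tree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FieldRule:
    domain_attribute: str  # attribute name in the domain model
    label_path: str        # where the attribute label sits on the template page
    value_path: str        # where the data value sits on the template page


def path_to(root, target):
    """Return a positional ElementTree path from root to target."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


# Hypothetical template page, already converted to XML.
page = ET.fromstring(
    "<document><table>"
    "<tr><td>Title</td><td>Example Book</td></tr>"
    "<tr><td>Price</td><td>12.50</td></tr>"
    "</table></document>"
)

# Simulated user selections: (label element, value element) per domain attribute.
rows = page.findall("./table/tr")
template = [
    FieldRule("title", path_to(page, rows[0][0]), path_to(page, rows[0][1])),
    FieldRule("price", path_to(page, rows[1][0]), path_to(page, rows[1][1])),
]
```

Storing positional paths is only one possible design. It is simple to apply to other pages with the same layout, but it is sensitive to layout changes between pages.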



Abstract

The invention discloses a Web data extraction method based on visual customization of an extraction template. The method comprises the following steps: A. template page preprocessing: converting and displaying the source code of the template page; B. visual customization of the extraction template: a drag-and-drop selection function is provided on the user interface, with which the user sets the correspondence between the attribute labels and data values on the template page and the attributes in a domain model, thereby establishing the extraction template; C. setting the page batch extraction frequency: the crawled HTML (Hypertext Markup Language) pages are batch-extracted once every 8 hours; and D. page batch extraction: the crawled HTML pages are batch-extracted with the corresponding extraction template, the semi-structured data are converted into structured data, and the structured data are stored in a local database.
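Steps C and D can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the `TEMPLATE` mapping, the functions `extract_record` and `extract_batch`, the `records` table, and the use of SQLite as the local database are all hypothetical, and the crawled pages are assumed to have already been converted to well-formed XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Step C: the abstract states one batch extraction every 8 hours.
EXTRACTION_INTERVAL_SECONDS = 8 * 60 * 60

# Hypothetical template: domain-model attribute -> path to its value in the page.
TEMPLATE = {"title": "./table/tr[1]/td[2]", "price": "./table/tr[2]/td[2]"}


def extract_record(page_xml, template):
    """Apply the template to one page; a missing attribute yields None."""
    root = ET.fromstring(page_xml)
    record = {}
    for attribute, path in template.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[attribute] = node.text.strip() if has_text else None
    return record


def extract_batch(pages, template, conn):
    """Step D: extract every page and store the rows in a local database."""
    columns = list(template)  # column names come from the trusted template
    column_defs = ", ".join(f"{c} TEXT" for c in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS records ({column_defs})")
    rows = []
    for page_xml in pages:
        record = extract_record(page_xml, template)
        rows.append(tuple(record[c] for c in columns))
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO records ({', '.join(columns)}) VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)


pages = [
    "<document><table><tr><td>Title</td><td>Book A</td></tr>"
    "<tr><td>Price</td><td>12.50</td></tr></table></document>",
    # Second page has no price row, so that attribute is stored as NULL.
    "<document><table><tr><td>Title</td><td>Book B</td></tr></table></document>",
]
conn = sqlite3.connect(":memory:")
stored = extract_batch(pages, TEMPLATE, conn)
```

A scheduler or a simple loop that sleeps for `EXTRACTION_INTERVAL_SECONDS` between calls to `extract_batch` would cover step C; it is left out here so the example finishes immediately.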

Description

Technical field

[0001] The invention relates to the extraction of Web pages and belongs to the field of computer applications; in particular, it relates to a method for extracting Web data based on visual customization of an extraction template.

Background

[0002] With the rapid development of Internet technology, the number of websites and web pages on the Web has grown explosively, making the Web a huge and widely distributed data source. Text, tables, and multimedia files such as pictures and videos are the main forms of Web information. Web data extraction extracts semantically consistent, structured numerical knowledge from Web data according to certain rules and builds a numerical knowledge repository to meet users' data query and data analysis needs. A lot of work has been done in the field of data extraction to automatically transform input web pages into structured data. Web data extraction is mainly used to gen...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 李庆忠, 闫中敏, 彭朝晖, 蔡益清
Owner: SHANDONG UNIV