Method for extracting data of webpage table

A technology for extracting table data from web pages. It addresses manual lookup of table source code, which wastes time and energy, is error-prone, and cannot keep up with real-time data. It simplifies the extraction method and improves the flexibility and accuracy of data extraction.

Active Publication Date: 2011-11-23
FUJIAN STAR NET COMM

AI Technical Summary

Problems solved by technology

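Manually locating a table in lengthy web page source code not only requires a huge workload, but also wastes time and energy. At the same time, searching in complicated web page code is prone to errors and cannot meet the real-time requirements of the data.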


Examples

Embodiment Construction

[0023] Referring to Figure 1, the method for extracting webpage table data of the present invention comprises the following steps:

[0024] Step 10: read the webpage source code, parse it into a W3C Document object according to its character encoding, and obtain any two keywords from the webpage table;
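Step 10 can be illustrated with a minimal sketch. It assumes the page is well-formed XHTML so that Python's standard `xml.dom.minidom` (a W3C DOM implementation) can parse it; the function name `parse_page`, the sample markup, and the two keywords are hypothetical and not taken from the patent.

```python
from xml.dom import minidom


def parse_page(raw: bytes, encoding: str = "utf-8"):
    """Decode the page bytes with the given encoding and parse them
    into a W3C-style DOM Document (assumes well-formed XHTML)."""
    text = raw.decode(encoding)
    return minidom.parseString(text)


# Hypothetical page source and keywords, for illustration only.
page = (
    "<html><body><table id='quotes'>"
    "<tr><td>Name</td><td>Price</td></tr>"
    "</table></body></html>"
)
doc = parse_page(page.encode("gbk"), "gbk")
keywords = ("Name", "Price")  # any two keywords that appear in the table

print(doc.documentElement.tagName)
```

Real-world pages are often not well-formed, so a tolerant HTML parser would probably be needed in practice; the strict parser is used here only to keep the sketch dependency-free.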

[0025] Step 20: traverse all nodes in the Document object depth-first to obtain the first node, to which the first keyword belongs, and the second node, to which the second keyword belongs; specifically:

[0026] Step 21: obtain the root node root of the Document object and record it as node;

[0027] Step 22: traverse each child node childNode of node and determine whether childNode is a leaf node; if so, obtain the value of childNode and go to step 23; otherwise traverse each child node of childNode, and if no keyword node has been found when that traversal completes, return to the parent node of node and continue to search ...
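The depth-first search of steps 21 and 22 can be sketched as follows. Because the text of step 23 is not shown here, the matching rule (the leaf's value contains the keyword as a substring) is an assumption, as are the function name `find_keyword_node` and the sample markup.

```python
from xml.dom import minidom


def find_keyword_node(node, keyword):
    """Depth-first search below `node`; return the first leaf node whose
    value contains `keyword`, or None if the subtree has no match."""
    for child in node.childNodes:
        if not child.hasChildNodes():
            # Leaf node: compare its value (assumed rule: substring match).
            value = child.nodeValue or ""
            if keyword in value:
                return child
        else:
            # Inner node: descend first, then fall back to the next sibling.
            found = find_keyword_node(child, keyword)
            if found is not None:
                return found
    return None


# Hypothetical well-formed page, for illustration only.
doc = minidom.parseString(
    "<html><body><p>intro</p><table>"
    "<tr><td>Name</td><td>Price</td></tr>"
    "</table></body></html>"
)
root = doc.documentElement            # step 21: start from the root node
first = find_keyword_node(root, "Name")
second = find_keyword_node(root, "Price")
print(first.parentNode.tagName, second.parentNode.tagName)
```

The recursion returns to the caller when a subtree has no match, which corresponds to going back to the parent node and continuing with its remaining children.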

Abstract

The invention provides a method for extracting data from a webpage table. The method comprises the following steps: 10, reading the webpage source code, parsing it into a W3C (World Wide Web Consortium) Document object, and acquiring any two keywords in the webpage table; 20, performing a depth-first traversal of all nodes in the Document object and acquiring the two nodes to which the two keywords respectively belong; 30, acquiring a common parent node of the two nodes that has a unique attribute, and using that unique attribute to obtain the positioning condition of the webpage table; and 40, filtering the webpage source code with the data positioning condition of the webpage table and extracting a table that has the same effect as the one displayed on the webpage. Given any two keywords from the table to be extracted and the required table row/column values, the method can accurately and quickly extract, from a webpage that changes in real time, a table with the same effect as the original webpage display and acquire the data of the designated rows/columns, which improves the flexibility and accuracy of data extraction.
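Steps 30 and 40 of the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the patented procedure: the "unique attribute" is taken to be an `id` value carried by exactly one element, the located DOM node is read directly instead of filtering the raw source text, and the helper names and sample markup are hypothetical.

```python
from xml.dom import minidom


def ancestors(node):
    """Element ancestors of `node`, nearest first."""
    chain = []
    node = node.parentNode
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        chain.append(node)
        node = node.parentNode
    return chain


def common_parent_with_unique_attr(doc, a, b, attr="id"):
    """Nearest common ancestor of `a` and `b` whose `attr` value occurs
    on exactly one element in `doc` (assumed meaning of 'unique')."""
    b_chain = ancestors(b)
    for cand in ancestors(a):
        is_common = any(cand is other for other in b_chain)
        if not is_common or not cand.hasAttribute(attr):
            continue
        value = cand.getAttribute(attr)
        holders = [e for e in doc.getElementsByTagName("*")
                   if e.getAttribute(attr) == value]
        if len(holders) == 1:
            return cand
    return None


def table_rows(container):
    """Text of every td/th cell under `container`, grouped by row."""
    rows = []
    for tr in container.getElementsByTagName("tr"):
        cells = [c for c in tr.childNodes
                 if c.nodeType == c.ELEMENT_NODE and c.tagName in ("td", "th")]
        rows.append(["".join(t.data for t in c.childNodes
                             if t.nodeType == t.TEXT_NODE) for c in cells])
    return rows


# Hypothetical page; the two keyword nodes are the header cell texts.
doc = minidom.parseString(
    "<html><body><div id='nav'><span>Home</span></div>"
    "<table id='quotes'><tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>ACME</td><td>12.5</td></tr></table></body></html>"
)
headers = doc.getElementsByTagName("th")
first, second = headers[0].firstChild, headers[1].firstChild

table = common_parent_with_unique_attr(doc, first, second)
condition = (table.tagName, "id", table.getAttribute("id"))  # positioning condition
rows = table_rows(table)
price_column = [row[1] for row in rows]
print(condition, rows, price_column)
```

Because the positioning condition is derived from the keywords rather than from a fixed position in the source, the same lookup can be repeated each time the page is fetched, provided the unique attribute remains stable.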

Description

[Technical field]

[0001] The invention relates to the technical field of webpages, in particular to a method for extracting webpage table data.

[Background]

[0002] With the continuous development of webpage technology, the display effects of webpages and the amount of information they contain are becoming more and more complex, and the structure and content of webpages are also updated in real time. To obtain specified table data from a web page, it is necessary to manually search a large amount of lengthy web page source code for the location, tags, and attributes of the table, so as to locate the source code corresponding to the table and obtain the table data. This not only requires a huge workload, but also wastes time and energy. At the same time, searching in complicated web page code is prone to errors and cannot meet the real-time requirements of the data.

[0003] W3C is the abbreviation of World Wide Web Consortium,...


Application Information

IPC(8): G06F17/30
Inventor: 杨凡, 黄建雄, 林珊
Owner: FUJIAN STAR NET COMM