Automatic webpage table data extraction method and device

An automatic table data extraction technology, applied in fields such as electronic digital data processing, special-purpose data processing applications, and semi-structured data retrieval, that achieves high practical value.

Inactive Publication Date: 2018-05-04
湖南星汉数智科技有限公司

AI Technical Summary

Problems solved by technology

[0006] Purpose of the invention: to address the technical problems of existing webpage table data extraction, namely overly simple rules, inability to process complex tables, poor generality, and low extraction efficiency, the invention provides a method and device for automatically extracting webpage table data. It can automatically extract complex webpage table data that contains multiple nested table tags and merged cells, without manual intervention and with high extraction efficiency and accuracy, and can provide a solid foundation for subsequent data mining work.



Examples


Embodiment 1

[0058] The automatic extraction method for webpage table data is explained below, taking the list of tutors of a school of astronautics as an example.

[0059] Table 1 A list of tutors in the School of Astronautics of a certain school

[0060]

[0061] The source code of the webpage shows that the table is built with two layers of nested table tags. Referring to Figure 1, the table data extraction process is described in detail below:

[0062] Step 1: Obtain the webpage content containing the table tag through jsoup or other webpage parsers, and parse the webpage content into a DOM tree structure.

[0063] Step 2: Layer the table data containing the Table tag in the DOM tree structure, and then filter layer by layer until the table data that needs to be processed is obtained; the specific process includes the following:

[0064] Step 2.1: use the outermost Table tag in the DOM tree structure as the first layer, use the nested Table tag in the first layer as the second layer...
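Steps 1 and 2.1 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text names jsoup, while this sketch uses Python's standard-library `html.parser` and tracks nesting depth during parsing instead of building a full DOM tree. The names `TableLayerParser`, `layer_tables`, and `SAMPLE`, and the `id` attributes in the sample markup, are hypothetical.

```python
from html.parser import HTMLParser


class TableLayerParser(HTMLParser):
    """Record the nesting layer of every <table> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <table> elements currently open
        self.layers = {}  # layer number -> ids of the tables found at that layer

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1  # outermost table is layer 1, its children layer 2, ...
            table_id = dict(attrs).get("id", "(no id)")
            self.layers.setdefault(self.depth, []).append(table_id)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def layer_tables(html):
    """Return a dict mapping each layer number to the table ids at that layer."""
    parser = TableLayerParser()
    parser.feed(html)
    parser.close()
    return parser.layers


# Assumed sample markup: a layout table wrapping the table that holds the data.
SAMPLE = """
<table id="outer"><tr><td>
  <table id="inner"><tr><td>Name</td><td>Title</td></tr></table>
</td></tr></table>
"""

print(layer_tables(SAMPLE))  # {1: ['outer'], 2: ['inner']}
```

With the layers recorded this way, a later filtering step could inspect each layer from the outside in. The filtering criteria themselves are not given in this excerpt, so they are left out of the sketch.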

Embodiment 2

[0098] The automatic extraction method for webpage table data is illustrated below, taking a webpage list of people as an example.

[0099] Table 2 List of people

[0100]

name    tom   Kity    Lucy    Tomas  Rome  Bloom
Age     30    23      34      37     35    31
Gender  male  female  female  male   male  male

[0101] As can be seen from the source code of the web page, the table has only one table tag. The following is a detailed description of the table data extraction process:

[0102] Step 1: Obtain the webpage content containing the table tag through html or other webpage parsers, and parse the webpage content into a DOM tree structure.

[0103] Step 2: Layer the table data containing the Table tag in the DOM tree structure, and then filter layer by layer until the table data that needs to be processed is obtained; the specific process includes the following:

[0104] Step 2.1: Use the outermost Table tag in the DOM t...
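The last stage of the method, turning the table matrix into records using the header type and the body starting position, can be sketched with the Table 2 data. This is a minimal sketch under assumptions: the function name `matrix_to_records`, the header type labels `"row"` and `"column"`, and the meaning of `body_start` as the index of the first data row or column are all hypothetical, and the header type is supplied by hand here because this excerpt does not give the detection rules.

```python
def matrix_to_records(matrix, header_type, body_start):
    """Convert a table matrix into a list of dicts keyed by header text.

    header_type: "row" if the headers sit in the first row,
                 "column" if they sit in the first column (assumed labels).
    body_start:  index of the first body row (or column) after the header.
    """
    if header_type == "column":
        # Transpose so that the headers end up in the first row.
        matrix = [list(col) for col in zip(*matrix)]
    elif header_type != "row":
        raise ValueError(f"unknown header type: {header_type!r}")
    headers = matrix[0]
    return [dict(zip(headers, row)) for row in matrix[body_start:]]


# Table 2 as a matrix; the headers are in the first column.
TABLE_2 = [
    ["name", "tom", "Kity", "Lucy", "Tomas", "Rome", "Bloom"],
    ["Age", "30", "23", "34", "37", "35", "31"],
    ["Gender", "male", "female", "female", "male", "male", "male"],
]

records = matrix_to_records(TABLE_2, header_type="column", body_start=1)
print(records[0])  # {'name': 'tom', 'Age': '30', 'Gender': 'male'}
```

Each resulting dict corresponds to one database row. The actual write to a database is omitted from the sketch.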

Embodiment 3

[0142] The automatic extraction method for webpage table data is explained below, taking a webpage land bidding table as an example.

[0143] Table 3 Bidding table for a certain parcel of land

[0144]

[0145] As can be seen from the source code, the table contains table tags. The following is a detailed description of the table data extraction process:

[0146] Step 1: Obtain the webpage content containing the table tag through a webpage parser, and parse the webpage content into a DOM tree structure.

[0147] Step 2: Layer the table data containing the Table tag in the DOM tree structure, and then filter layer by layer until the table data that needs to be processed is obtained; the specific process includes the following:

[0148] Step 2.1: Use the outermost Table tag in the DOM tree structure as the first layer, the nested Table tag in the first layer as the second layer, and so on;

[0149] Step 2.2: Filter the table data containing the Table tag layer by layer from the outsid...


Abstract

The invention discloses a method and device for automatically extracting webpage table data. The method includes: acquiring webpage content that contains Table tags and parsing it into a DOM tree structure; dividing the table data containing Table tags in the DOM tree structure into layers and filtering layer by layer until the table data to be processed is obtained; creating a table matrix sized by the maximum number of tr elements and td elements in the table data to be processed, traversing that table data, and inserting the table data and redundant data into the table matrix to form a new table matrix; reading the new table matrix row by row to obtain the header type and the starting position of the table body; and writing the data of the new table matrix into a database using the header type and the body starting position. The method solves the problem of extracting complex webpage table data that contains multiple nested table tags and merged cells, extracts table data quickly and accurately, and has high practical value.
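The table matrix step can be illustrated with a short sketch. It assumes that "redundant data" means copying a merged cell's text into every matrix position the cell spans, which is one plausible reading of the abstract rather than a confirmed detail. The function name `build_table_matrix` and the input format, rows of `(text, rowspan, colspan)` tuples, are hypothetical. The sketch also derives the matrix size from the filled positions instead of counting tr and td elements up front.

```python
def build_table_matrix(rows):
    """Expand merged cells into a rectangular matrix.

    rows: list of table rows; each row is a list of (text, rowspan, colspan).
    Every position covered by a merged cell receives a copy of its text.
    """
    filled = {}  # (row index, column index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in filled:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    # Positions that no cell covers are left as empty strings.
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Assumed example: "Name" spans two rows and "Contact" spans two columns.
cells = [
    [("Name", 2, 1), ("Contact", 1, 2)],
    [("Phone", 1, 1), ("Email", 1, 1)],
    [("Tom", 1, 1), ("123", 1, 1), ("tom@example.com", 1, 1)],
]

for line in build_table_matrix(cells):
    print(line)
# ['Name', 'Contact', 'Contact']
# ['Name', 'Phone', 'Email']
# ['Tom', '123', 'tom@example.com']
```

Once merged cells are expanded like this, every row has the same length, so the matrix can be read row by row without tracking spans.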

Description

Technical field

[0001] The invention relates to the field of Internet page processing, in particular to a method and device for automatically extracting webpage table data.

Background art

[0002] Among the massive data published on the Internet, webpage table data is often of high value because it expresses relational information concisely and effectively. However, the formats of webpage tables are complex and diverse, which greatly increases the difficulty of information extraction. Existing methods for extracting webpage table data generally use a webpage parser to obtain a DOM tree containing table tags, and then extract the table either by applying filtering rules written for a specific page or by manually marking the table data. These methods often suffer from overly simple rules, applicability to specific web pages only, and poor generality. They also struggle with tables that use multiple nested table tags and with complex tables that contain merged cells.

[0003] For example, th...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/22, G06F17/24
CPC: G06F16/84, G06F40/151, G06F40/18
Inventor: 赵青华王志超周维赫中翮曾琰王军武虹玲舒露
Owner: 湖南星汉数智科技有限公司