Structure recognition based Web table information extraction method

A structure recognition technique for table information, applied in the field of Web information extraction, that reduces the number of string matching operations, speeds up recognition, and reduces redundant data

Inactive Publication Date: 2015-11-11
PLA PEOPLES LIBERATION ARMY OF CHINA STRATEGIC SUPPORT FORCE AEROSPACE ENG UNIV

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to solve the problem of extracting table information from the Web, in particular by providing an information extraction strategy for complex tables.



Embodiment Construction

[0035] The present invention proposes a method for extracting Web table information based on structure recognition. The method extracts table information correctly by first identifying the table structure quickly and accurately, and it effectively reduces redundant data in the extraction result. The complete process of the method is shown in Figure 5.

[0036] The operation of this method includes the following steps:

[0037] 1. Web table structure recognition

[0038] ① Heuristic rules (given a Web table)

[0039] Get the number of columns in the table, Get_Table.column.size();

[0040] If Table.column.size() is 2 or 3, and Table.row.size() is much larger than the number of columns (usually more than 2 times), the first column of the table holds the attribute cells;

[0041] // The same rule applies symmetrically to tables whose number of columns is much larger than the number of rows; in that case the first row of the table is the attribute row.

[0042] For the form that does ...
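The heuristic rule in [0039] to [0041] can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `classify_by_heuristic`, the list-of-rows table representation, and the `ratio` parameter are hypothetical, and the symmetric case is assumed to require 2 or 3 rows, which the text implies but does not state outright.

```python
from typing import List, Optional


def classify_by_heuristic(table: List[List[str]], ratio: float = 2.0) -> Optional[str]:
    """Guess where the attribute cells are from the table's shape alone.

    Returns "first_column", "first_row", or None when the rule does not apply.
    `ratio` encodes "much larger (usually more than 2 times)" from the text.
    """
    n_rows = len(table)
    n_cols = max((len(row) for row in table), default=0)
    if n_rows == 0 or n_cols == 0:
        return None

    # Tall, narrow table: 2 or 3 columns and many more rows than columns.
    if n_cols in (2, 3) and n_rows > ratio * n_cols:
        return "first_column"

    # Assumed symmetric case: 2 or 3 rows and many more columns than rows.
    if n_rows in (2, 3) and n_cols > ratio * n_rows:
        return "first_row"

    # Shape alone is not decisive; a later stage has to handle this table.
    return None


tall = [["Name", "Alice"], ["Age", "30"], ["City", "Paris"],
        ["Job", "Engineer"], ["Team", "Data"]]
wide = [["Year", "2010", "2011", "2012", "2013", "2014", "2015", "2016"],
        ["Sales", "1", "2", "3", "4", "5", "6", "7"],
        ["Cost", "1", "1", "2", "2", "3", "3", "4"]]
square = [["a", "b", "c", "d"] for _ in range(4)]

tall_result = classify_by_heuristic(tall)
wide_result = classify_by_heuristic(wide)
square_result = classify_by_heuristic(square)
```

A table whose shape matches neither case returns `None`, which corresponds to the tables the heuristic rules leave unrecognized.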


Abstract

The present invention relates to a structure recognition based Web table information extraction method. The method recognizes table structure in two progressive stages. First, a set of heuristic rules determines the structure of several common table types, which settles the structure type of most Web tables. Tables that the heuristic rules do not recognize are then processed by string matching, and the cells to be matched are restricted to the row or column containing the ULC (upper-left cell), so the content on which string matching must be performed is significantly reduced and matching and recognition efficiency improve. Finally, for two-dimensional tables, a strategy for processing synthesized cells during information extraction is proposed, which reduces the redundant data generated while ensuring that the relationships between data in the extraction result are not damaged.
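The idea of restricting string matching to the row and column that contain the ULC might look roughly like the Python sketch below. The source does not specify what the cells are matched against or how the result is decided, so the vocabulary `known_attributes`, the hit-counting rule, and the function name `match_attribute_direction` are all assumptions made for illustration only.

```python
from typing import List, Set


def match_attribute_direction(table: List[List[str]], known_attributes: Set[str]) -> str:
    """Compare only the ULC's row and column against a set of attribute names.

    Assumption: the side with more matches holds the attribute cells.
    Returns "first_row", "first_column", or "unknown".
    """
    if not table or not table[0]:
        return "unknown"

    vocabulary = {name.strip().lower() for name in known_attributes}

    # The ULC is table[0][0], so its row is row 0 and its column is column 0.
    ulc_row = table[0]
    ulc_col = [row[0] for row in table if row]

    row_hits = sum(1 for cell in ulc_row if cell.strip().lower() in vocabulary)
    col_hits = sum(1 for cell in ulc_col if cell.strip().lower() in vocabulary)

    if row_hits > col_hits:
        return "first_row"
    if col_hits > row_hits:
        return "first_column"
    return "unknown"


people = [["Name", "Age", "City", "Email"],
          ["Alice", "30", "Paris", "a@example.com"],
          ["Bob", "41", "Rome", "b@example.com"],
          ["Carol", "25", "Oslo", "c@example.com"]]
direction = match_attribute_direction(people, {"name", "age", "city", "email"})
```

For a table with R rows and C columns, this inspects at most R + C cells instead of all R × C, which is the source of the reduction in matching work that the abstract describes.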

Description

Technical field

[0001] The invention belongs to the technical field of Web information extraction. It can be used to extract and store table information in Web documents, and it offers better processing capability for complex Web tables in which the relationships between data are difficult to understand.

Background technique

[0002] Information extraction is an important research direction in the field of data mining. The sheer volume of Web resources has made Web-oriented information extraction a current research hotspot in this field. Among the various forms of Web information, tables are an important form of data expression in Web documents and are usually used to organize the basic information and statistical data of the objects described. Because this structured data is highly valuable, the study of table data extraction is of great significance. However, HTML markup language is mainly used to display data and...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
CPC: G06F40/12, G06F40/14
Inventor: 刘东朱鸿乔李新明邢维艳李艺李亢王寿彪饶磊闫雪飞于少波李强
Owner: PLA PEOPLES LIBERATION ARMY OF CHINA STRATEGIC SUPPORT FORCE AEROSPACE ENG UNIV