Webpage table information extraction method and device

A webpage table information extraction technology, applied in digital data information retrieval, website content management, network data retrieval and similar fields, that addresses problems such as poor extraction results and numerous interference items.

Active Publication Date: 2020-10-20
SHANGHAI ICEKREDIT INC
9 Cites, 3 Cited by

AI Technical Summary

Problems solved by technology

Existing methods extract data poorly from a composite table, that is, one that combines a horizontal table (header in the first row) with a vertical table (header in the first column).
Moreover, existing extraction methods are mainly rule-based and do not clean the information in the extracted table cells, so the extracted content contains many interference items.
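
To make the horizontal/vertical distinction concrete, here is a minimal Python sketch. It is an illustration only, not the patented method: the function names and the sample rows are assumptions introduced here. It shows that a first-row-header rule reads a horizontal table correctly but pairs the wrong cells when applied to a vertical table.

```python
def read_horizontal(rows):
    """Header in the first row: every later row is one record."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def read_vertical(rows):
    """Header in the first column: every later column is one record."""
    # Transposing turns a vertical table into a horizontal one.
    return read_horizontal([list(col) for col in zip(*rows)])


# Hypothetical sample data holding the same information in both layouts.
horizontal = [["Name", "City"], ["Acme", "Shanghai"], ["Bolt", "Beijing"]]
vertical = [["Name", "Acme", "Bolt"], ["City", "Shanghai", "Beijing"]]

good = read_vertical(vertical)      # keys are the real headers
wrong = read_horizontal(vertical)   # data values end up as keys
```

A composite table mixes both layouts in one grid, so neither reader alone is sufficient. This is why the style of the table needs to be identified before extraction.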

Examples

Embodiment Construction

[0054] In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. It should be understood that the accompanying drawings serve only for illustration and description and do not limit the protection scope of the present invention. Additionally, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this disclosure illustrate operations implemented in accordance with some of the embodiments of the present invention. It should be understood that the operations of the flowcharts may be performed out of order, and steps that have no logical dependency on one another may be performed in reverse order or concurrently. In addition, those skilled in the art may add one or more o...

Abstract

The invention provides a webpage table information extraction method and device, and relates to the technical field of data information processing. The method comprises the following steps: first, cleaning the webpage data and detecting whether a web table exists in the cleaned webpage data; then, when a web table exists, identifying the style of the web table and extracting table information according to that style; and finally, identifying the extracted table information with an entity identification model and screening out the entity objects included in the web table. Because table information is extracted according to the style of the web table, the extracted information is more accurate. In addition, the extracted information is further identified and cleaned by the entity identification model, which reduces interference information in the extracted result.
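
The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. All names (`TableCollector`, `clean_html`, `identify_style`, `extract_records`, `screen_entities`, `extract_web_tables`) are hypothetical. The style is guessed from where `<th>` cells appear, which is only one possible heuristic. The `is_entity` callable is a stand-in for the entity identification model, whose details the text does not give. Nested and composite tables are not handled.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects each <table> as a list of rows; a cell is (tag, text)."""

    def __init__(self):
        super().__init__()
        self.tables, self._rows, self._row, self._cell = [], None, None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = [tag, ""]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse internal whitespace before storing the cell text.
            self._row.append((self._cell[0], " ".join(self._cell[1].split())))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def clean_html(html):
    """Step 1 (assumed cleaning rule): drop scripts, styles and comments."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    return re.sub(r"(?s)<!--.*?-->", "", html)


def identify_style(rows):
    """Step 2 (assumed heuristic): locate the header by <th> position."""
    if rows and all(tag == "th" for tag, _ in rows[0]):
        return "horizontal"  # header in the first row
    if rows and all(row[0][0] == "th" for row in rows):
        return "vertical"    # header in the first column
    return "unknown"


def extract_records(rows, style):
    """Step 3: turn the cell grid into header -> value records."""
    texts = [[text for _, text in row] for row in rows]
    if style == "vertical":
        texts = [list(col) for col in zip(*texts)]  # transpose
    elif style != "horizontal":
        return []
    header, *body = texts
    return [dict(zip(header, row)) for row in body]


def screen_entities(records, is_entity):
    """Step 4: keep only values the (placeholder) recognizer accepts."""
    return [{k: v for k, v in rec.items() if is_entity(v)} for rec in records]


def extract_web_tables(html, is_entity):
    parser = TableCollector()
    parser.feed(clean_html(html))
    parser.close()
    results = []
    for rows in parser.tables:          # empty list when no web table exists
        rows = [r for r in rows if r]   # ignore rows without cells
        style = identify_style(rows)
        records = extract_records(rows, style)
        results.append((style, screen_entities(records, is_entity)))
    return results
```

In practice, `is_entity` would wrap a trained named-entity recognizer. A trivial rule such as `lambda s: s not in {"", "-", "N/A"}` is enough to exercise the pipeline end to end.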

Description

Technical field

[0001] The present invention relates to the technical field of data information processing, and in particular to a method and device for extracting webpage table information.

Background technique

[0002] In the era of big data, massive amounts of open semi-structured and unstructured data are available on the Internet, and semi-structured data such as webpage table data is often of high value. However, web tables come in complex styles and contain many data interference items, which greatly increases the difficulty of information extraction.

[0003] Existing methods for extracting web table data generally use a web page parser to obtain a DOM tree containing table tags, and then extract the table either by applying filtering rules written for specific pages or by manually marking the table data. However, the effect of extracting data from a composite table that combines a horizontal table (header in the first row) and a vertical table (header in the first column) is not good. Moreover, t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/958, G06F16/215, G06F40/295, G06F40/177
CPC: G06F16/958, G06F16/215, G06F40/295, G06F40/177, Y02D10/00
Inventor: 顾凌云, 陈波, 王健健
Owner: SHANGHAI ICEKREDIT INC