Method, device, and electronic device for extracting target object in web page

A target object extraction method and related technology, applied in the field of data processing, which addresses problems such as incorrect determination of context information, inaccurate target objects, and duplicated data.

Active Publication Date: 2021-09-03
上海携宁计算机科技股份有限公司

AI Technical Summary

Problems solved by technology

[0003] However, in the related art, the Python module pandas is used to parse the tables in a web page. When a merged cell appears, its data is parsed into multiple repeated values, so the data read from the table contains repetitions. When the read data is then analyzed to extract the target object in the web page, the repeated data in the context of the target object causes the context information to be misjudged, and the extracted target object is inaccurate.



Examples


example 1

[0076]

example 2

[0078]

example 3

[0080]

[0081] In sub-step 402, a two-dimensional array is created according to the data in the first row, and the cells of the subtable are traversed to fill it, yielding a two-dimensional array that stores the two-dimensional table data. The number of columns in the two-dimensional array equals the number of columns in the first row of data.
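The array creation in sub-step 402 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `(text, colspan)` cell representation, the function names, and the rule that a cell merged across k columns counts as k columns are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical cell representation: (text, colspan). Not taken from the source.
Cell = Tuple[str, int]


def column_count(first_row: List[Cell]) -> int:
    """Count the columns of the first row.

    Assumption: a cell merged across k columns contributes k columns.
    """
    return sum(colspan for _text, colspan in first_row)


def create_grid(first_row: List[Cell], n_rows: int) -> List[List[Optional[str]]]:
    """Create an n_rows x column_count grid filled with None ("not written yet")."""
    n_cols = column_count(first_row)
    # Build each row separately so that rows do not share one list object.
    return [[None] * n_cols for _ in range(n_rows)]


first_row = [("Name", 1), ("Contact", 2)]
grid = create_grid(first_row, n_rows=3)
```

Building each row with its own list matters here: `[[None] * n_cols] * n_rows` would alias a single row, so writing one cell would appear to change every row.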

[0082] In an example, the cells of the subtable are traversed to obtain the text attribute value and the column merge attribute value of the currently traversed cell. Based on the column merge attribute value, C split cell data are determined in the row data of the two-dimensional table data. The C split cell data consist of one text attribute value and C-1 preset character strings. The two-dimensional table data is then obtained from the C split cell data in the row data.
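The splitting of a column-merged cell into C entries can be sketched as below. The empty string used as the preset character string, the position of the text value (first), and the helper names are assumptions for illustration; the source only states that the C entries are one text attribute value and C-1 preset strings.

```python
from typing import List, Tuple

# Hypothetical preset character string; the source does not say which string is used.
PLACEHOLDER = ""


def split_cell(text: str, colspan: int, placeholder: str = PLACEHOLDER) -> List[str]:
    """Return C entries for one cell: its text once, then C-1 placeholders."""
    c = max(1, colspan)  # treat a missing or zero span as a single column
    return [text] + [placeholder] * (c - 1)


def build_row(cells: List[Tuple[str, int]]) -> List[str]:
    """Concatenate the split entries of every cell in one table row."""
    row: List[str] = []
    for text, colspan in cells:
        row.extend(split_cell(text, colspan))
    return row


row = build_row([("Region", 1), ("Q1 revenue", 3)])
# row is ["Region", "Q1 revenue", "", ""]: the merged text appears exactly once.
```

Because the merged text is written once rather than copied into every spanned column, later analysis of the surrounding context does not see the same value repeated.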

[0083] In one example, the maximum row number occupied by the cells of the currently traversed subtable is obtained according to the value of the row merge attribute, and if the total number of rows of the two-dimensiona...
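The row arithmetic mentioned above can be sketched as follows. The source text is cut off before it says what happens when the array has too few rows, so this sketch only computes the last row a cell occupies and how many rows the array would be short; zero-based indexing and both function names are assumptions.

```python
def max_row_occupied(row_index: int, rowspan: int) -> int:
    """Last zero-based row index covered by a cell that starts at row_index."""
    return row_index + max(1, rowspan) - 1


def rows_short(total_rows: int, row_index: int, rowspan: int) -> int:
    """Number of rows the array lacks to hold the cell; 0 if it already fits."""
    return max(0, max_row_occupied(row_index, rowspan) + 1 - total_rows)


# A cell starting in row 2 with a row merge value of 3 covers rows 2, 3 and 4.
last_row = max_row_occupied(2, 3)
missing = rows_short(total_rows=4, row_index=2, rowspan=3)
```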


Abstract

The embodiment of the invention relates to the field of data processing, and discloses a method, a device, an electronic device, and a storage medium for extracting a target object in a web page. In the present invention, a sub-table of the web page is obtained, where a sub-table is a table in the web page that contains no nested tables; two-dimensional table data is obtained from the sub-table, where each split cell data item in the row data of the two-dimensional table data is either a preset character string or a text attribute value, and the number of split cell data items is determined from the merged cell attribute value; and the target object in the web page is extracted according to the two-dimensional table data. This embodiment reduces the duplication of data when reading merged cells, thereby improving the accuracy of entity extraction. In addition, the table data of the web page is read as text attribute values, thereby ensuring the accuracy of the read values.
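The first step, collecting tables that contain no nested tables together with each cell's text and merge attributes, can be sketched with Python's standard `html.parser` module. The class name, the dictionary layout, and the choice of parser are assumptions for illustration; the source does not say how the sub-tables are located.

```python
from html.parser import HTMLParser


class LeafTableFinder(HTMLParser):
    """Collect the rows of every <table> that contains no nested <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one dict per currently open <table>
        self.leaf_tables = []  # finished tables without nested tables
        self.cell = None       # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            if self.stack:
                # The enclosing table has a nested table, so it is not a sub-table.
                self.stack[-1]["has_nested"] = True
            self.stack.append({"rows": [], "has_nested": False})
            self.cell = None
        elif not self.stack:
            return  # ignore markup outside any table
        elif tag == "tr":
            self.stack[-1]["rows"].append([])
        elif tag in ("td", "th"):
            self.cell = {
                "text": "",
                "colspan": int(attrs.get("colspan") or 1),
                "rowspan": int(attrs.get("rowspan") or 1),
            }

    def handle_data(self, data):
        if self.cell is not None:
            self.cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None and self.stack:
            rows = self.stack[-1]["rows"]
            if not rows:
                rows.append([])  # tolerate a cell that appears without <tr>
            self.cell["text"] = self.cell["text"].strip()
            rows[-1].append(self.cell)
            self.cell = None
        elif tag == "table" and self.stack:
            table = self.stack.pop()
            if not table["has_nested"]:
                self.leaf_tables.append(table["rows"])


html = """
<table>
  <tr><td>
    <table>
      <tr><th colspan="2">Fund</th></tr>
      <tr><td>Code</td><td>Name</td></tr>
    </table>
  </td></tr>
</table>
"""
finder = LeafTableFinder()
finder.feed(html)
sub_tables = finder.leaf_tables
```

In this input the outer table only wraps the inner one, so only the inner table is returned; its header cell carries `colspan` 2, which a later step can use to decide how many split entries to write.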

Description

Technical field

[0001] The embodiments of the present invention relate to the field of data processing, and in particular to a method, device, electronic device, and storage medium for extracting target objects in web pages.

Background

[0002] In practice, there are a large number of web pages, and information is represented in many different forms across websites and across pages of the same website. In a large number of web pages, the information exists in the form of tables. In the related art, when a table in a web page is extracted, the table is parsed into a nested list through the Python module pandas. pandas is a NumPy-based tool created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to manipulate large datasets efficiently.

[0003] However, in the related art, the python module pandas is used to parse the tables in the...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F40/295, G06F16/951
CPC: G06F16/35, G06F16/951, G06F40/295
Inventor: 张浩波, 张学哲, 王小凤
Owner: 上海携宁计算机科技股份有限公司