Multi-record type dynamic webpage information extraction method based on visual block

A webpage information extraction method, applied in digital data information retrieval, website content management, network data retrieval, etc., that addresses problems such as data extraction from dynamic pages, the complexity of webpage layouts, and low extraction accuracy on complex layouts.

Active Publication Date: 2019-08-02
ZHEJIANG UNIV OF TECH
Cites: 4; Cited by: 0

AI Technical Summary

Problems solved by technology

For example, in popular JavaScript web applications, most page components mount their DOM nodes only after the virtual DOM tree has been created, so methods that extract web page information from the HTML document or DOM information cannot rely solely on the HTML document source code for data extraction.
[0006] 4. The complexity of web page layout
Relatively mature methods exist for extracting information from single-record webpages, but the accuracy of existing methods on webpages with complex layouts is still not high, so an algorithm is needed to remove irrelevant content first and then extract the valuable information.
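
The dynamic-content problem described above can be illustrated with a minimal sketch. It assumes a hypothetical JavaScript-driven page whose records are created at runtime; `STATIC_SOURCE`, `RENDERED_DOM`, `TextCollector`, and `visible_text` are illustrative names, not part of the patent. The sketch only shows why the document source alone carries no records, not how the invention renders pages.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self.skip_depth == 0 and data.strip():
            self.texts.append(data.strip())


def visible_text(html):
    """Return the list of visible text fragments found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical document source: the records are created by the script at
# runtime, so the static markup contains only an empty mount point.
STATIC_SOURCE = """
<html><body>
<div id="app"></div>
<script>
  const app = document.getElementById("app");
  for (const t of ["Record A", "Record B"]) {
    const p = document.createElement("p");
    p.textContent = t;
    app.appendChild(p);
  }
</script>
</body></html>
"""

# Hypothetical DOM serialization after a browser has executed the script.
RENDERED_DOM = """
<html><body>
<div id="app"><p>Record A</p><p>Record B</p></div>
</body></html>
"""

print(visible_text(STATIC_SOURCE))  # no record text in the source
print(visible_text(RENDERED_DOM))   # record text appears only after rendering
```

The same collector finds nothing in the source but both records in the rendered DOM, which is why extraction has to start from a rendered page rather than from the raw HTML document.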



Embodiment Construction

[0067] The main purpose of web page data record extraction is to obtain valid data records from different web pages. The present invention creates a four-layer data extraction model for multi-record dynamic web pages, as shown in Figure 1, and, based on this model, proposes a multi-record dynamic web page data extraction scheme whose process state diagram is also shown in Figure 1.

[0068] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0069] Figure 2 is a flowchart of a method for extracting dynamic multi-record web page information according to one aspect of the present invention. As shown in the figure, the method includes the following steps:

[0070] Step 1: Web page parsing and rendering;

[0071] First, determine the target webpage and obtain its link address. Then parse and render the target webpage through a browser kernel or interface to obtain its visu...


Abstract

The invention discloses a multi-record dynamic webpage information extraction method based on visual blocks. The method comprises the following steps: step 1, parsing and rendering the webpage; step 2, constructing visual blocks and a visual block tree; step 3, preprocessing the page; step 4, identifying data record blocks; and step 5, extracting the webpage data records. The method has the advantages that data extraction from dynamic multi-record webpages can be completed without comparing or referencing multiple pages of the same website, while accuracy and precision are maintained at a high level; in addition, after training on one website, the method can generalize to different unseen websites.
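
To make steps 2 and 4 more concrete, here is a minimal sketch of a visual block tree and one possible way to spot a group of data record blocks. Everything in it is an assumption for illustration: the `VisualBlock` fields, the `similar` heuristic (equal left edge, widths within a tolerance), and the `min_records` threshold are hypothetical and are not taken from the patent, whose actual construction and identification rules are not given in this text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """A rendered page region: bounding box plus child blocks (hypothetical)."""
    name: str
    x: int
    y: int
    width: int
    height: int
    children: List["VisualBlock"] = field(default_factory=list)


def similar(a: VisualBlock, b: VisualBlock, tol: float = 0.1) -> bool:
    """Illustrative heuristic: same left edge and widths within `tol`."""
    if a.x != b.x:
        return False
    return abs(a.width - b.width) <= tol * max(a.width, b.width)


def find_record_blocks(root: VisualBlock, min_records: int = 3) -> List[VisualBlock]:
    """Return the children of the first block (pre-order) whose children all look alike."""
    stack = [root]
    while stack:
        block = stack.pop()
        kids = block.children
        if len(kids) >= min_records and all(similar(kids[0], k) for k in kids[1:]):
            return kids
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(kids))
    return []


# Hypothetical rendered layout: a header, a list of three items, a footer.
page = VisualBlock("page", 0, 0, 1000, 900, [
    VisualBlock("header", 0, 0, 1000, 100),
    VisualBlock("list", 100, 120, 800, 600, [
        VisualBlock("item-1", 100, 120, 800, 200),
        VisualBlock("item-2", 100, 320, 800, 200),
        VisualBlock("item-3", 100, 520, 800, 200),
    ]),
    VisualBlock("footer", 0, 800, 1000, 100),
])

records = find_record_blocks(page)
print([r.name for r in records])  # ['item-1', 'item-2', 'item-3']
```

The sketch works on a single rendered page, in line with the abstract's point that multiple pages of the same website do not need to be compared. A real implementation would likely use richer visual features than position and width.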

Description

Technical field

[0001] The invention relates to web page visual block construction and to a method for extracting information from dynamic multi-record web pages.

Background

[0002] The Internet has become one of the most abundant sources of data and information, including a large number of static and dynamic web pages, and the number of web pages is growing explosively. How to efficiently extract data from the deep web composed of these web pages is still a challenging problem. Existing methods have solved information extraction for most single-record webpages, but they remain limited on multi-record dynamic webpages by factors such as an unbounded number of samples, semi-structured pages, dynamic content, and layout complexity. These points are explained below:

[0003] 1. The supply of web pages is unbounded. For the foreseeable future, the number of websites will keep growing explosively. If the commonalities among these web pages cannot be mined efficiently, the da...


Application Information

IPC(8): G06F16/9532, G06F16/35, G06F16/958
CPC: G06F16/35, G06F16/9532, G06F16/972
Inventors: 梁朝凯, 闵勇
Owner ZHEJIANG UNIV OF TECH