Web page structural data extraction method and system

A web page structured data extraction technology, applied in the field of network information, that addresses problems such as configuration requirements that limit the range of users, failure to fully meet the needs of intelligent data collection, and the lack of effective web crawler methods, so that structured data can be extracted even as page structures keep changing.

Inactive Publication Date: 2009-10-21
上海光华如新信息科技股份有限公司
Cites: 0 | Cited by: 46

AI Technical Summary

Problems solved by technology

However, producing these configuration files often requires personnel who are very familiar with web design; this demands computer software knowledge from users and limits the range of users.
[0007] 4. Different websites have different structures, such as channels, sections, depth, advertisements, and information of interest to users. There is currently no web crawler that lets users choose what they want and then automatically collects the relevant pages.
[0008] 5. Web pages make heavy use of JavaScript, and current web crawlers still lack effective methods for extracting structured data controlled by JavaScript.
[0009] As the range of web crawler applications keeps expanding, higher requirements are placed on structured data extraction from web pages, and existing web crawler technologies and products cannot fully meet the needs of more intelligent data collection.


Examples

Embodiment Construction

[0041] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0042] Figure 1 shows a block diagram of the entire system. The human-computer interaction module is connected with the page-of-interest recording module, and the user calls the page-of-interest recording module through the human-computer interaction module. The human-computer interaction module is also connected with the regular expression training module. The regular expression training module is connected with the JavaScript parsing module: when the training web page contains a JavaScript script, the regular expression training module calls the JavaScript parsing module. The web page acquisition module is connected with the agent (proxy) scheduling module: when web page collection needs to pass through a proxy, the web page acquisition module calls the agent scheduling module. The web page acquisition module is connected with the structure T...
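The two conditional calls described above (the proxy is used only when collection requires one, and JavaScript parsing only when the training page contains a script) can be sketched roughly as follows. This is an illustrative sketch, not the patent's implementation: the class names `ProxyScheduler`, `PageFetcher` and `RegexTrainer`, the round-robin proxy choice, the `<script` substring check, and the injected `download` and `render_js` callables are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProxyScheduler:
    """Stand-in for the agent (proxy) scheduling module.

    Round-robin selection is an assumption; the source does not say
    how a proxy is chosen.
    """
    proxies: List[str]
    _next: int = 0

    def pick(self) -> str:
        proxy = self.proxies[self._next % len(self.proxies)]
        self._next += 1
        return proxy


@dataclass
class PageFetcher:
    """Stand-in for the web page acquisition module."""
    download: Callable[[str, Optional[str]], str]  # (url, proxy) -> html
    scheduler: Optional[ProxyScheduler] = None

    def fetch(self, url: str, use_proxy: bool = False) -> str:
        # The scheduler is consulted only when a proxy is required.
        proxy = self.scheduler.pick() if (use_proxy and self.scheduler) else None
        return self.download(url, proxy)


@dataclass
class RegexTrainer:
    """Stand-in for the regular expression training module."""
    render_js: Callable[[str], str]  # hypothetical JavaScript parsing hook

    def prepare(self, html: str) -> str:
        # Hand the page to the JavaScript parser only if it contains a script.
        if "<script" in html.lower():
            return self.render_js(html)
        return html
```

Passing `download` and `render_js` in as callables keeps the sketch free of any particular HTTP client or JavaScript engine, neither of which is named in the text.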

Abstract

The invention relates to a web page structural data extraction method which is characterized by comprising the following steps: choosing a training web page content set and extracting target structural data; training the training web page content set to obtain a regular expression matched with the target structural data; writing the regular expression in a configuration module; utilizing the configuration module to collect a web page; and extracting the structural data from the collected web page. The web page structural data extraction method and system can realize the structural data extraction of static web pages and dynamic web pages and are suitable for acquiring various types of website information and extracting the structural data.
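As a rough illustration of the steps in the abstract (train on sample pages, obtain a regular expression, write it to a configuration, then extract from newly collected pages), here is a minimal sketch. The training heuristic is an assumption introduced for illustration, not the patented training procedure: it keeps the markup that all training pages share immediately before and after the target and captures whatever lies between. The function names, the `context` parameter and the JSON configuration layout are likewise hypothetical.

```python
import json
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def train_pattern(samples, context=40):
    """Derive a regex from (page_html, target_text) training pairs.

    Assumed heuristic: shared left and right context around the target
    becomes literal text, and the target itself becomes a capture group.
    """
    lefts, rights = [], []
    for html, target in samples:
        start = html.index(target)  # raises ValueError if the target is absent
        end = start + len(target)
        lefts.append(html[max(0, start - context):start])
        rights.append(html[end:end + context])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("training pages share no context around the target")
    return re.escape(left) + r"(.+?)" + re.escape(right)


def extract(html, config):
    """Apply every configured pattern to a collected page."""
    result = {}
    for field, pattern in config["fields"].items():
        match = re.search(pattern, html, re.DOTALL)
        result[field] = match.group(1) if match else None
    return result


# Train on two sample pages, store the pattern as JSON, then reuse it.
training_set = [
    ('<html><body><h1 class="title">Alpha</h1><p>x</p></body></html>', "Alpha"),
    ('<html><div><h1 class="title">Beta news</h1><div>y</div></div></html>', "Beta news"),
]
config_text = json.dumps({"fields": {"title": train_pattern(training_set)}})
config = json.loads(config_text)
new_page = '<html><section><h1 class="title">Gamma</h1><p>z</p></section></html>'
print(extract(new_page, config))
```

Because the pattern is built only from context common to every training page, adding more varied training pages tends to shrink the literal context and make the pattern more general, at the cost of more false matches.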

Description

Technical field

[0001] The invention relates to network information collection technology, in particular to a system and method for extracting structured data from web pages, and belongs to the technical field of network information.

Background technique

[0002] With the development of network information technology, the volume of information on websites, forums, blogs and other web pages keeps growing. Search engines, content analysis, public opinion analysis and other technologies all analyze and process this information, and all of them rely on web crawler technology. A web crawler, also known as a web spider, is a data collection method that automatically analyzes web page links, obtains information, and stores it locally. At present, not only search engines but also many other applications use web crawlers as their main source of data collection, such as intelligent analysis of network content; not only traditional keyword retrieval, but also structure...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 张世永, 吴承荣, 谢剑锋
Owner: 上海光华如新信息科技股份有限公司