Method and system for extracting web page data

A method and system for extracting web page data, applied in the field of web page data crawling. It addresses crawling failures and the difficulty of coping with complex, changeable crawling environments, and it improves both adaptability and the crawl success rate.

Active Publication Date: 2015-07-22
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
Cites: 5; Cited by: 0

AI Technical Summary

Problems solved by technology

In recent years, as domestic Internet companies have advanced their internationalization strategies, search engines have faced growing requirements for cross-country crawling of webpage data. However, cross-country crawling is very complicated: for example, some sites can be crawled from certain countries but not from others.
The current solution is to crawl all countries from a single, unified computer room. This copes poorly with the complex and changeable crawling environment, causes a large number of crawling failures, and undermines the effectiveness of cross-country crawling of web page data.

Embodiment Construction

[0022] In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the present invention will be described in detail below with reference to the accompanying drawings.

[0023] Figure 1 is a flowchart of a method for crawling webpage data according to an embodiment of the present invention. Referring to Figure 1, the method includes:

[0024] S110: Select high-quality links that have not been crawled, where the high-quality links are links to webpages that meet the user's retrieval needs;

[0025] S120: Mark the network exit of the selected high-quality link;

[0026] In the embodiment of the present invention, the network exits include, but are not limited to, CDN (Content Delivery Network) exits in the United States, Japan, Thailand, Brazil, etc., and a default exit (for example, a Hong Kong exit).

[0027] S130: According to the result of the marking, distribute the selected high-quality links to the correspondin...

Abstract

The invention provides a method and system for extracting web page data. The method includes the following steps: selecting high-quality links that have not yet been extracted, where high-quality links are links pointing to web pages that meet users' search requirements; marking a network exit for each selected high-quality link; and, according to the marking results, allocating the selected high-quality links to the corresponding network exits so that web page data extraction can be carried out. This technical scheme improves adaptability to complex and variable extraction environments and thereby markedly increases the success rate of cross-country web page data extraction.
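The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The exit labels, the `Link` fields, and the rule of choosing an exit by a per-link country code are all assumptions made for illustration; the visible text does not say how a link's exit is decided, only that CDN exits in several countries and a default exit exist.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical exit labels. The source names only CDN exits in several
# countries and a default exit (for example, Hong Kong).
EXIT_BY_COUNTRY = {"US": "cdn-us", "JP": "cdn-jp", "TH": "cdn-th", "BR": "cdn-br"}
DEFAULT_EXIT = "default-hk"


@dataclass
class Link:
    url: str
    country: str            # assumed: country associated with the target site
    high_quality: bool      # assumed: quality judgement computed elsewhere
    crawled: bool = False
    exit_label: Optional[str] = None


def select_links(links):
    """S110: keep high-quality links that have not been crawled yet."""
    return [link for link in links if link.high_quality and not link.crawled]


def mark_exit(link):
    """S120: mark the link with a network exit, falling back to the default."""
    link.exit_label = EXIT_BY_COUNTRY.get(link.country, DEFAULT_EXIT)
    return link


def distribute(links):
    """S130: group marked links by exit so each exit gets its own queue."""
    queues = defaultdict(list)
    for link in links:
        queues[link.exit_label].append(link.url)
    return dict(queues)


def plan_crawl(links):
    """Run S110, S120 and S130 in order and return the per-exit queues."""
    return distribute([mark_exit(link) for link in select_links(links)])
```

Calling `plan_crawl` on a list of `Link` objects returns a dictionary that maps each exit label to the URLs that exit should fetch. Links with no matching country fall through to the default exit.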

Description

Technical field

[0001] The present invention relates to the field of communications and, more specifically, to a method and system for crawling webpage data.

Background

[0002] One of the basic functions of a search engine is crawling web page data. A search engine uses a program (a spider) to scan websites on the Internet according to certain rules and discovers web pages through their link addresses: starting from one page of a website, it reads the page content, finds the other link addresses in that page, and then follows those addresses to find the next pages, repeating the cycle. In recent years, as domestic Internet companies have advanced their internationalization strategies, search engines have faced growing requirements for cross-country crawling of webpage data. However, cross-country crawling of webpage data is very complicated. For example, some sites can be located in o...
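The spider loop described in the background (read a page, collect its link addresses, follow them, repeat) can be sketched in Python as a breadth-first traversal. This is a generic illustration, not the patent's method. The `fetch` callable and the `max_pages` limit are assumptions added so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url.

    fetch is a caller-supplied function (an assumption of this sketch) that
    returns a page's HTML as a string, or None if the page is unavailable.
    Returns the URLs whose content was read, in visiting order.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # crawl failure: skip the page and keep going
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` as a parameter keeps the traversal logic separate from how pages are retrieved, so a real HTTP client, or one bound to a particular network exit, could be substituted for the in-memory stand-in used in testing.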

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 吕明
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD