Website page source code automatic crawling method

A web page source code crawling technology, applied in the field of web crawlers, that addresses problems such as the inability to crawl web page source code smoothly.

Pending Publication Date: 2018-10-16
SUN YAT SEN UNIV
Cites: 7; Cited by: 9

AI Technical Summary

Problems solved by technology

Measures that websites take against crawlers often prevent the source code of their web pages from being crawled successfully.



Embodiment Construction

[0026] The present invention is further described below in conjunction with a specific embodiment:

[0027] The method for automatically crawling the source code of website web pages described in this embodiment comprises the following steps:

[0028] S1. Determine the website that contains the target information, then analyze the website to determine which web pages hold the target information and which characteristics are unique to, and shared by, those pages;
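Step S1 can be illustrated with a minimal Python sketch. It assumes the "unique common characteristic" of the target pages is a shared URL pattern, which is only one possible reading; the source does not say what the characteristic is. The host name, path pattern, and function name are hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical site and URL pattern, chosen only for illustration.
TARGET_HOST = "example.com"
TARGET_PATH = re.compile(r"^/articles/\d+\.html$")


def is_target_page(url: str) -> bool:
    """Return True if the URL is on the chosen site and matches the shared pattern."""
    parts = urlparse(url)
    return parts.netloc == TARGET_HOST and TARGET_PATH.match(parts.path) is not None


# Candidate links, as a crawler might collect them from a page.
candidates = [
    "https://example.com/articles/1024.html",
    "https://example.com/about.html",
    "https://other.example.org/articles/7.html",
]

# Keep only the pages that share the characteristic identified in S1.
targets = [url for url in candidates if is_target_page(url)]
```

Restricting the crawl to one host and one page pattern keeps the crawler code short, since every fetched page can be parsed the same way.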

[0029] S2. Load the initial web page to be crawled and the URL of that page. (A URL, or uniform resource locator, is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is located and how the browser should handle it. Every web page on a website is a file, and they are all stored...
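The description of a URL as carrying both a location and an access method can be made concrete with Python's standard `urllib.parse` module. This is a sketch only: the start URL and the relative link are hypothetical, and the source does not specify how the crawler handles URLs.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical initial page for the crawl.
start_url = "https://example.com/articles/index.html?page=2"

parts = urlparse(start_url)
scheme = parts.scheme  # access method, e.g. "https"
host = parts.netloc    # server that stores the file
path = parts.path      # location of the file on that server
query = parts.query    # extra parameters passed with the request

# A relative link found on the initial page is resolved against the page's own URL.
next_url = urljoin(start_url, "1024.html")
```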


Abstract

The present invention relates to a method for automatically crawling website page source code. The method comprises the following steps: crawling web pages within a determined website, so that the crawled pages are more concentrated and share obvious common features, which makes them easier to handle when writing the crawler program; and crawling web pages within a specific website, so that the target information is more concentrated and the needed information can be crawled completely and quickly. When page source code is crawled, each request is disguised as a web page request issued by a browser, which prevents the website from recognizing that its data is being crawled by a program. A waiting time is set so that the code reports an error and stops running when the website or the network behaves abnormally and does not respond to the crawler program, which allows the code to run automatically and crawl page source code over a long period. A proxy IP address database is added so that, when the crawler's IP address is blocked and the website denies access, the program can automatically change its IP address and continue crawling page source code.
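The three safeguards named in the abstract (browser-like requests, a waiting time, and a pool of proxy IP addresses) might be combined as in the following Python sketch, which uses only the standard library. The header values, the 10-second timeout, the function names, and the choice to move to the next proxy after any failure are assumptions made for illustration; the source does not give these details.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# Assumed values for illustration; none of them come from the source.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
}
TIMEOUT_SECONDS = 10


def fetch_via_proxy(url: str, proxy: Optional[str]) -> str:
    """Fetch a page with browser-like headers, a timeout, and an optional proxy."""
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    # The timeout makes an unresponsive site raise an error instead of hanging.
    with opener.open(request, timeout=TIMEOUT_SECONDS) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_with_rotation(
    url: str,
    proxies: Sequence[Optional[str]],
    fetch: Callable[[str, Optional[str]], str] = fetch_via_proxy,
) -> str:
    """Try each proxy in turn and return the first page source that is fetched."""
    last_error = None
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError as error:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            last_error = error
    raise RuntimeError(f"all proxies failed for {url}") from last_error
```

A call such as `crawl_with_rotation(url, [None, "http://203.0.113.5:8080"])` would try a direct connection first and then the (hypothetical) proxy. The `fetch` parameter exists so the rotation logic can be exercised without network access.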

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, and in particular to a method for automatically crawling the source code of website web pages.

Background

[0002] With the rapid development of Internet technology, the volume of information on the network is growing explosively, which makes it increasingly difficult to find the data we need on the Internet. Statistically analyzing this diverse, real-time data to extract the valuable information behind it is highly worthwhile. Against this background, big data technology has developed rapidly in recent years and is increasingly used across industries. To analyze information using large amounts of data, obtaining and storing data from the network becomes particularly important.

[0003] At present, when people look for data, most of them search through search engines and then browse directly on the website. Although this method is ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 杨智, 陈锭敏
Owner: SUN YAT SEN UNIV