Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system

A pseudo-static URL screening technology, applied in fields such as special data processing applications, instruments, and electrical digital data processing, that addresses the problems of reduced scanning efficiency and difficulty in identifying pseudo-static URLs.

Active Publication Date: 2015-09-09
SHANGHAI CTRIP COMMERCE CO LTD

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is to overcome the defect that crawlers in the prior art have difficulty identifying pseudo-static URLs, thus causing crawlers to extrac...



Examples


Embodiment 1

[0078] As shown in Figure 1, the pseudo-static URL screening method of this embodiment includes the following steps:

[0079] S1: obtain a list of URLs to be tested, in which multiple URLs are recorded;

[0080] S2: read a URL regular list, which contains a number of regular expressions, and build a database;

[0081] S3: select a URL from the list of URLs to be tested and match it against the regular expressions one by one. If it matches any regular expression, execute S8; if it matches none of them, execute S41;

[0082] S41: search the database for URLs that have the same path as the selected URL and treat them as same-path URLs; treat all other URLs in the database as different-path URLs;

[0083] S42: compare the parameters and parameter values of the selected URL with those of each same-path URL one by one. If the comparison results with all same-path URLs show that the parameters are not the same...
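Steps S3, S41 and the comparison at the start of S42 could be sketched roughly as follows in Python. This is an illustrative sketch, not the patent's implementation: the function names, the use of an in-memory list as the "database", and the decision to treat host plus path as "the same path" are all assumptions. What happens after the comparison is not shown, because the text of S42 is truncated here.

```python
import re
from urllib.parse import urlsplit, parse_qs


def matches_any(url, patterns):
    """S3 (sketch): True if the URL matches any regular expression in the list."""
    return any(re.search(p, url) for p in patterns)


def split_by_path(url, database):
    """S41 (sketch): partition stored URLs into same-path and different-path groups.

    Assumption: "same path" means the same host and the same path component.
    """
    target = urlsplit(url)
    key = (target.netloc, target.path)
    same, different = [], []
    for stored in database:
        parts = urlsplit(stored)
        if (parts.netloc, parts.path) == key:
            same.append(stored)
        else:
            different.append(stored)
    return same, different


def compare_params(url, other):
    """S42, comparison only (sketch): compare parameter names and values.

    Returns (same_names, same_values). The action taken on each outcome
    is left out because the source text is truncated.
    """
    a = parse_qs(urlsplit(url).query, keep_blank_values=True)
    b = parse_qs(urlsplit(other).query, keep_blank_values=True)
    same_names = set(a) == set(b)
    same_values = same_names and a == b
    return same_names, same_values
```

For example, with the hypothetical URL `http://example.com/list.php?id=3` and a stored URL `http://example.com/list.php?id=1`, the two fall into the same-path group and `compare_params` reports identical parameter names but different values.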

Embodiment 2

[0108] Compared with the screening method of Embodiment 1, the webpage crawling method of the present embodiment differs only in that:

[0109] The webpage crawling method of this embodiment additionally includes step S9 and a step S0 that is carried out before the screening method, and its step S3 differs from that of Embodiment 1.

[0110] S0: read the initial URL, crawl the webpage text corresponding to it, and identify whether the page contains URL links dynamically generated by Ajax or JS. If not, extract the URLs directly from the text and add them to the list of URLs to be tested; if so, use the QtWebKit engine to simulate browser behavior, capture the dynamic URLs, and add the captured URLs to the list of URLs to be tested. Then execute S1.

[0111] S3: judge whether the list of URLs to be tested is empty. If so, execute S9; otherwise, select a URL from the list of URLs to be tested and match the regular expressions o...
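The static branch of S0 (extracting URLs directly from the fetched page text) might look like the Python sketch below. It is only an illustration under stated assumptions: the class and function names are hypothetical, only `href` attributes of `<a>` tags are collected, and the dynamic branch that uses a browser engine to capture Ajax or JS generated links is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (static links only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of URLs found in the page text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned list would then be appended to the list of URLs to be tested before the screening steps run.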

Embodiment 3

[0115] As shown in Figure 2, the pseudo-static URL screening system of this embodiment includes:

[0116] URL list module 1, used to obtain a list of URLs to be tested, in which multiple URLs are recorded;

[0117] Regular list module 2, used to build a database and read a URL regular list, which contains a number of regular expressions;

[0118] Regular expression matching module 3, used, whenever the URL list to be tested changes, to select a URL from the URL list to be tested and match it against the regular expressions one by one; if it matches any regular expression, the second update module is enabled, and if it matches none of them, the URL path classification module is enabled. Here, a change in the URL list to be tested refers to the following two situations: a new URL regular list is read, or a URL is removed from the original URL regular list;

[0119] The URL path classification module 4 is used to search the datab...


Abstract

The invention discloses a screening method and system for pseudo-static URLs (Uniform Resource Locators) and a webpage crawling method and system. The screening method comprises the following steps: obtaining a list of URLs to be tested; reading a URL regular list and building a database; selecting one URL and matching it against the regular expressions one by one; classifying the URLs in the database by path; comparing the parameters and parameter values of the selected URL with those of the same-path URLs to determine whether a flag bit is set; determining, according to URL similarity and webpage structure similarity, whether the flag bit is set; and storing the URL in the database. Targeting the pseudo-static technology used by websites, the method identifies pseudo-static URLs automatically, filters out large numbers of repeated and useless pseudo-static URLs, and extracts the valuable URLs for security testing, improving crawling efficiency and accuracy.
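The abstract does not say how URL similarity or webpage structure similarity is measured. As one possible illustration only, the Python sketch below uses the standard library's `difflib.SequenceMatcher` for both: character-level similarity for URLs and similarity of opening-tag sequences for page structure. The metric, the function names, and the idea of reducing a page to its tag sequence are assumptions rather than the patent's method, and no threshold or flag-setting rule is implied.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records the sequence of opening tags as a rough structural fingerprint."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    return parser.tags


def url_similarity(a, b):
    """Character-level similarity of two URLs, in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def structure_similarity(html_a, html_b):
    """Similarity of two pages' opening-tag sequences, in the range [0, 1]."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()
```

Two pages generated from the same template with different text content would score 1.0 on structure under this sketch, which is the kind of signal a screening step could use to spot repeated pseudo-static pages.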

Description

Technical field

[0001] The invention relates to a pseudo-static URL screening method and system, and a webpage crawling method and system.

Background

[0002] With the rapid development of Internet technology, the era of static web pages has passed, and under the influence of the Web 2.0 model more and more websites are rapidly becoming dynamic and interactive. With the growing use of JS technology (JS is JavaScript, an object-based, event-driven client-side scripting language) and pseudo-static technology, crawlers that rely on the traditional approach of fetching web page source code are no longer adequate.

[0003] Large websites now hold more and more content. To improve access speed and obtain good search engine optimization, most websites use pseudo-static technology. Pseudo-static is defined relative to truly static. Pseudo-static technology actually uses dynamic script processing methods, but th...
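To make the idea of a pseudo-static URL concrete, the Python sketch below imitates a server-side rewrite rule: a path that looks like a static file is mapped internally to a dynamic script with a query parameter. The paths, the parameter name, and the single rule are hypothetical examples chosen for illustration; real sites configure such rules in their web server or framework.

```python
import re

# Hypothetical rule: /news/<digits>.html is served by /news.php?id=<digits>.
REWRITE_RULE = (re.compile(r"^/news/(\d+)\.html$"), r"/news.php?id=\1")


def rewrite(path):
    """Return the dynamic path behind a pseudo-static path, or the path unchanged."""
    pattern, target = REWRITE_RULE
    return pattern.sub(target, path)
```

Under this rule, `/news/123.html` and `/news/124.html` look like two static files to a crawler but are both produced by the same script, which is why a crawler that cannot recognize the pattern ends up fetching many near-duplicate pages.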


Application Information

IPC(8): G06F17/30
Inventor: 王笑天, 董晓琼, 罗启武
Owner SHANGHAI CTRIP COMMERCE CO LTD