Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system
A pseudo-static URL screening technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses the problems of reduced scanning efficiency and difficulty in identifying pseudo-static URLs.
Examples
Embodiment 1
[0078] As shown in Figure 1, the screening method for pseudo-static URLs of this embodiment includes the following steps:
[0079] S1: obtain a URL list to be tested in which a plurality of URLs are recorded;
[0080] S2: read a URL regular list, which includes a number of regular expressions, and build a database;
[0081] S3: select a URL from the URL list to be tested and match it against the regular expressions one by one. If it matches any regular expression, execute S8; if it fails to match all regular expressions, execute S41;
[0082] S41: search the database for URLs with the same path as the selected URL and treat them as same-path URLs; treat all other URLs in the database as different-path URLs;
[0083] S42: compare the parameters and parameter values of the selected URL with those of every same-path URL, one by one. If the comparison results with all same-path URLs show that the parameters are not the same...
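Steps S3 and S41 above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `matches_any` and `split_by_path`, the sample pattern, and the sample URLs are all hypothetical, and treating "same path" as "same host and same path component" is an assumption. Step S42 is truncated in this record, so its outcome is not sketched.

```python
import re
from urllib.parse import urlsplit


def matches_any(url, patterns):
    """S3 (sketch): True if the URL matches any regular expression in the list."""
    return any(p.search(url) for p in patterns)


def split_by_path(url, database):
    """S41 (sketch): partition database URLs into same-path and different-path lists.

    Assumption: two URLs share a path when host and path component are equal;
    the query string (parameters and values) is ignored at this step.
    """
    target = urlsplit(url)
    key = (target.netloc, target.path)
    same, different = [], []
    for known in database:
        parts = urlsplit(known)
        if (parts.netloc, parts.path) == key:
            same.append(known)
        else:
            different.append(known)
    return same, different


# Hypothetical sample data, for illustration only.
patterns = [re.compile(r"/article/\d+\.html$")]
database = [
    "http://example.com/list.php?id=1",
    "http://example.com/list.php?id=2",
    "http://example.com/about.php",
]

candidate = "http://example.com/list.php?id=3"
if matches_any(candidate, patterns):
    print("matched a regular expression: go to S8")
else:
    same_path, different_path = split_by_path(candidate, database)
    print("same-path URLs:", same_path)
```

Compiling the regular expressions once, when the URL regular list is read, avoids recompiling them for every URL taken from the list to be tested.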
Embodiment 2
[0108] Compared with the screening method of Embodiment 1, the webpage crawling method of the present embodiment differs only in that:
[0109] The webpage crawling method of this embodiment also includes step S9 and a step S0 carried out before the screening method, and its step S3 differs from that of Embodiment 1.
[0110] S0 is: read the initial URL and crawl the webpage text corresponding to it, then identify whether the page contains URL links dynamically generated by ajax or js. If not, extract the URLs directly from the text and add them to the URL list to be tested; if so, use the QTWebkit engine to dynamically simulate browser behavior, capture the dynamic URLs, and add the captured URLs to the URL list to be tested. Then execute S1.
[0111] S3 is: determine whether the URL list to be tested is empty; if so, execute S9; otherwise select a URL from the URL list to be tested and match the regular expressions o...
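The static branch of S0 (no ajax- or js-generated links) can be sketched with Python's standard library. This is an illustration under stated assumptions, not the patent's code: the names `LinkExtractor` and `extract_links` are hypothetical, only `<a href>` links are collected, and the page text is passed in as a string rather than fetched. The dynamic branch, which the source says uses the QTWebkit engine, is not sketched here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """S0, static case (sketch): return the URLs found in the page text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page text, for illustration only.
page = '<a href="/list.php?id=1">one</a> <a href="http://example.com/a.html">two</a>'
urls_to_test = extract_links(page, "http://example.com/")
print(urls_to_test)
```

The returned list plays the role of the URL list to be tested that S1 then obtains.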
Embodiment 3
[0115] As shown in Figure 2, the screening system for pseudo-static URLs of this embodiment includes:
[0116] URL list module 1, used to obtain a URL list to be tested in which multiple URLs are recorded;
[0117] Regular list module 2, used to set up a database and read a URL regular list, which includes several regular expressions;
[0118] Regular expression matching module 3, used to select a URL from the URL list to be tested whenever that list changes and to match it against the regular expressions one by one; if it matches any regular expression, the second update module is enabled, and if it fails to match all regular expressions, the URL path classification module is enabled. Here, a change in the URL list to be tested refers to the following two situations: a new URL regular list is read, and an original URL is removed;
[0119] URL path classification module 4, used to search the datab...