Regular expression based URL filtering method

A URL filtering and text filtering method, applied in special-purpose data processing and in retrieving web data by information identifiers. It addresses the problems that crawlers stop acquiring data, cannot obtain correct data, and have low coverage, with the aim of producing accurate search results.

Status: Inactive; Publication Date: 2016-02-03
孙燕群

AI Technical Summary

Problems solved by technology

[0003] At present, most web crawler programs obtain data based on page structure. The crawler fetches a web page document and parses it into a DOM tree, in which every element of the HTML document is represented by a node, and then constructs extraction rules against that tree to extract data. Because web information sources are heterogeneous, preserving extraction accuracy requires constructing separate extraction rules for each website. As a result, the coverage of the crawler program is very low, which greatly limits the network resources it can obtain. DOM-tree-based page acquisition can improve the efficiency of data acquisition and the utilization of system resources, but the extraction step depends on specific tag nodes of the page. Once the page structure changes or new tag naming rules are introduced, the crawler not only fails to obtain correct data but may stop acquiring data altogether.
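The dependence on specific tag nodes can be illustrated with a minimal sketch. It is not taken from the patent: the class names `article-body` and `post-content`, the helper `extract`, and the sample pages are all hypothetical, and Python's standard `html.parser` stands in for a full DOM parser.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects text inside <div> elements that carry a given class name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching <div>
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Track nesting so that inner <div> tags do not end the match early.
        if self.depth > 0 or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.found.append(data.strip())


def extract(html, wanted_class):
    """Hypothetical structure-based extraction rule: text of div.<wanted_class>."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.found


# The same content before and after a hypothetical site redesign.
old_page = '<html><body><div class="article-body">Hello</div></body></html>'
new_page = '<html><body><div class="post-content">Hello</div></body></html>'

old_result = extract(old_page, "article-body")   # rule matches the old layout
new_result = extract(new_page, "article-body")   # rule finds nothing after the rename
```

The rule tied to `article-body` silently returns nothing once the node is renamed, which is the failure mode the paragraph above describes.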


Examples

Embodiment Construction

[0020] The present invention will be further described below.

[0021] Figure 1 shows a flow chart of a URL filtering method according to an embodiment of the present invention. Referring to Figure 1, the URL filtering method of the present invention comprises the following steps. Step 110: acquire the URL to be crawled and the page corresponding to that URL. The URL to be crawled can be specified by the user or obtained through configuration files or scripts. Step 120: show the web page corresponding to the URL to be crawled to the user, and prompt the user to provide URL filtering rules and/or text filtering rules; the filtered URLs are then used for web page crawling. Step 130: in response to the URL filtering rules provided by the user, filter the URLs in the page based on those rules. As an example, the page obtained in step 110 may contain m...
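Step 130 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the patented implementation: the function `filter_page_urls`, the sample HTML, the `example.com` addresses, and the sample rule are hypothetical, and the sketch assumes that a URL is kept when the user's regular expression matches it (the excerpt does not say whether a match keeps or discards a URL).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def filter_page_urls(page_html, base_url, url_rule):
    """Return the page's links that match the user-supplied regex (assumed: match = keep)."""
    collector = LinkCollector(base_url)
    collector.feed(page_html)
    collector.close()
    pattern = re.compile(url_rule)
    return [url for url in collector.links if pattern.search(url)]


# Hypothetical page fetched in step 110.
sample_html = """
<a href="/news/2015/1234.html">Detail page</a>
<a href="/news/list_2.html">Catalog page</a>
<a href="http://ads.example.com/banner">Ad</a>
"""

# Hypothetical rule a user might enter in step 120: keep only dated detail pages.
kept = filter_page_urls(
    sample_html, "http://www.example.com/", r"/news/\d{4}/\d+\.html$"
)
```

Because the rule describes the URL itself rather than the page's tag structure, it does not depend on how the links are laid out in the HTML.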

Abstract

The invention discloses a regular expression based URL filtering method. The method comprises: step 1, obtaining a first URL required to be crawled, and crawling a page corresponding to the first URL; step 2, displaying text content of the page corresponding to the first URL and a plurality of second URLs, and prompting a user to input a URL filtering rule and a text filtering rule; step 3, in response to the URL filtering rule submitted by the user, filtering the plurality of second URLs by applying the URL filtering rule to obtain one or more third URLs; and step 4, adding the one or more third URLs into a crawling queue.
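Steps 3 and 4 of the abstract might look like the sketch below. The names `CrawlQueue` and `enqueue_filtered`, the sample URLs, and the sample rule are hypothetical. Skipping URLs that were already queued is an added assumption, since the abstract does not mention duplicate handling, and a regex match is again assumed to mean "keep".

```python
import re
from collections import deque


class CrawlQueue:
    """FIFO crawl queue that ignores URLs it has already accepted (assumption)."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def enqueue_filtered(second_urls, url_rule, queue):
    """Step 3: filter the second URLs by the rule. Step 4: add the result to the queue."""
    pattern = re.compile(url_rule)
    third_urls = [url for url in second_urls if pattern.search(url)]
    for url in third_urls:
        queue.add(url)
    return third_urls


queue = CrawlQueue()
second_urls = [
    "http://www.example.com/item/1.html",
    "http://www.example.com/about",
    "http://www.example.com/item/2.html",
    "http://www.example.com/item/1.html",   # repeated link on the same page
]
third_urls = enqueue_filtered(second_urls, r"/item/\d+\.html$", queue)
```

Here `third_urls` holds every matching link, including the repeat, while the queue holds each matching URL once.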

Description

Technical field: [0001] The invention relates to the technical field of network information processing, and in particular to a method for crawling network data with a crawler program according to a user-defined crawling scheme. Background technique: [0002] With the development of Internet technology, search engines have become an important way for people to obtain information. Existing search engines are all implemented based on a technology commonly known as a web crawler (Crawler). When a web crawler crawls, it is difficult to filter out the valuable information that is wanted. A crawled web page contains a large and varied set of URL links, and it is difficult to judge from the page source whether a link points to a catalog page or a detail page. The working principle of the crawler is that the search engine regularly executes the web crawler program, which starts from a specified initial URL list as the root of the search tree to access the webpage resources located...
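The seed-list traversal described in the background can be sketched as follows. The excerpt is truncated and does not state the visiting order, so breadth-first order is an assumption here. The function `crawl_order`, the `max_pages` limit, and the in-memory `link_map` (used in place of real network requests) are hypothetical.

```python
from collections import deque


def crawl_order(seed_urls, link_map, max_pages=100):
    """Visit pages breadth-first from the seeds; link_map stands in for fetching pages."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # A real crawler would download `url` and parse its links at this point.
        for next_url in link_map.get(url, []):
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return visited


# Hypothetical link graph: each key lists the URLs found on that page.
link_map = {
    "http://www.example.com/": [
        "http://www.example.com/a",
        "http://www.example.com/b",
    ],
    "http://www.example.com/a": [
        "http://www.example.com/c",
        "http://www.example.com/",
    ],
    "http://www.example.com/b": ["http://www.example.com/c"],
}
order = crawl_order(["http://www.example.com/"], link_map)
```

Without any filtering, every discovered link is followed, which is why catalog pages, detail pages, and unrelated links all end up in the same traversal.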

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/955, G06F16/9535
Inventor 孙燕群
Owner 孙燕群