Crawler seed obtaining method and equipment and crawler crawling method and equipment

A crawler seed acquisition technique, applied in the field of search engines, that addresses the high time cost, poor coverage, and unsystematic screening and construction of crawler seeds, thereby reducing time cost, improving efficiency, and improving coverage.

Inactive Publication Date: 2012-02-15
BEIJING XINWANG RUIJIE NETWORK TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] In the prior art, crawler seeds are usually a handful of manually pre-specified URLs, and there is no systematic strategy for screening and constructing them. As a result, when crawling the entire web it takes a long time (usually six months to a year) to accumulate a large number of mainstream URLs, and because the number of crawler seeds is limited, coverage of mainstream URLs is also poor. The cost is high, and such a system is not easy to deploy and implement.

Examples

Embodiment approach

[0056] The search term dictionary in each embodiment of the present invention usually includes a large number of search terms, and each search term corresponds to one dynamic page request. The implementation of step 101 provided in this embodiment includes:

[0057] Step 1011, the crawler seed acquisition device first loads all search terms in the search term dictionary into its memory space.

[0058] Under normal circumstances, the search term dictionary is stored in an external storage space. In order to improve execution speed and efficiency, the crawler seed acquisition device preloads all search terms in the search term dictionary into its memory space.

[0059] Step 1012, the crawler seed acquisition device checks whether any search terms remain in its memory space; if so, it executes step 1013; if not, it executes step 1016.

[0060] In this embodiment, each time the crawler seed acquisition device generates a...
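Steps 1011 and 1012 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the one-term-per-line dictionary format, the function names `load_search_terms` and `build_requests`, the URL template `https://nav.example.com/search?q={term}`, and the choice to remove each term from memory once its request is built are all hypothetical.

```python
from collections import deque
from urllib.parse import quote


def load_search_terms(lines):
    """Step 1011 (sketch): load every search term into memory.

    `lines` is any iterable of strings, for example an open dictionary
    file with one term per line (an assumed format).
    """
    return deque(term.strip() for term in lines if term.strip())


def build_requests(terms, url_template):
    """Step 1012 (sketch): while search terms remain in memory, take one
    and build the dynamic page request that corresponds to it."""
    requests = []
    while terms:  # "are there search terms left in memory?"
        term = terms.popleft()  # assumption: a used term is dropped
        requests.append(url_template.format(term=quote(term)))
    return requests


# Hypothetical URL pattern standing in for the target navigation
# website's URL characteristics.
TEMPLATE = "https://nav.example.com/search?q={term}"
terms = load_search_terms(["news\n", "online shopping\n", "\n"])
requests = build_requests(terms, TEMPLATE)
```

Holding the terms in an in-memory queue mirrors the stated motivation of paragraph [0058]: the external dictionary is read once, and the loop condition becomes a cheap emptiness check.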

Abstract

The invention provides a crawler seed obtaining method and equipment and a crawler crawling method and equipment. The crawler seed obtaining method comprises the following steps: constructing a dynamic page request according to a preset search term dictionary and the URL (uniform resource locator) characteristics of a target navigation website; sending the dynamic page request to a server of the target navigation website; extracting, according to a preset extraction policy, target URLs from the search result page that the server returns in response to the dynamic page request, wherein each target URL is the main domain name address of a URL in the search result page; and deduplicating the target URLs to obtain unique target URLs, which are used as crawler seeds. With this technical scheme, abundant and widely dispersed crawler seeds can be provided, so that the time needed to accumulate mainstream URLs is shortened, the coverage of mainstream URLs is improved, and the time cost of crawling for the crawler system is reduced.
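The extraction and deduplication steps of the abstract can be illustrated with a short Python sketch. The abstract does not spell out the "preset extraction policy", so the policy used here (collect every `<a href>` and reduce it to scheme plus host) is an assumption, as are the names `LinkCollector`, `main_domain_address`, and `extract_seeds` and the sample page.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a result page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def main_domain_address(url):
    """Reduce a URL to scheme://host/ (assumed meaning of 'main domain
    name address'); return None for relative or non-HTTP links."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return f"{parts.scheme}://{parts.hostname}/"


def extract_seeds(result_page_html):
    """Extract target URLs and keep each one only once, in page order."""
    collector = LinkCollector()
    collector.feed(result_page_html)
    seeds, seen = [], set()
    for link in collector.links:
        target = main_domain_address(link)
        if target and target not in seen:
            seen.add(target)
            seeds.append(target)
    return seeds


# Hypothetical search result page returned by the navigation website.
SAMPLE_PAGE = """
<a href="http://www.example.com/news/1.html">A</a>
<a href="http://www.example.com/sports/">B</a>
<a href="https://shop.example.org/item?id=7">C</a>
<a href="/relative/link">D</a>
"""
seeds = extract_seeds(SAMPLE_PAGE)
```

For the sample page, the two `www.example.com` links collapse into a single seed and the relative link is skipped, leaving two seeds.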

Description

Technical field

[0001] The invention relates to search engine technology, in particular to a crawler seed acquisition method and equipment, and a crawler crawling method and equipment.

Background

[0002] A search engine is a system that uses specific computer programs to collect information from the Internet according to certain strategies and, after organizing and processing the information, provides users with a retrieval service and displays the relevant retrieval results to them.

[0003] At present, web crawling strategies can be divided into three types: depth-first, breadth-first, and best-first. Depth-first crawling often causes the crawler to become trapped, so breadth-first and best-first are the methods commonly used at present. The breadth-first search strategy means that, during crawling, the next level is searched only after the current level has been completed. The design and implement...
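The breadth-first strategy described in [0003] can be sketched as a queue-based traversal. This is a generic textbook formulation rather than the crawler described in the patent; the function `breadth_first_crawl`, the `get_links` callback, the `max_depth` limit, and the toy link graph are assumptions made for illustration, and no network access is involved.

```python
from collections import deque


def breadth_first_crawl(seeds, get_links, max_depth):
    """Visit pages level by level, starting from the crawler seeds.

    `get_links(url)` is a caller-supplied function returning the URLs
    linked from `url`. Seeds are assumed to be already deduplicated.
    """
    visited = set(seeds)
    order = []
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, depth = frontier.popleft()  # FIFO: finish a level first
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the deepest allowed level
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["a"],
}
order = breadth_first_crawl(["a"], lambda u: LINKS.get(u, []), max_depth=2)
```

Because the frontier is a first-in, first-out queue, every page at one level is dequeued before any page at the next level, which is the behavior the paragraph describes. A larger and more dispersed seed list simply widens the first level.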

Application Information

IPC(8): H04L29/08, G06F17/30
Inventor: 吴滨华, 王祖海
Owner BEIJING XINWANG RUIJIE NETWORK TECH CO LTD