Webpage-crawling-based crawler technology

A crawler technology based on web page crawling that addresses, among other problems, the inability of general search engines to provide customized search services.

Status: Inactive; Publication Date: 2014-08-06
BEIJING INFCN INFORMATION TECH
3 Cites, 64 Cited by

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to provide a crawler technology based on web page crawling that overcomes the shortcomings of the prior art, addressing the technical problem that the returned content cannot meet users' needs.



Examples


Embodiment Construction

[0027] In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings.

[0028] Based on the Internet objects set by the user and the tasks the user creates, the invention crawls the corresponding resources from the Internet, rewrites URLs, and collects and stores Internet information in a targeted manner.

[0029] See Figure 1, which shows the process of the crawler technology based on web page crawling provided by the present invention. For ease of description, only the parts relevant to the present invention are shown.

[0030] After the URL link address is initialized, the crawler technology based on web page crawling includes the following steps:

[0031] 1) Crawler threads, allocated in a balanced manner, start from the given URL entry and read the URL link address at the head of the run queue...
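
As an illustration of step 1, the following is a minimal sketch in Python. The source does not spell out how threads are balanced, so this sketch assumes that balance is approximated by letting several threads pull from one shared FIFO run queue, so that whichever thread is idle takes the next URL. The names `run_workers` and `handle` are hypothetical and do not come from the patent.

```python
import queue
import threading


def run_workers(entry_urls, handle, num_threads=4):
    """Start num_threads crawler threads that share one run queue.

    Each thread repeatedly takes the URL at the head of the queue and
    passes it to handle(url), a caller-supplied (hypothetical) callback.
    Returns a list of (thread_name, url) pairs.
    """
    run_queue = queue.Queue()           # FIFO: get_nowait() returns the head
    for url in entry_urls:
        run_queue.put(url)

    processed = []
    lock = threading.Lock()             # protects the shared result list

    def worker():
        while True:
            try:
                url = run_queue.get_nowait()
            except queue.Empty:
                return                  # queue drained, thread exits
            handle(url)
            with lock:
                processed.append((threading.current_thread().name, url))

    threads = [
        threading.Thread(target=worker, name=f"crawler-{i}")
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

In this sketch the workers exit once the queue is empty; a crawler that keeps feeding newly discovered URLs back into the run queue would need a different termination condition.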


Abstract

The invention relates to the technical field of Internet information collection, in particular to a webpage-crawling-based crawler technology. After the URL (uniform resource locator) link addresses are initialized, the technology comprises the following steps: (1) crawler threads, allocated in a balanced manner, start from a given entry and read the URL link address at the head of the running queue; (2) it is determined whether the URL link address exists; if it does, it is not crawled; otherwise it is crawled and placed in a completion queue; (3) the webpages corresponding to the URL link addresses placed in the completion queue are extracted; (4) the URL link addresses in the extracted webpages are filtered, the effective ones are kept and written into the running queue, and the process returns to step (1). Based on objects set by users and tasks created by users, the technology crawls the corresponding resources from the Internet and rewrites and stores the URL link addresses so as to acquire Internet information in a targeted manner. In addition, multi-machine parallel crawling, multi-task scheduling, resumption of crawling from a breakpoint, distributed crawler management and crawler control can be implemented.
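
The four steps in the abstract can be sketched as a single-threaded loop in Python. This is an illustrative reading, not the patented implementation, and it rests on several assumptions: "exists" in step (2) is taken to mean "already in the completion queue"; an "effective" URL in step (4) is taken to mean an http(s) link on one allowed host; links are found with a simple `href` regular expression; and `fetch` is a caller-supplied function that returns the HTML for a URL. The names `crawl`, `fetch`, `allowed_host` and `max_pages` are hypothetical.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href="([^"]+)"')   # naive link finder, for the sketch only


def crawl(entry_url, fetch, allowed_host, max_pages=100):
    """Crawl from entry_url; return (completion queue, pages by URL)."""
    run_queue = deque([entry_url])
    done = []            # completion queue, in crawl order
    done_set = set()     # same URLs, for fast membership tests
    pages = {}

    while run_queue and len(done) < max_pages:
        url = run_queue.popleft()            # step 1: head of the run queue
        if url in done_set:                  # step 2: already handled, skip
            continue
        html = fetch(url)                    # step 2: otherwise crawl it
        done.append(url)
        done_set.add(url)
        pages[url] = html                    # step 3: keep the page content

        for href in HREF_RE.findall(html):   # step 4: filter extracted links
            link = urljoin(url, href)
            parts = urlparse(link)
            if (parts.scheme in ("http", "https")
                    and parts.netloc == allowed_host
                    and link not in done_set):
                run_queue.append(link)       # effective URL, back to step 1
    return done, pages
```

Because the run queue and the completion queue are plain data, writing both to disk between iterations would be one way to approach resumption from a breakpoint; the source does not describe how that feature is actually realized.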

Description

Technical field

[0001] The invention relates to the technical field of Internet information collection, in particular to a crawler technology based on web page crawling used for Internet information collection in fields such as archives, libraries and cultural centers.

Background technique

[0002] With the rapid development of the Internet, the World Wide Web has become a carrier of large amounts of information. As a tool that helps people retrieve information, search engines have become the entrance and guide for users accessing the World Wide Web. Current general search engines have certain limitations in information acquisition: they cannot provide customized services. Users in different fields and with different backgrounds often have different retrieval purposes and requirements, and the results returned by general search engines contain a lot of content that users do not care about. Since commercial search engines provide services to ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 尹科
Owner: BEIJING INFCN INFORMATION TECH