Lightweight common webpage topic crawler method based on search engine

A search-engine-based topic crawler technique, applied in the field of information retrieval, that addresses the low accuracy and high implementation cost of topic-specific web crawling, and is low in cost and easy to implement.

Inactive Publication Date: 2013-09-18
FOCUS TECH
4 Cites, 28 Cited by

AI Technical Summary

Problems solved by technology

The first problem is the cost of crawling webpages in a specific field. Finding such webpages requires filtering a large number of pages, and a method built on general-purpose crawling makes the implementation cost too high.
The second problem is the accuracy of crawling webpages related to a specific topic: for each crawled page, the crawler must be able to determine more accurately whether it belongs to the topic.



Examples


Detailed Description of the Embodiments

[0020] The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention and the scope of protection is defined more clearly.

[0021] Referring to Figure 1, the present invention provides a novel search-engine-based lightweight webpage topic crawling method, comprising the following steps:

[0022] (1) A small vocabulary describing a specific topic is given as seeds, such as the abbreviation and full name of a commodity, and seed expansion rules are constructed for the field. For example, a commodity seed can be expanded into a series of seeds through brand rules, and an academic conference seed can be expanded into a series of seeds by year;

[0023] (2) The expanded seeds are converted into query words, and several candidate websites rela...
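Steps (1) and (2) can be illustrated with a minimal Python sketch. It assumes the simplest possible expansion rule, combining each seed with each value from a rule list (years for a conference), and turns each expanded seed into a quoted query string. The function names `expand_seeds` and `to_query`, the sample seeds, and the quoting convention are hypothetical and not taken from the patent; the search engine call itself is omitted because the source names no specific interface.

```python
from itertools import product


def expand_seeds(seeds, rule_values):
    """Combine every seed with every rule value (hypothetical expansion rule)."""
    return [f"{seed} {value}" for seed, value in product(seeds, rule_values)]


def to_query(seed):
    """Convert an expanded seed into an exact-phrase query string (assumed format)."""
    return f'"{seed}"'


# Illustrative input: abbreviation and full name of a conference, expanded by year.
conference_seeds = ["SIGIR", "Special Interest Group on Information Retrieval"]
years = ["2011", "2012"]

expanded = expand_seeds(conference_seeds, years)
queries = [to_query(seed) for seed in expanded]

for query in queries:
    print(query)
```

A brand rule for commodity seeds would follow the same pattern, with a list of brand names in place of the years.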



Abstract

A lightweight common webpage topic crawler method based on a search engine comprises the following steps: seed initialization, in which a small number of given seeds describing a specific topic are expanded into a series of seeds according to certain rules; website discovery, in which the initialized seeds are converted into query words and a number of related websites are obtained through the search engine's interface; website downloading, in which the related websites are downloaded to the local machine and stored in a database; webpage analysis, in which the downloaded websites are parsed to obtain the link information they contain; seed updating, in which newly crawled websites are analyzed, topic-related words are extracted from them, and new seeds are created from those words so that crawling can continue; and crawl updating, in which a re-crawl cycle is calculated from the update information of the crawled websites so that they are refreshed automatically and adaptively. The method is low in cost, simple to implement, efficient, and accurate.
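Two of the steps in the abstract, webpage analysis and crawl updating, can be sketched in Python as follows. The link extraction uses only the standard library. The abstract does not state how the re-crawl cycle is calculated, so the halve-on-change, double-on-no-change rule below is an assumed heuristic for illustration only, and the names `LinkExtractor`, `extract_links`, and `next_recrawl_interval` and the hour bounds are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in a downloaded page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def next_recrawl_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Assumed heuristic: revisit sooner if the page changed, later if it did not."""
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


page = '<a href="/a">x</a><a href="http://example.org/b">y</a>'
print(extract_links(page, "http://example.com/"))
print(next_recrawl_interval(24.0, changed=True))
```

Clamping the interval keeps frequently changing sites from being fetched in a tight loop and keeps static sites from dropping out of the schedule entirely.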

Description

Technical Field

[0001] The invention relates to the field of information retrieval, and in particular to a webpage topic crawler in information retrieval.

Background Art

[0002] The massive body of information on the World Wide Web continues to grow and change rapidly, and collecting it in a timely manner has always been a basic problem in information retrieval research and applications. Web crawlers are the classic technology for solving this problem. In many cases, people only need to search websites in a specific field or on a specific topic; the crawler technology that accomplishes this task is called a topic crawler. There are many websites across different fields, and automatically crawling the websites of a given field is the basis for building vertical search engines for that field and for applications such as domain-specific data mining and analysis.

[0003] Although there is a certain link relationship between websites in a specific field, it also d...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 姚瑞波, 周凤波, 翁强
Owner: FOCUS TECH