Keyword based topic-focused web crawler design method

A design method for keyword-based, topic-focused web crawlers, applied to web data indexing, web data retrieval, and computing, with the stated effects of increasing the number, improving crawling efficiency, and improving universality.

Active Publication Date: 2017-05-24
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0007] The research and analysis above show that topic crawlers have been studied extensively, but how to make rational use of massive resource information, how to improve the topic relevance of crawled web pages, and how to filter out weakly relevant web pages still need further research.


Examples

Embodiment

[0098] Figure 1 is a flow chart of the producer thread implementation. The specific steps are as follows:

[0099] (1) Configure the description information of the domain ontology and use it as a template for the topic crawler. The description information includes topic keywords and crawling keywords.

[0100] (2) Determine the topic keyword set for "food safety" and obtain the food-safety topic keyword table foodsecureWord.

[0101] In this implementation, Baidu, Google, Bing, and 360 are used as search engines, and the topic is set to "food safety". Keywords related to food safety, such as "production safety standards", "food exceeding the standard", and "food additives", are stored in the database table foodsecureWord; this is the manual selection of topic keywords. These keywords are then used as search queries in the search engines, and the retrieved content is stored in a text file. Finally, after word segmentation and ...
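The keyword table and the search step can be illustrated with a minimal sketch, which is not the patent's own code. It assumes the foodsecureWord table is modeled as an in-memory list, and the endpoint URLs and query parameter names in `SEARCH_ENDPOINTS` are illustrative assumptions that do not come from the source. The helper `build_seed_urls` is likewise hypothetical.

```python
from urllib.parse import urlencode

# Stand-in for the foodsecureWord database table (modeled as a list here).
foodsecure_word = [
    "production safety standards",
    "food exceeding the standard",
    "food additives",
]

# ASSUMPTION: illustrative endpoints and query parameter names,
# not taken from the patent text.
SEARCH_ENDPOINTS = {
    "baidu": ("https://www.baidu.com/s", "wd"),
    "bing": ("https://www.bing.com/search", "q"),
}


def build_seed_urls(keywords, endpoints):
    """Return one search URL per (keyword, search engine) pair."""
    return [
        f"{base}?{urlencode({param: keyword})}"
        for keyword in keywords
        for base, param in endpoints.values()
    ]


seed_urls = build_seed_urls(foodsecure_word, SEARCH_ENDPOINTS)
```

Each keyword yields one seed URL per configured engine, so adding an engine or a keyword only changes the configuration, not the crawler logic.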

Abstract

The invention provides a design method for a keyword-based, topic-focused web crawler. The method comprises the following steps:

(1) Configure the search URL for a topic keyword to form the initial seed hyperlink originalURL.

(2) Using originalURL, search and download web pages from a search engine, and extract preliminary news fields from the page content.

(3) Using a topic correlation algorithm, compute the similarity between each news item and the topic; put the fields of topic-relevant news into the public queue newsQueue and filter out news that is not relevant to the topic.

(4) Download the content of the next page from its nextPageURL, extract the following nextPageURL and the relevant fields as in step (3), and put the relevant fields into newsQueue; repeat step (4) until no next-page hyperlink nextPageURL remains.

(5) Take URLs out of newsQueue and hand them to a crawler processing thread, that is, a consumer thread.

The method improves the crawling efficiency of the topic-focused web crawler and the effectiveness of the crawled URL resources.
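The producer and consumer flow described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The `relevance` function is a stand-in (the fraction of topic keywords found in a title), since the source does not specify the topic correlation algorithm here. The `News` and `Page` types, the `fetch` callback, the threshold, and the `FAKE_PAGES` data are all hypothetical.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class News:
    url: str
    title: str


@dataclass
class Page:
    news: list
    next_page_url: Optional[str]


def relevance(title, keywords):
    """ASSUMED stand-in score: fraction of topic keywords found in the title."""
    text = title.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def producer(original_url, fetch, keywords, threshold, news_queue):
    """Follow next-page links from the seed, queueing only relevant news."""
    url, seen = original_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against next-page cycles
        page = fetch(url)
        for item in page.news:
            if relevance(item.title, keywords) >= threshold:
                news_queue.put(item)
        url = page.next_page_url
    news_queue.put(None)  # sentinel: no more pages


def consumer(news_queue, handle):
    """Take news items off the queue and hand each URL to the handler."""
    while True:
        item = news_queue.get()
        if item is None:
            break
        handle(item.url)


# Hypothetical in-memory pages standing in for downloaded search results.
FAKE_PAGES = {
    "seed": Page([News("u1", "New food additives rule"),
                  News("u2", "Football results")], "p2"),
    "p2": Page([News("u3", "Food additives recalled")], None),
}


def run_demo():
    news_queue = queue.Queue()  # plays the role of newsQueue
    crawled = []
    worker = threading.Thread(target=consumer, args=(news_queue, crawled.append))
    worker.start()
    producer("seed", FAKE_PAGES.__getitem__, {"food additives"}, 0.5, news_queue)
    worker.join()
    return crawled
```

In this sketch `run_demo()` collects the URLs of the two relevant items and drops the unrelated one. A thread-safe `queue.Queue` lets the producer keep paging while the consumer processes URLs, which is the decoupling the abstract attributes to newsQueue.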

Description

Technical field

[0001] The invention relates to the technical field of network information processing, and in particular to a method for designing a keyword-based topic web crawler.

Background art

[0002] The development of the Internet has brought people abundant information resources, but it has also brought challenges to traditional search engines. Resource coverage and the accuracy and relevance of search results have declined, and searching has become increasingly difficult for users. The topic crawler search engine therefore came into being and has developed rapidly in recent years.

[0003] A web crawler is a program that automatically crawls web pages and extracts their content in order to obtain information resources from the Internet. Web crawlers fall into two main categories: general crawlers and topic crawlers. A general crawler is a general-purpose web crawler that adopts a certain crawling strategy based on an...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/9566
Inventors: 陈端兵, 杨柳, 傅彦, 周俊临
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA