Method for constructing topic web crawler system

A method for constructing a topic web crawler, applied to building the crawler component of a network data collection system. It addresses the problems of data redundancy, low relevance, and excessive data volume, and achieves fast crawling.

Inactive; Publication Date: 2011-05-25
HARBIN ENG UNIV
4 Cites, 38 Cited by

AI Technical Summary

Problems solved by technology

Because ordinary crawlers apply no selection criteria while crawling, they tend to collect excessive and redundant data, so the final results returned by search engines correlate poorly with user needs.


Examples


Specific embodiment

[0029] Step (1): For the topic to be crawled, define an initial keyword-based description vector for the topic, with the weight of every component set to 1; set the correlation threshold; and initialize the URL queue.
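
Step (1) can be illustrated with a minimal sketch. The names CrawlConfig and init_crawl, the example keywords, the threshold value 0.3, and the example.com seed URLs are assumptions made for illustration; they do not appear in the patent text.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    """Hypothetical container for the state that Step (1) sets up."""
    topic_vector: dict   # keyword -> weight
    threshold: float     # correlation threshold
    url_queue: deque     # initial URL queue, consumed in order


def init_crawl(keywords, threshold, seed_urls):
    # Every component of the initial topic vector gets weight 1.
    topic_vector = {kw.lower(): 1.0 for kw in keywords}
    return CrawlConfig(topic_vector, threshold, deque(seed_urls))


config = init_crawl(
    ["crawler", "search engine"],   # illustrative topic keywords
    0.3,                            # illustrative threshold
    ["http://example.com/a", "http://example.com/b"],
)
```

A deque is used here only because the initial queue is read in sequence; any ordered container would serve.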

[0030] Step (2): The crawler takes URLs from the initial URL queue in sequence and crawls them.

[0031] Step (3): Perform text analysis on the selected URL. Because the anchor text of a single URL carries little information, and the links surrounding a page's body text appear in blocks, the anchor texts of all URLs in the link block containing the URL are combined into an extended anchor text vector. The correlation between this vector and the topic vector, anchor_score, is calculated and used as the topic correlation of every link in that link block.

[0032] The weights of the components of the extended anchor text vector are calculated using the TF-IDF formula:

[0033] W ik =...
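
The extended anchor text scoring of Step (3) might be sketched as follows. The formula above is truncated in this listing, so the weighting used here (term frequency times log(N / df)) is one common TF-IDF variant and only an assumption, as is the use of cosine similarity for the correlation. The function names, the doc_freq table, and the sample anchor texts are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(terms, doc_freq, num_docs):
    # Assumed weighting: tf * log(N / df). The patent's exact formula
    # is not recoverable from this listing.
    vec = {}
    for term, tf in Counter(terms).items():
        df = doc_freq.get(term, 0)
        if df > 0:  # terms with unknown document frequency are skipped
            vec[term] = tf * math.log(num_docs / df)
    return vec


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def anchor_score(block_anchor_texts, topic_vector, doc_freq, num_docs):
    # Pool the anchor texts of every link in the block into one
    # extended vector, then compare it with the topic vector.
    terms = [tok for text in block_anchor_texts for tok in text.lower().split()]
    return cosine(tfidf_vector(terms, doc_freq, num_docs), topic_vector)


topic = {"crawler": 1.0, "topic": 1.0}
doc_freq = {"crawler": 10, "topic": 20, "news": 50, "sports": 40}
related = anchor_score(["topic crawler tutorial", "crawler news"], topic, doc_freq, 100)
unrelated = anchor_score(["sports news"], topic, doc_freq, 100)
```

Every link in a block shares the block's score, so a link with a short or empty anchor text still inherits evidence from its neighbors.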


Abstract

The invention provides a method for constructing a topic web crawler system, comprising the following steps: (1) defining an initial description vector of a topic, setting an initial correlation threshold, and initializing a URL (Uniform Resource Locator) queue; (2) acquiring URLs sequentially from the initial URL queue and crawling them; (3) performing text analysis on the URLs; (4) performing link analysis on the URLs; (5) calculating the correlation of the URLs with the topic; (6) adding the URLs whose correlation exceeds the threshold to an ordered URL queue sorted by correlation with the topic vector, crawling them in order until the queue is empty, extracting the sub-URLs in each crawled webpage, and returning to step (3); (7) optimizing with a genetic algorithm; and (8) updating the topic vector through a Rocchio feedback module, dynamically adjusting the correlation threshold, and continuing to crawl webpages. The method does not require a large number of training texts to be prepared in advance. It is fast and suited to processing very large volumes of online webpage data.
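
Steps (6) and (8) can be sketched roughly as below. This is an illustration under stated assumptions, not the patent's implementation: score and fetch_links are hypothetical callables standing in for the correlation calculation and the page downloader, max_pages is an added safety limit, and rocchio_update uses the textbook Rocchio formula with arbitrary alpha, beta, and gamma values, since the listing does not give the feedback module's details. The genetic algorithm of step (7) and the dynamic threshold adjustment are omitted.

```python
import heapq


def crawl(seed_urls, score, fetch_links, threshold, max_pages=100):
    # Ordered URL queue: a heap keyed on negated correlation so the
    # most relevant URL is crawled first.
    seen = set(seed_urls)
    frontier = []
    visited = []

    def enqueue_children(url):
        for child in fetch_links(url):
            if child in seen:
                continue
            seen.add(child)
            s = score(child)
            if s > threshold:  # keep only URLs above the threshold
                heapq.heappush(frontier, (-s, child))

    for url in seed_urls:  # seeds are crawled in their given order
        visited.append(url)
        enqueue_children(url)
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        enqueue_children(url)
    return visited


def rocchio_update(topic, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Textbook Rocchio: alpha*q + beta*mean(relevant) - gamma*mean(non_relevant).
    updated = {t: alpha * w for t, w in topic.items()}
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for t, w in doc.items():
                updated[t] = updated.get(t, 0.0) + coeff * w / len(docs)
    # Non-positive weights are dropped, a common simplification.
    return {t: w for t, w in updated.items() if w > 0}


# Toy link graph and precomputed scores, for illustration only.
links = {"seed": ["a", "b", "c"], "a": ["d"], "b": [], "c": [], "d": []}
scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7}
order = crawl(["seed"], scores.get, links.get, threshold=0.3)
new_topic = rocchio_update(
    {"crawler": 1.0},
    relevant=[{"crawler": 1.0, "spider": 1.0}],
    non_relevant=[{"sports": 1.0}],
    alpha=1.0, beta=0.5, gamma=0.25,
)
```

In the toy run, page b falls below the threshold and is never crawled, while d is crawled before c because its score is higher even though it was discovered later.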

Description

Technical field

[0001] The invention relates to a method for constructing the crawler component of a network data collection system, and in particular to a method for constructing a topic web crawler system.

Background

[0002] With the advent of the information age and the rapid development of the Internet, the amount of information online is growing exponentially. Faced with this volume, users usually rely on search engines to locate the data they need. The current mainstream search engines are general-purpose search engines. Because their crawlers do not target specific content, the results users retrieve often include many pages that are irrelevant or only weakly relevant to their needs, and users must browse many web pages to find useful information. The web crawler is a core part of a search engine, and its search technology greatly aff...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 宁慧, 吴昊, 谈亚洲, 吴悦, 吕志龙
Owner: HARBIN ENG UNIV