Method for constructing topic web crawler system
A method for constructing a topic web crawler system, applied to the construction of the crawler component. It addresses the problems of data redundancy, low topic correlation, and excessive data volume, and achieves fast crawling speed.
Specific embodiment
[0029] Step (1): For the topic to be crawled, define the initial topic description vector from keywords, with the weight of every component set to 1. Set the correlation threshold and set up the initial URL queue.
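The initialization in Step (1) can be sketched as follows. This is a minimal illustration, assuming the topic vector is stored as a keyword-to-weight mapping; the function name `init_crawler`, the keywords, the threshold value, and the seed URLs are hypothetical placeholders and do not come from the patent.

```python
from collections import deque


def init_crawler(keywords, threshold, seed_urls):
    """Build the initial crawler state described in Step (1)."""
    # Initial topic description vector: every keyword component has weight 1.
    topic_vector = {kw.lower(): 1.0 for kw in keywords}
    # FIFO queue, so URLs can later be taken in sequence.
    url_queue = deque(seed_urls)
    return topic_vector, threshold, url_queue


# Hypothetical example values, chosen only for illustration.
topic_vector, threshold, url_queue = init_crawler(
    keywords=["crawler", "search", "index"],
    threshold=0.3,
    seed_urls=["http://example.com/a", "http://example.com/b"],
)
```

A `deque` is used here because Step (2) takes URLs in order, and `popleft()` does that in constant time.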
[0030] Step (2): The crawler takes URLs from the initial URL queue in sequence and crawls them.
[0031] Step (3): Perform text analysis on the selected URL. Because the anchor text of a single URL carries little information, and the links around a page's body text appear in blocks, the anchor texts of all URLs in the link block containing the selected URL are combined into an extended anchor text vector. The correlation between this vector and the topic vector, anchor_score, is calculated and used as the topic correlation of every link in that link block.
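The link-block scoring in Step (3) might look like the sketch below. Two assumptions are made that the text does not state: the correlation measure is taken to be cosine similarity, and components are weighted by raw term counts for brevity rather than by the TF-IDF weights the description uses. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extended_anchor_vector(anchor_texts):
    """Pool the anchor texts of every link in one block into a term-count vector."""
    counts = Counter()
    for text in anchor_texts:
        counts.update(tokenize(text))
    return dict(counts)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_link_block(links, topic_vector):
    """Give every URL in a link block the same anchor_score.

    links: list of (url, anchor_text) pairs taken from one link block.
    """
    block_vector = extended_anchor_vector(text for _, text in links)
    anchor_score = cosine(block_vector, topic_vector)
    return {url: anchor_score for url, _ in links}


# Hypothetical link block: the second anchor text alone says nothing about
# the topic, but it inherits the score of the block it belongs to.
block = [
    ("http://example.com/1", "web crawler design"),
    ("http://example.com/2", "more"),
]
scores = score_link_block(block, {"crawler": 1.0, "search": 1.0})
```

Sharing one score across the block is what lets a link with an uninformative anchor such as "more" still be ranked by the content of its neighbours.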
[0032] The weights of the components in the extended anchor text vector are calculated using the TF-IDF formula:
[0033] W ik =...
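The formula above is truncated in the source, so the following is not a reconstruction of it. It is only a sketch of one common textbook TF-IDF variant, term frequency multiplied by log(N / df), to illustrate the general kind of weighting involved; the patent's exact formula may differ. The function name `tfidf_weights` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_terms, corpus):
    """Weight each term of doc_terms by tf * log(N / df).

    doc_terms: list of terms in the extended anchor text.
    corpus: list of term lists used to estimate document frequencies.
    This is a textbook variant, assumed here for illustration only.
    """
    n_docs = len(corpus)
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Terms absent from the corpus get weight 0 (a choice of this sketch).
        weights[term] = freq * math.log(n_docs / df) if df else 0.0
    return weights


# Hypothetical corpus of three tokenized link blocks.
corpus = [["crawler", "web"], ["web", "page"], ["topic", "web"]]
weights = tfidf_weights(["crawler", "web", "crawler"], corpus)
```

In this variant a term that occurs in every document, such as "web" in the sample corpus, receives weight 0, while a rarer term such as "crawler" is weighted up.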