Webpage URL deduplication method based on a distributed database
A distributed database technique, applied in the field of distributed databases, that avoids problems such as sacrificing accuracy, achieves a low collision rate, and solves memory problems.
Embodiment Construction
[0024] As shown in Figure 1, the webpage URL deduplication method based on a distributed database according to the present invention includes the following steps. Step S101: Obtain the URL to be crawled; the distributed crawler obtains the URL of the webpage to be crawled.
[0025] Step S102: Calculate the hash value of the URL. The MurmurHash method is used to map the webpage URL to a hash value of type long. The advantages of MurmurHash are high computing performance and a low collision rate. In addition, because each URL is reduced to a fixed-size hash value, the algorithm also compresses the data, which improves communication efficiency and saves storage space.
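Step S102 can be illustrated with a minimal sketch. The text does not say which MurmurHash variant or seed is used, so the sketch below assumes the 64-bit MurmurHash2 variant (MurmurHash64A) with seed 0, implemented in pure Python. The names `murmurhash64a`, `url_fingerprint`, and `to_signed_long` are hypothetical and chosen only for this illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF      # keep arithmetic within 64 bits
M = 0xC6A4A7935BD1E995           # MurmurHash64A multiplication constant
R = 47                           # MurmurHash64A shift constant


def murmurhash64a(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 64-bit MurmurHash64A of data (assumed variant)."""
    length = len(data)
    h = (seed ^ (length * M)) & MASK64

    # Mix the input eight bytes at a time, read as little-endian integers.
    n_blocks = length // 8
    for i in range(n_blocks):
        k = int.from_bytes(data[8 * i:8 * i + 8], "little")
        k = (k * M) & MASK64
        k ^= k >> R
        k = (k * M) & MASK64
        h ^= k
        h = (h * M) & MASK64

    # Mix the remaining 1 to 7 bytes, if any.
    tail = data[8 * n_blocks:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * M) & MASK64

    # Final avalanche.
    h ^= h >> R
    h = (h * M) & MASK64
    h ^= h >> R
    return h


def url_fingerprint(url: str) -> int:
    """Map a URL to a fixed-size 64-bit fingerprint (UTF-8 encoding assumed)."""
    return murmurhash64a(url.encode("utf-8"))


def to_signed_long(h: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed long."""
    return h - (1 << 64) if h >= (1 << 63) else h
```

Whatever the length of the URL, the fingerprint occupies eight bytes, which is the compression effect described above. Two different URLs can in principle share a fingerprint, so the low collision rate is a probabilistic property, not a guarantee.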
[0026] Step S103: Query the database. The distributed crawlers compress the URLs in their collection databases and send them to the distributed database for deduplication. The database system in the present invention adopts a decentralized structure, implemented mainly by means of consistent hashing.
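The decentralized lookup in step S103 can be sketched as a consistent hash ring that routes each URL fingerprint to one storage node, which then answers whether the fingerprint has been seen before. This is only an illustration under stated assumptions: the node names, the use of virtual nodes, the in-memory sets standing in for database nodes, and the class names `ConsistentHashRing` and `DedupCluster` are all hypothetical, and BLAKE2b from the Python standard library stands in for MurmurHash to keep the example self-contained.

```python
import bisect
import hashlib


def hash64(data: bytes) -> int:
    """64-bit stand-in hash (BLAKE2b, not MurmurHash; an assumption)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class ConsistentHashRing:
    """Hash ring with virtual nodes; no central coordinator is needed."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place several virtual points per node to even out the load.
        for i in range(self._vnodes):
            pos = hash64(f"{node}#{i}".encode("utf-8"))
            bisect.insort(self._ring, (pos, node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, fingerprint: int) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First virtual point clockwise from the fingerprint, wrapping around.
        idx = bisect.bisect_left(self._ring, (fingerprint, ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


class DedupCluster:
    """In-memory stand-in for the distributed database nodes."""

    def __init__(self, nodes):
        self.ring = ConsistentHashRing(nodes)
        self.stores = {node: set() for node in nodes}

    def is_new(self, url: str) -> bool:
        """Return True the first time a URL is seen, False afterwards."""
        fp = hash64(url.encode("utf-8"))
        store = self.stores[self.ring.node_for(fp)]
        if fp in store:
            return False
        store.add(fp)
        return True


cluster = DedupCluster(["db-a", "db-b", "db-c"])
first = cluster.is_new("http://example.com/page")   # True: not yet stored
second = cluster.is_new("http://example.com/page")  # False: duplicate
```

Because a fingerprint is owned by the first virtual point clockwise from it, removing a node only reassigns the fingerprints that node owned; every other fingerprint keeps its owner.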
[0027] The consistent h...