SIMD-optimized web page deduplication and concurrency method
Web page and web page feature technology, applied in special data processing applications, instruments, electrical digital data processing, etc.
Examples
Embodiment 1
[0078] The main steps of the SIMD-based web page deduplication algorithm in this embodiment are as follows:
[0079] 1. Web page text information extraction. This step extracts the effective information of the web page;
[0080] 2. Shingle extraction. This step extracts the features of the web page;
[0081] 3. Clustering. This step reduces the number of comparisons, and with it the time and space complexity;
[0082] 4. Fingerprint comparison. This step finds similar web pages and eliminates the duplicates.
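As a rough illustration of steps 2 and 4 above, the following Python sketch builds word-level shingles, hashes them into a fingerprint set, and compares fingerprints by set resemblance (Jaccard similarity). It is only a scalar sketch under stated assumptions, not the method of this embodiment: the function names, the shingle width `k=3`, the BLAKE2b hash, and the `0.7` threshold are all hypothetical choices, the clustering step is omitted (every page is compared against every kept page), and no SIMD vectorization is shown.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of text (lower-cased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a text shorter than k words forms a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Hash each shingle to a 64-bit integer; the set of hashes is the fingerprint."""
    hashes = set()
    for shingle in shingles(text, k):
        data = " ".join(shingle).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def resemblance(fp_a, fp_b):
    """Jaccard similarity of two fingerprint sets, in the range [0.0, 1.0]."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def deduplicate(pages, threshold=0.7, k=3):
    """Return indices of pages kept after dropping near-duplicates of earlier pages."""
    kept = []  # list of (index, fingerprint) for pages retained so far
    for index, page in enumerate(pages):
        fp = fingerprint(page, k)
        # Keep the page only if it is below the threshold against every kept page.
        if all(resemblance(fp, kept_fp) < threshold for _, kept_fp in kept):
            kept.append((index, fp))
    return [index for index, _ in kept]
```

Calling `deduplicate(pages)` on a list of extracted page texts returns the positions of the pages that survive. Comparing every pair is quadratic in the number of pages, which is the cost the clustering step is meant to reduce.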
[0083] Each step is described as follows:
[0084] 1. Web page text information extraction
[0085] The text structure of a web page comprises a physical structure and a logical structure. The physical structure refers to the composition of the web page, mainly including web page tags, web page titles, web content, article titles, advertisements and other information; the logical structure mainly refers to the structure between par...
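To make the idea of separating tags from content concrete, here is a minimal Python sketch that uses the standard-library `html.parser` module to pull the page title and the visible text out of an HTML string while skipping `script` and `style` content. The class name `TextExtractor`, the function `extract_text`, and the choice of which tags to skip are assumptions made for illustration; the extraction rules actually used in this embodiment (for example, how advertisements are filtered) are not given in the text above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the page title and visible text, ignoring script and style content."""

    SKIP_TAGS = {"script", "style"}  # assumption: these carry no effective text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def extract_text(html):
    """Return (title, body_text) for an HTML document given as a string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

The returned body text is the kind of input a later shingle extraction step would consume.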