Similar web page duplicate-removing system based on parallel programming mode

The invention relates to parallel programming modes and web page technology, applied in the fields of instruments, calculation, and electrical digital data processing. It addresses problems such as misjudgment and low accuracy, and achieves the effects of avoiding judgment bias, improving efficiency, and trading space for time.

Inactive Publication Date: 2011-04-20
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

This method also has shortcomings. For example, different web pages on the same topic may have a large overlap in their keyword sets even though their content is not duplicated. Such pages can produce a large number of misjudgments, in which non-similar pages are considered similar, so accuracy is low.
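The misjudgment described above can be illustrated with a minimal sketch. It assumes keyword-set similarity is measured as Jaccard overlap, which the source does not state; the function name, the two keyword sets, and the 0.6 threshold are hypothetical values chosen for illustration only.

```python
def keyword_overlap(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b| of two keyword sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical pages on the same topic whose content differs.
page_a = {"search", "engine", "index", "crawler", "ranking"}
page_b = {"search", "engine", "index", "crawler", "latency"}

SIMILARITY_THRESHOLD = 0.6  # hypothetical cut-off, not from the source

overlap = keyword_overlap(page_a, page_b)
# The pages share 4 of 6 distinct keywords, so a keyword-only test
# would flag them as similar even though their content is different.
flagged_as_similar = overlap >= SIMILARITY_THRESHOLD
```

Under these assumptions the two pages exceed the threshold purely because they share a topic vocabulary, which is the kind of false positive the text describes.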


Examples

Embodiment Construction

[0030] The present invention will be described in further detail below in conjunction with the accompanying drawings and examples.

[0031] Generally speaking, the detection and deduplication of similar web pages includes the following steps: (1) first, extract some features of the web page; (2) then encode or quantify the features for fast calculation; (3) then use the encoding ...; (4) finally, if large-scale calculation is required, use a high-performance algorithm on a high-performance computing platform to meet the high-speed requirements of large-scale calculation.
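The first steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of the invention: word counts stand in for the extracted features, a simhash-style 64-bit fingerprint stands in for the encoding, and, because the description of step (3) is incomplete in the source, comparing fingerprints by Hamming distance is an assumption. All function names and the `max_distance` default are hypothetical.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # fingerprint width; an assumed value


def extract_features(text: str) -> Counter:
    """Step 1 (assumed form): lowercase word counts as a simple feature vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def fingerprint(features: Counter) -> int:
    """Step 2 (assumed form): encode weighted features as a simhash-style fingerprint."""
    totals = [0] * BITS
    for token, weight in features.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            # Each token votes for or against bit i, weighted by its count.
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Step 3 (assumption): treat pages as similar when fingerprints are close."""
    fp_a = fingerprint(extract_features(text_a))
    fp_b = fingerprint(extract_features(text_b))
    return hamming_distance(fp_a, fp_b) <= max_distance
```

Fixed-width fingerprints keep each comparison to a single XOR and bit count, which is one way to satisfy the "fast calculation" goal of step (2).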

[0032] As shown in Figure 1, the similar webpage deduplication system based on the parallel programming mode provided by the present invention includes a webpage content preprocessing module 100, a webpage feature vector extraction module 200, a webpage feature fingerprint calculation module 300, a webpage fingerprint online deduplication module 400, and a webpage fingerprint distribution module. ...

Abstract

The invention provides a similar web page deduplication system based on a parallel programming mode, comprising a web page content preprocessing module, a web page feature vector extraction module, a web page feature fingerprint calculation module, a web page fingerprint online deduplication module, a web page fingerprint distributed batch deduplication module, and a computing platform based on specific distribution. For web pages obtained by web crawlers, the system performs unified conversion of text content encoding, standardization of document structure, removal of noisy web page content, analysis and identification of the thematic content of web pages, word segmentation of continuous text content, and similar steps, thereby forming feature vectors that represent the web pages. Relevant algorithms can then be applied to these vectors to obtain web page fingerprints that represent web page characteristics. Under the massive data volumes of the Internet, the system provided by the invention accurately and quickly detects exact or approximate duplication of web page content caused by site mirroring, reposting of web documents, and the like, and completes the corresponding deduplication work, thereby improving the storage efficiency of search engines and giving search engine users a better experience.
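As a rough illustration of what batch deduplication of fingerprints in a parallel mode could look like, the sketch below follows a map/reduce-style pattern. The source does not specify the platform or algorithm, so everything here is an assumption: local threads stand in for distributed workers, only exact fingerprint matches are removed, and the function names and partition count are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(pages, num_partitions):
    """Route each (url, fingerprint) pair to a partition chosen by its fingerprint."""
    partitions = defaultdict(list)
    for url, fp in pages:
        # Equal fingerprints always land in the same partition.
        partitions[fp % num_partitions].append((url, fp))
    return partitions


def reduce_phase(partition):
    """Within one partition, keep the first URL seen for each fingerprint."""
    kept = {}
    for url, fp in partition:
        kept.setdefault(fp, url)
    return kept


def batch_deduplicate(pages, num_partitions=4):
    """Deduplicate partitions independently; threads stand in for cluster workers."""
    partitions = map_phase(pages, num_partitions)
    unique = {}
    with ThreadPoolExecutor() as pool:
        for kept in pool.map(reduce_phase, partitions.values()):
            unique.update(kept)
    return unique  # maps fingerprint -> representative URL
```

Because pages with the same fingerprint are always routed to the same partition, each partition can be processed without coordinating with the others, which is what makes this pattern straightforward to distribute.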

Description

Technical field

[0001] The invention belongs to the field of computer Internet information retrieval and analysis technology, and in particular relates to a similar web page deduplication system based on a parallel programming mode. The system uses analysis technology to extract the feature vectors of web pages, applies a web page fingerprint deduplication algorithm based on a parallel mode, and completes the deduplication of similar web pages in a distributed system environment, improving the efficiency of the search engine's index module and retrieval module.

Background technique

[0002] With the unprecedented growth of Internet technology and scale in recent years, more and more traditional resources are migrating to the Internet. Search engines have become the main information retrieval tool for today's users because of their powerful and convenient retrieval functions. However, due to the large scale of the Internet and geographical access restrictions, many websites use server mirroring...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventors: 李瑞轩, 丁益斌, 文坤梅, 陈珊珊, 辜希武, 卢正鼎, 靳延安, 郑鹏, 赵勇
Owner: HUAZHONG UNIV OF SCI & TECH