Similar web page duplicate-removing system based on parallel programming mode

A web page deduplication system based on a parallel programming mode, applied in fields such as computing and electrical digital data processing. It addresses the low accuracy and misjudgment of existing similar web page detection, avoiding judgment deviation and improving efficiency.

Inactive Publication Date: 2010-02-10
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

This method also has shortcomings. For example, different webpages on the same topic may have a large overlap rate in their keyword sets, but the content of the webpages is



Embodiment Construction

[0030] The present invention is described in further detail below with reference to the accompanying drawings and examples.

[0031] Generally speaking, detecting and deduplicating similar web pages involves the following steps: (1) first, extract some features of the web page; (2) then encode or quantize the features for fast calculation; (3) then encode [...]; (4) finally, if large-scale calculation is required, use a high-performance algorithm on a high-performance computing platform to meet the high-speed requirements of large-scale calculation.
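The steps above can be sketched with a SimHash-style fingerprint. This is only an illustration under stated assumptions: the text here does not name the fingerprint algorithm, so SimHash is a swapped-in technique, and the function names, whitespace tokenization, term-frequency weights, and 64-bit width are all hypothetical choices.

```python
import hashlib
from collections import Counter


def extract_features(text):
    """Step 1 (assumed form): whitespace tokens weighted by term frequency."""
    return Counter(text.lower().split())


def token_hash(token, bits=64):
    """Step 2 (assumed form): quantize a token into a fixed-width integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(features, bits=64):
    """Fold weighted token hashes into one fingerprint (SimHash, assumed)."""
    totals = [0] * bits
    for token, weight in features.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

With a fingerprint of this kind, two pages would be treated as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold itself is a tuning parameter that the text does not specify.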

[0032] As shown in Figure 1, the similar web page deduplication system based on a parallel programming mode provided by the present invention includes a web page content preprocessing module 100, a web page feature vector extraction module 200, a web page feature fingerprint calculation module 300, a web page fingerprint online deduplication module 400, and a web page fingerprint distributed batch deduplication module 500.
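As a rough illustration of how modules 100 to 400 might hand data to one another, the following minimal sketch wires them together in a single process. Every name here is hypothetical, the regular-expression tag stripping is a crude stand-in for real noise removal, and the SHA-1 fingerprint only catches pages that are identical after normalization, not near duplicates. Module 500 is omitted because it depends on the distributed platform.

```python
import hashlib
import html
import re
from collections import Counter


def preprocess(raw_html):
    """Module 100 (assumed form): drop scripts, styles and tags; normalize text."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_feature_vector(text):
    """Module 200 (assumed form): bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text))


def compute_fingerprint(vector):
    """Module 300 (assumed form): exact-match hash of the canonical vector."""
    canonical = " ".join(f"{term}:{count}" for term, count in sorted(vector.items()))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class OnlineDeduplicator:
    """Module 400 (assumed form): check each new page against seen fingerprints."""

    def __init__(self):
        self._seen = {}

    def check(self, url, fingerprint):
        """Return the URL of an earlier page with this fingerprint, else None."""
        original = self._seen.setdefault(fingerprint, url)
        return None if original == url else original


def process_page(dedup, url, raw_html):
    """Run one crawled page through modules 100 to 400."""
    vector = extract_feature_vector(preprocess(raw_html))
    return dedup.check(url, compute_fingerprint(vector))
```

Calling `process_page` once per crawled page returns `None` for a page seen for the first time and the URL of the earlier copy otherwise.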

...


Abstract

The invention provides a similar web page deduplication system based on a parallel programming mode, comprising a web page content preprocessing module, a web page feature vector extraction module, a web page feature fingerprint calculation module, a web page fingerprint online deduplication module, a web page fingerprint distributed batch deduplication module, and a specific distributed computing platform. For web pages obtained by web crawlers, the system performs unified conversion of text content encoding, standardization of document structure, removal of web page noise content, analysis and identification of the pages' thematic content, word segmentation of continuous text, and similar steps, thereby forming feature vectors that represent the web pages. Relevant algorithms are then applied to these vectors to obtain web page fingerprints that represent the pages' characteristics. At the scale of Internet data, the system accurately and quickly detects exact or near duplication of web page content caused by site mirroring, reposting of web documents, and the like, and performs the corresponding deduplication, thereby improving the storage efficiency of search engines and giving their users a better experience.
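The distributed batch deduplication described above could be organized along the following lines. This is a sketch only, assuming a MapReduce-style model, which the text here does not name: the map step emits a fingerprint per page, a shuffle groups pages by fingerprint, and the reduce step keeps one page per group. A thread pool stands in for the distributed platform, and the function names and the exact-match SHA-1 key are hypothetical.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_page(record):
    """Map: emit (fingerprint, url) for one (url, text) record."""
    url, text = record
    normalized = " ".join(text.lower().split())
    fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return fingerprint, url


def shuffle(pairs):
    """Shuffle: group URLs that share the same fingerprint key."""
    groups = defaultdict(list)
    for fingerprint, url in pairs:
        groups[fingerprint].append(url)
    return groups


def reduce_group(item):
    """Reduce: keep the first URL (in sorted order) and mark the rest as duplicates."""
    _fingerprint, urls = item
    ordered = sorted(urls)
    return ordered[0], ordered[1:]


def batch_deduplicate(records, workers=4):
    """Run map, shuffle and reduce; return (kept_urls, removed_urls)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pairs = list(pool.map(map_page, records))
        results = list(pool.map(reduce_group, shuffle(pairs).items()))
    kept = sorted(keeper for keeper, _ in results)
    removed = sorted(url for _, duplicates in results for url in duplicates)
    return kept, removed
```

Because the key is an exact hash, this sketch only merges pages whose normalized text is identical; detecting near duplicates would need a similarity-preserving fingerprint and a grouping rule that tolerates small differences.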

Description

Technical field

[0001] The invention belongs to the field of Internet information retrieval and analysis, and specifically relates to a similar web page deduplication system based on a parallel programming mode. The system improves on existing similar web page deduplication systems: it combines existing web page structure and topic analysis techniques to extract web page feature vectors, applies a parallel-mode web page fingerprint deduplication algorithm, and performs deduplication of similar web pages in a distributed system environment, improving the efficiency of a search engine's indexing and retrieval modules.

Background technique

[0002] With the unprecedented growth of Internet technology and scale in recent years, more and more traditional resources are migrating to the Internet. Search engines have become the main tool users rely on for information retrieval because of their powerful and convenient retrieval functions. However, ...


Application Information

IPC(8): G06F17/30
Inventor 李瑞轩, 丁益斌, 文坤梅, 陈珊珊, 辜希武, 卢正鼎, 靳延安, 郑鹏, 赵勇
Owner HUAZHONG UNIV OF SCI & TECH