
Efficient and accurate web page deduplication system based on cloud computing

A web page deduplication technology based on cloud computing, applied in computing, web data retrieval, and web data browsing optimization. It addresses the inability of existing approaches to achieve both efficiency and accuracy, reduces the cost of computing Hamming distance, and shortens processing time, giving fast deduplication with excellent time and space complexity and improved computing speed.

Pending Publication Date: 2021-02-02
扆亮海

AI Technical Summary

Problems solved by technology

[0013] The SimHash algorithm alone cannot achieve both high efficiency and high accuracy in similarity detection over massive numbers of web pages; the cloud-computing-based efficient and accurate web page deduplication system provided by the present invention addresses this problem. Taking the limits on time and space resources into account, the cloud-based web page deduplication method has excellent time and space complexity. By improving and optimizing the SimHash algorithm for the cloud, it greatly reduces the cost of calculating Hamming distance. Cloud-based web page deduplication is practical, efficient, easy to scale, accurate, and fast, and it solves the problem of rapid deduplication of massive numbers of web pages. The method is innovative and has clear advantages. A Hadoop-based efficient and accurate deduplication system is designed and implemented for web page fingerprint comparison, taking significantly less time and producing significantly more accurate results.




Embodiment Construction

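[0062] The technical solution of the cloud-computing-based efficient and accurate web page deduplication system provided by the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand and implement the invention.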

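[0063] With the rapid development of network technology, web pages, as carriers of online information, have grown explosively. Among these massive numbers of pages there is a large amount of reposted and shared duplicate content. Duplicate pages degrade the efficiency and precision of search engines, leading to poor user experience, and also waste a great deal of storage space. Improving the accuracy of search results requires detecting and removing duplicate pages, but the greatest challenge in deduplicating massive numbers of pages is detecting duplicates accurately while maintaining high efficiency. The present invention therefore proposes a similarity-detection deduplication method for large search engines collecting massive numbers of web pages. Regarding the complexity of current page structure: most pages are written according to the standard HTML specification, but a certain number are not standard in structure or layout; some contain many useless HTML tags generated automatically by web page editors, and some tags have an opening tag but no closing tag, which leaves the page structure chaotic and hard for machines to parse. The invention therefore first preprocesses the page structure: it completes tags that were not closed properly, removes tag elements with no real meaning, and removes content unrelated to the page's main topic. After the page structure is reorganized, a Chinese word segmentation tool segments the page content and stop words are removed. A Hadoop-based efficient and accurate web page deduplication system is designed and implemented to compare web page fingerprints, which effectively improves computation speed.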
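The structural preprocessing step described above could be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`, not the patent's implementation: the `NOISE_TAGS` set, the `TextExtractor` class, and the `extract_main_text` function are hypothetical names, and the sketch tolerates unclosed tags rather than repairing them.

```python
from html.parser import HTMLParser

# Assumption for illustration: tags whose content is treated as noise.
# The source names navigation bars, advertisements, and copyright notices
# as noise but does not say how they are recognized.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping content inside noise tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.noise_depth == 0 and text:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of a page with noise sections dropped.

    Unclosed ordinary tags such as <p> or <b> are tolerated because only
    the text nodes are kept. An unclosed noise tag would hide everything
    after it, which a real system would need to handle.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `extract_main_text("<nav>Home</nav><p>Hello <b>world")` keeps only the paragraph text even though neither `<p>` nor `<b>` is closed.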

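[0064] The present invention proposes a cloud-computing-based method for efficient and accurate web page deduplication. Preprocessing and Chinese word segmentation accurately extract the feature strings that characterize page content; a Hadoop-based efficient and accurate web page deduplication system is designed and implemented; and experimental analysis verifies the system's feasibility, efficiency, and accuracy. The invention mainly comprises: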

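[0065] First, web page preprocessing. Given the particular nature of web page content, and to improve deduplication precision, crawled pages are preprocessed to strip noise such as navigation bars, advertisements, and copyright notices. A Chinese word segmentation tool segments the page and word frequencies are computed as weights; a stop-word list removes meaningless stop words. The resulting feature terms and weights are the input to the deduplication algorithm.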
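As a rough illustration of how feature terms and frequency weights might be produced, the sketch below counts tokens after stop-word removal. The regular-expression tokenizer is only a stand-in: the source uses a Chinese word segmentation tool, which is not named and is not reproduced here. `STOP_WORDS`, `tokenize`, and `weighted_features` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a tiny English stop-word list for illustration only.
# The source uses a stop-word list but does not give its contents.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Stand-in tokenizer that splits on runs of word characters.

    Chinese text has no spaces between words, so a real pipeline would
    call a dedicated word segmentation tool here instead.
    """
    return re.findall(r"\w+", text.lower())


def weighted_features(text, stop_words=STOP_WORDS):
    """Map each remaining term to its frequency, used as its weight."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return dict(Counter(tokens))
```

The returned dictionary of term weights has the shape that a fingerprinting step such as SimHash would take as input.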

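[0066] Second, improvement and optimization of the cloud-computing-based web page deduplication method. Taking the SimHash algorithm, which has relatively high deduplication precision, as the prototype, an improved and optimized deduplication algorithm suited to the MapReduce computing model is designed. The algorithm can efficiently and accurately identify pages with minor modifications and, using the Hadoop platform, effectively shortens computation time.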
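A minimal sketch of the textbook SimHash fingerprint and Hamming distance follows, together with one common way to cut down the number of Hamming comparisons in a map/reduce style. The banding step is an assumption chosen for illustration; the source does not describe its improved algorithm in the text available here. The 64-bit width, the MD5-based `token_hash`, the four bands, and the distance threshold of 3 are likewise assumed values, and the code runs in a single process rather than on Hadoop.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BITS = 64  # assumed fingerprint width


def token_hash(token):
    """64-bit hash of a term (first 8 bytes of MD5, an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(features):
    """Standard SimHash: weighted bitwise vote over the term hashes."""
    totals = [0] * BITS
    for token, weight in features.items():
        h = token_hash(token)
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def map_bands(doc_id, fingerprint, bands=4):
    """Map step: emit one (band index, band bits) key per band."""
    width = BITS // bands
    mask = (1 << width) - 1
    for b in range(bands):
        yield (b, (fingerprint >> (b * width)) & mask), (doc_id, fingerprint)


def find_near_duplicates(fingerprints, max_distance=3, bands=4):
    """Group by band key, then compare only within each group (reduce step).

    With more bands than max_distance, two fingerprints that differ in at
    most max_distance bits must agree on at least one whole band, so no
    qualifying pair is missed.
    """
    groups = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for key, value in map_bands(doc_id, fp, bands):
            groups[key].append(value)
    pairs = set()
    for members in groups.values():
        for (id_a, fp_a), (id_b, fp_b) in combinations(members, 2):
            if hamming(fp_a, fp_b) <= max_distance:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs
```

Grouping by band means each fingerprint is compared only with fingerprints that share at least one band, instead of with every other fingerprint, which is where the saving in Hamming distance computations comes from in this sketch.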

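[0067] Third, design and implementation of the cloud computing web page deduplication system. A Hadoop-based web page deduplication system is designed and implemented, comprising a preprocessing module, a Chinese word segmentation module, and a core module for accurate web page deduplication, where the core module is based on Hadoo...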


Abstract

The invention provides a cloud-computing-based system for efficient and accurate web page deduplication. It aims to solve the following problems: most web pages collected by existing search engines are static pages; because of extensive reposting and plagiarism, the main content of a large number of pages is duplicated; and for a search engine, duplicate pages increase the burden of index storage and consume more retrieval time. A web page deduplication system based on the Hadoop cloud platform is designed and implemented by combining open-source frameworks. Detecting and judging duplicates in real time, after the spider program captures a page, allows the system to connect better with the other modules of a search engine. In the massive web page collection stage, the system first preprocesses the pages, then performs similarity detection, and removes duplicate or highly similar pages, thereby improving index quality, optimizing retrieval results, and providing users with a good search experience.

Description

Technical field

[0001] The invention relates to a system for accurately deduplicating web pages, in particular to an efficient and accurate web page deduplication system based on cloud computing, and belongs to the technical field of web page deduplication.

Background

[0002] With the rapid development of electronic communication and computer network technology, the number of websites has grown rapidly, and the number of web pages has reached hundreds of billions. Finding the information they care about in this massive body of information has become the biggest problem for users. Search engines solve this problem well: they collect information from the Internet, organize and process it, and then provide users with an easy-to-use retrieval system. Users only need to enter keywords for the content they care about to find the desired information.

[0003] Most of the webpages collected by existin...


Application Information

IPC(8): G06F16/9532, G06F16/955, G06F16/957, G06F16/958
CPC: G06F16/9532, G06F16/955, G06F16/957, G06F16/958
Inventor: 扆亮海, 刘文平
Owner: 扆亮海