Method and device for deduplication of web pages

A webpage deduplication method and device, applied in the field of webpage deduplication, that address the low efficiency of existing methods and their inability to deduplicate webpages with the same text content, so as to save storage resources and improve retrieval efficiency and the user's retrieval experience.

Active Publication Date: 2019-06-21
ADVANCED NEW TECH CO LTD

AI Technical Summary

Problems solved by technology

However, this approach has the disadvantage that it does not use the structural information of the webpage's text content, and duplicate webpages produced by reprinting cannot be deduplicated.
[0023] This algorithm converts the full-text comparison of two texts into a comparison of several words and sentences, which reduces the time and space complexity of the algorithm to a certain extent. However, it is not ideal for large-scale webpage deduplication, because finding long common subsequences can take a lot of time.
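To illustrate why common-subsequence comparison becomes expensive at scale, the following is a minimal sketch of the textbook dynamic-programming computation of the longest common subsequence length. It is not taken from the patent; the function name `lcs_length` is hypothetical, and the prior-art algorithm may differ in detail. Each pair of texts of lengths m and n costs on the order of m × n character comparisons, and a large collection requires many such pairs.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook dynamic programming: O(len(a) * len(b)) time,
    O(len(b)) extra space (only two rows are kept).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence found for the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Because the inner loop runs once per character pair, comparing two 10,000-character pages already takes about 10^8 steps, which is the cost the paragraph above refers to.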
[0024] The above analysis shows that the webpage deduplication algorithms in the prior art each have advantages but also shortcomings. For example, the clustering method is inefficient, and the method of excluding identical URLs cannot deduplicate webpages that have the same text content; the other methods have various defects as well. The prior art therefore does not solve the technical problem of webpage deduplication well.



Examples


Detailed Description of the Embodiments

[0068] The purpose of this application is to provide a method and device for deduplication of web pages, which are used to effectively deduplicate web pages with the same content, save storage resources, and improve user search experience.

[0069] To achieve the above technical objectives, as shown in Figure 1, Embodiment 1 of the present application provides a webpage deduplication method, which specifically includes the following steps:

[0070] Step 101, extracting the feature code of the webpage to be processed;

[0071] Specifically, before step 101, the method further includes: determining the type of the obtained webpage; if the obtained webpage is a themed webpage (that is, a webpage containing text content), the text content of the obtained webpage is edited into a uniform format, and the edited webpage is used as the webpage to be processed.

[0072] Specifically, after it is determined that the obtained webpage is a themed webpage, since the editing formats of the text content o...



Abstract

The present application discloses a webpage deduplication method and device. The method comprises: extracting a feature code of a to-be-processed webpage; converting the feature code into a key value, and searching a storage space to check whether the key value exists; if the key value exists, determining whether a preset requirement is met between the number of characters of the to-be-processed webpage and the number of characters of the webpage corresponding to the key value in the storage space; and if the preset requirement is met, determining that the to-be-processed webpage is a duplicate. The disclosed method and device can effectively deduplicate webpages with the same content, save storage resources, and improve the retrieval experience of users.
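The flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the text available here does not say how the feature code is extracted, how it is converted into a key value, or what the preset requirement is. In this sketch the feature code is assumed to be the first three characters of each sentence, the key value is assumed to be an MD5 digest, the storage space is a plain dictionary, and the preset requirement is assumed to be a character-count difference of at most 10%. All function names and the `tolerance` parameter are hypothetical.

```python
import hashlib
import re
from typing import Dict


def extract_feature_code(text: str) -> str:
    # ASSUMPTION: feature code = first 3 characters of every sentence.
    # The source does not specify the actual extraction rule.
    normalized = re.sub(r"\s+", " ", text).strip()
    sentences = re.split(r"[.!?。！？]", normalized)
    return "".join(s.strip()[:3] for s in sentences if s.strip())


def to_key(feature_code: str) -> str:
    # ASSUMPTION: the key value is an MD5 hex digest of the feature code.
    return hashlib.md5(feature_code.encode("utf-8")).hexdigest()


def meets_preset_requirement(n_new: int, n_stored: int,
                             tolerance: float = 0.1) -> bool:
    # ASSUMPTION: character counts must differ by at most `tolerance`
    # relative to the larger of the two counts.
    return abs(n_new - n_stored) <= tolerance * max(n_new, n_stored, 1)


def is_duplicate(text: str, store: Dict[str, int]) -> bool:
    """Return True if `text` is judged a duplicate of a stored page.

    `store` maps key value -> character count of the stored page.
    """
    key = to_key(extract_feature_code(text))
    n_chars = len(text)
    stored = store.get(key)
    if stored is not None and meets_preset_requirement(n_chars, stored):
        return True
    # ASSUMPTION: a page that is not a duplicate is recorded only if
    # its key is new; the first page stored under a key is kept.
    store.setdefault(key, n_chars)
    return False
```

The character-count check acts as a cheap second filter: two pages whose feature codes collide are treated as duplicates only if their lengths are also close, which avoids a full-text comparison.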

Description

Technical field

[0001] The present application relates to the field of the Internet, in particular to a method and device for deduplication of webpages.

Background

[0002] In current web search results, users often get redundant pages with the same content, which not only wastes storage resources but also causes considerable inconvenience to users when searching.

[0003] However, there are currently few methods for deduplicating Chinese webpages, and the existing methods are incomplete. The main methods for deduplicating Chinese webpages are: the method based on clustering, the method based on the same URL, the method based on keyword position sequences, the method based on feature sentence extraction, and so on. A brief analysis of these follows:

[0004] 1. Clustering method

[0005] Clustering divides a collection of objects into several classes, such that the objects in each class are similar to each other but not similar to objects of other classes...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/958
Inventor: 唐小棚, 游永胜
Owner: ADVANCED NEW TECH CO LTD