Method and device for computing similarity of webpages

A web page similarity calculation technique, applied in the field of data processing. It addresses shortcomings of existing approaches: URL deduplication cannot detect similar content at different URLs, and link-based similarity calculation is immature and not very effective.

Inactive Publication Date: 2016-10-12
LETV HLDG BEIJING CO LTD +1
4 Cites, 15 Cited by

AI Technical Summary

Problems solved by technology

[0004] URL deduplication can only perform simple deduplication and cannot detect similar content at different URLs. Calculating similarity from link relationships is not yet mature, and because link relationships carry only a small weight in a web page, the results are poor. Calculating similarity from the structural characteristics of a web page only catches pages with exactly the same structure, and the website generally has its own webpag...

Embodiment Construction

[0054] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0055] It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two entities that share a name but are not the same, or two parameters that are not the same. "First" and "second" are used merely for convenience of expression and should not be construed as limiting the embodiments of the present invention; this will not be repeated in the subsequent embodiments.

[0056] According to the first aspect of the present invention, a web page similarity calculation method with a better similarity calculation effect is proposed. Figure 1 is a schematic flowchart of an embodiment of the web page similarity calculati...

Abstract

The invention discloses a method and a device for computing the similarity of webpages. The method comprises the following steps: obtaining webpage information of two webpages to be compared; extracting content information, structure information and picture information from the webpage information of each webpage; extracting content feature vectors from the content information of the two webpages and computing the similarity of the content feature vectors; extracting structure feature vectors from the structure information of the two webpages and computing the similarity of the structure feature vectors; extracting picture feature vectors from the picture information of the two webpages and computing the similarity of the picture feature vectors; and computing the final similarity of the two webpages from the similarity of the content feature vectors, the similarity of the structure feature vectors and the similarity of the picture feature vectors. According to the invention, the method and the device achieve a better similarity computing effect.
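The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how feature vectors are built, which similarity measure is used, or how the three similarities are combined. The sketch assumes token-count vectors, cosine similarity, and a weighted sum; the page fields (`content`, `structure`, `pictures`), the function names, and the weights are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def feature_vector(tokens):
    """Hypothetical feature extraction: count how often each token occurs."""
    return Counter(tokens)


def page_similarity(page_a, page_b, weights=(0.5, 0.3, 0.2)):
    """Combine content, structure and picture similarity.

    The weighted sum and the default weights are assumptions made for
    illustration; the source only says the final similarity is computed
    from the three partial similarities.
    """
    fields = ("content", "structure", "pictures")
    partial = [
        cosine_similarity(feature_vector(page_a[f]), feature_vector(page_b[f]))
        for f in fields
    ]
    return sum(w * s for w, s in zip(weights, partial))


# Hypothetical pre-extracted page information: words from the text,
# tag names from the markup, and some identifier per picture.
page_1 = {
    "content": ["breaking", "news", "story", "today"],
    "structure": ["html", "body", "div", "p", "img"],
    "pictures": ["img-hash-a", "img-hash-b"],
}
page_2 = {
    "content": ["breaking", "news", "story", "reposted"],
    "structure": ["html", "body", "div", "div", "p"],
    "pictures": ["img-hash-a"],
}

score = page_similarity(page_1, page_2)
print(f"final similarity: {score:.3f}")
```

Keeping the three partial similarities separate until the last step means each signal can be weighted or replaced independently, for example to give the content signal more influence than the picture signal.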

Description

Technical field

[0001] The invention relates to the technical field of data processing, in particular to a method and device for calculating web page similarity.

Background

[0002] At present, there are many duplicate web pages on the Internet, so the likelihood of crawling similar web pages from different websites is very high. For example, when a piece of news appears, it is reproduced, copied and disseminated in various forms on the Internet, resulting in a very high degree of similarity between two web pages. Such highly similar web pages may be on the same website or on different websites. When search engines collect web pages, they usually compare two web pages to see whether they are similar, and deduplicate or aggregate web pages with high similarity.

[0003] Existing similarity calculation methods include several methods: 1) use URL to remove duplicates; 2) use content to calculate similarity; 3) use link relatio...

Application Information

IPC(8): G06F17/30
CPC: G06F16/951, G06F16/986
Inventor: 谭露
Owner: LETV HLDG BEIJING CO LTD