SIMD optimization-based webpage duplication elimination and concurrency method

Technology relating to web pages and web page features, applied in special data processing applications, instruments, electrical digital data processing, etc.

Inactive Publication Date: 2011-04-20
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

However, selecting the size and number of text blocks is difficult. The most complete text block is the full text of the



Examples


Embodiment 1

[0078] The main steps of the SIMD-based web page deduplication algorithm in this example are as follows:

[0079] 1. Web page text information extraction. This step extracts the effective information of the web page.

[0080] 2. Shingle extraction. This step extracts the features of the web page.

[0081] 3. Clustering. This step reduces the number of comparisons, and with it the time and space complexity.

[0082] 4. Fingerprint comparison. This step finds similar web pages and eliminates them.
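Steps 2 and 4 can be illustrated with a minimal sketch. It is not the patent's implementation: the word-level shingle size `k=4`, the BLAKE2b-based 64-bit fingerprint, the Jaccard resemblance measure, and all function names are assumptions chosen for illustration, and no SIMD optimization is shown.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text.
    The value k=4 is an illustrative assumption, not taken from the patent."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprint(shingle):
    """Map one shingle to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def resemblance(a, b):
    """Jaccard resemblance of two fingerprint sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
fp_a = {fingerprint(s) for s in shingles(page_a)}
fp_b = {fingerprint(s) for s in shingles(page_b)}
score = resemblance(fp_a, fp_b)  # the two pages differ only in their last shingle
```

Two pages would then be treated as near-duplicates when `score` exceeds a chosen threshold; the threshold itself is a tuning parameter that this excerpt does not specify.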

[0083] Each step is described as follows:

[0084] 1 Web page text information extraction

[0085] The text structure of a web page mainly includes a physical structure and a logical structure. Physical structure refers to the composition of web pages, mainly including web page tags, web page titles, web content, article titles, advertisements and other information; logical structure mainly refers to the structure between par...
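As a rough illustration of separating page content from markup, the following sketch uses Python's standard `html.parser` module. The class name `TextExtractor`, the helper `extract_text`, and the choice to skip only `script` and `style` elements are assumptions for this example; the patent's own extraction rules (for instance, how advertisements are filtered) are not reproduced here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the page title and visible text, ignoring script and style content."""
    SKIP = {"script", "style"}  # assumed set of non-content elements

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)

def extract_text(html):
    """Return (title, body_text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)

sample = ("<html><head><title>News</title><style>p{color:red}</style></head>"
          "<body><h1>Headline</h1><p>Body text.</p>"
          "<script>var x = 1;</script></body></html>")
title, body = extract_text(sample)
```

Here `title` holds the web page title and `body` the remaining visible text, which would be the input to shingle extraction.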


Abstract

The invention discloses a SIMD (single instruction, multiple data) optimization-based method for parallel web page deduplication, comprising the following steps: 1, extracting the text information of web pages, that is, their effective information; 2, extracting shingles, that is, extracting web page features and generating a set of shingles; 3, clustering, to reduce the number of comparisons and the time and space complexity; and 4, comparing fingerprints to find similar web pages and delete them. The method maintains precision and recall while effectively increasing the speed of web page similarity detection.
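The role of clustering in step 3 can be sketched as follows. The bucketing rule used here (grouping pages by their smallest fingerprint) is an assumption made for illustration and is not stated in this excerpt; the function names, the 0.6 threshold, and the sample data are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Jaccard resemblance of two fingerprint sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def find_duplicates(pages, threshold=0.6):
    """pages maps page id -> set of integer fingerprints.
    Pages are first bucketed by their minimum fingerprint (assumed clustering rule),
    and only pages in the same bucket are compared pairwise."""
    buckets = defaultdict(list)
    for pid, fps in pages.items():
        buckets[min(fps) if fps else None].append(pid)

    removed = set()
    for members in buckets.values():
        for p, q in combinations(members, 2):
            if p in removed or q in removed:
                continue
            if jaccard(pages[p], pages[q]) >= threshold:
                removed.add(q)  # keep the earlier page, drop the later one
    kept = [pid for pid in pages if pid not in removed]
    return kept, removed

pages = {
    "a": {1, 2, 3, 4, 5},
    "b": {1, 2, 3, 4, 6},   # near-duplicate of "a"
    "c": {10, 11, 12},      # falls in a different bucket, never compared with "a"
    "d": {1, 20, 21, 22},   # same bucket as "a" but dissimilar
}
kept, removed = find_duplicates(pages)
```

With four pages, an all-pairs comparison would need six resemblance computations, whereas bucketing limits the work to the three pages that share a bucket. A single-key rule like this one can miss near-duplicates whose minimum fingerprints differ, so it trades some recall for speed.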

Description

Technical field

[0001] The invention belongs to the technical field of computer applications and relates to a parallel method for web page deduplication based on SIMD optimization. SIMD (Single Instruction, Multiple Data) is a technique that achieves spatial parallelism by using one controller to control multiple processing units, which perform the same operation simultaneously on each element of a set of data (also known as a "data vector"). In microprocessors, SIMD technology takes the form of one controller driving multiple parallel processing micro-units, as in Intel's MMX and SSE and AMD's 3DNow! technologies.

Technical background

[0002] With the rapid development of computer science and network technology, the network has become an important way for people to obtain information. At present, the biggest difficulty that search engines face is that the returned result sets contain a lar...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 龙军, 张祖平, 袁鑫攀, 罗跃逸
Owner: CENT SOUTH UNIV