Inverted index representation method and system based on a data deduplication architecture

A data deduplication and inverted index technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the need to encode every document serial number, reducing the number of document serial numbers and improving the compression rate.

Active Publication Date: 2016-12-07
NANKAI UNIV

AI Technical Summary

Problems solved by technology

[0016] The purpose of the present invention is to solve the problem that the existing inverted index compression method needs to encode the serial number of each document, and to provide a new type of inverted index representation method a

Examples


Embodiment 1

[0037] The process of the inverted index representation method based on the data deduplication architecture is shown in Figure 1. An inverted index representation system implementing the method is shown in Figure 2.

[0038] We call a sequence of document serial numbers with consecutive values in an inverted list a sequence interval. For example, the sequence {10,11,12,13,14} is a single sequence interval, while the sequence {10,11,13,14} contains two sequence intervals: the first is {10,11} and the second is {13,14}. We observe that inverted lists contain a large number of such sequence intervals, so in the present invention we propose two strategies for identifying repeated document sequences between different lists: C1, identify any repeated document sequence; C2, identify only repeated sequence intervals. For the case where the sequence pattern is a sequence interval, we use the run-length representa...
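The definition of a sequence interval in [0038] can be illustrated with a minimal Python sketch. The function names are hypothetical, and because the source text is cut off before it describes its run-length representation, the (first docID, length) pair used here is only an assumed layout, not the patent's format.

```python
def split_sequence_intervals(doc_ids):
    """Split a sorted list of docIDs into maximal runs of consecutive values."""
    intervals = []
    for doc_id in doc_ids:
        if intervals and doc_id == intervals[-1][-1] + 1:
            # doc_id continues the current run
            intervals[-1].append(doc_id)
        else:
            # a gap (or the first element) starts a new run
            intervals.append([doc_id])
    return intervals


def run_length(interval):
    """Assumed run-length form of one sequence interval: (first docID, length)."""
    return (interval[0], len(interval))


print(split_sequence_intervals([10, 11, 12, 13, 14]))  # [[10, 11, 12, 13, 14]]
print(split_sequence_intervals([10, 11, 13, 14]))      # [[10, 11], [13, 14]]
print(run_length([10, 11, 12, 13, 14]))                # (10, 5)
```

Note that this sketch also returns isolated docIDs as runs of length one; whether the patent treats those as sequence intervals is not stated in the text above.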

Embodiment 2

[0073] On the TREC GOV2 dataset, we compared the number of bits required per document serial number and the corresponding decompression speed for several index encodings. EF denotes the inverted index representation method based on the optimal segmentation strategy and Elias-Fano encoding; TD denotes the inverted index based on traditional d-gaps; R denotes the index representation based on the deduplication architecture, where I and II indicate the repeated-sequence identification strategy adopted, corresponding to strategies C1 and C2 described above. The inverted index data set used is described as follows:

[0074] (1) TREC GOV2 is a data set crawled from the .gov domain in 2004, containing more than 25 million web pages;

[0075] (2) We use the TREC 2009 query set as the query test set, which contains a total of 32,244 queries, and is used to test the average decompression speed of various forms of indexes for the...
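For readers unfamiliar with the TD baseline named in [0073], the traditional d-gap transform can be sketched as follows. This is a generic illustration of d-gaps rather than code from the patent; the function names are hypothetical, and the bit-level encoding applied to the gaps afterwards is omitted.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Keep the first docID, then store differences between adjacent docIDs."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_dgaps(gaps):
    """Recover the docIDs by taking the running sum of the gaps."""
    return list(accumulate(gaps))


gaps = to_dgaps([10, 11, 13, 14, 20])
print(gaps)              # [10, 1, 2, 1, 6]
print(from_dgaps(gaps))  # [10, 11, 13, 14, 20]
```

Because the gaps are smaller than the docIDs themselves, they need fewer bits on average. Every document serial number still produces one encoded value, which is the cost the deduplication approach aims to reduce.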


Abstract

The invention discloses an inverted index representation method and system based on a data deduplication architecture, suitable for search engines and social network data processing. The method comprises the following steps: 1, traversing the inverted lists in the inverted index, and identifying and recording sequence patterns that occur repeatedly across different inverted lists; 2, calculating the length of each sequence pattern, performing the corresponding operations according to that length, and assigning each sequence pattern a pattern serial number according to the lexicographical order of the sequence patterns; 3, reducing the inverted index according to the sequence patterns, and storing the sequence patterns and the reduced inverted lists separately; and 4, difference processing: computing the differences between adjacent document serial numbers within the sequence patterns, and recording the pattern serial numbers and the position offsets of adjacent pattern serial numbers, wherein the pattern serial numbers are expressed as two-tuples. The method and system effectively remove repeated data from the inverted index, thereby reducing the number of document serial numbers, improving the compression rate of the inverted index, shortening the query response time of search engines, and improving the user experience.

Description

Technical field

[0001] The invention belongs to the technical field of inverted index compression for search engines, and in particular relates to an inverted index representation method and system based on a data deduplication architecture. The invention is also applicable to data compression and query problems on social network graphs.

Background technique

[0002] The inverted index is the most widely used data structure in modern search engines. It consists of two parts: a dictionary and inverted lists. The dictionary stores the terms obtained after processing the document collection, the document frequency of each term, and a pointer to the posting list corresponding to the term. A posting list is composed of multiple posting records, and each posting record corresponds to a document containing the term. The information recorded in a posting record includes: document serial number (called docID), term frequency (the number of times the term ...
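The dictionary and posting-list structure described in [0002] can be sketched in a few lines of Python. This is a toy in-memory illustration: the function name is hypothetical, docIDs are assumed to be positions in the input list, and the dictionary's pointer to the posting list is modelled simply by holding the list itself.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: list of token lists; the docID is the position in that list."""
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # One posting record per (term, document): (docID, term frequency).
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    # Dictionary entry: term -> (document frequency, posting list).
    return {term: (len(plist), plist) for term, plist in postings.items()}


docs = [["web", "search", "web"], ["search", "index"]]
index = build_inverted_index(docs)
print(index["web"])     # (1, [(0, 2)])
print(index["search"])  # (2, [(0, 1), (1, 1)])
```

Since documents are processed in increasing docID order, each posting list is sorted by docID, which is the property that d-gap and sequence-interval representations rely on.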


Application Information

IPC(8): G06F17/30
CPC: G06F16/2228
Inventor 刘晓光张曌华梁津李天龙童健聪黄海兵王刚
Owner NANKAI UNIV