Pretreatment method for compressing inverted index

An inverted index preprocessing technology, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses low parallel decompression efficiency and poor suitability for combination with set merging methods.

Active Publication Date: 2011-06-01
NANKAI UNIV

AI Technical Summary

Problems solved by technology

[0019] The purpose of the present invention is to provide a new preprocessing method for inverted index compression based on linear regression. It addresses two problems of inverted index compression methods built on the existing d-gap preprocessing method: their parallel decompression efficiency is low, and they are not suitable for combination with set merging methods.

Examples

Embodiment 1

[0047] Referring to Figure 4, which shows the first embodiment of the preprocessing method for inverted index compression of the present invention, the specific steps are as follows:

[0048] Step S401: for each posting list, make a two-dimensional scatter plot with the index $x_i$ of each docID as the abscissa and its value $y_i$ as the ordinate, where $x_i$ and $y_i$ are non-negative integers, $i = 1, \ldots, n$, and $n$ is a positive integer. A linear regression line $y = f(x) = \alpha + \beta x$ is generated by the least squares method so that the sum of squares of the vertical deviations $y_i - f(x_i)$ from all points $(x_i, y_i)$ in the plot to the line, $\sum_{i=1}^{n} (y_i - f(x_i))^2$, is minimal, which yields a vertical deviation list equivalent to the posting list. This process is called linear regression. Only the slope, the intercept, and the vertical deviation list need to be calculated offline and saved to a file; the corresponding posting list can then be computed from them during online decompression, that is to say, th...
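As a minimal sketch of the regression in step S401, the Python below fits the line with the closed-form least squares formulas and returns the vertical deviations. The function name `fit_line`, the choice of indices 1..n, and the sample docIDs are illustrative assumptions and do not come from the patent text.

```python
def fit_line(doc_ids):
    """Least-squares line y = alpha + beta * x through points (i, doc_ids[i-1]).

    Assumes indices x_i = 1..n (an illustrative choice) and n >= 2.
    Returns (alpha, beta, deviations) with deviations[i-1] = y_i - f(x_i).
    """
    n = len(doc_ids)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(doc_ids) / n
    # Closed-form slope and intercept that minimise the sum of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, doc_ids))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    deviations = [y - (alpha + beta * x) for x, y in zip(xs, doc_ids)]
    return alpha, beta, deviations


# Hypothetical posting list of sorted docIDs.
alpha, beta, deviations = fit_line([3, 7, 11, 18, 21])
```

For a least squares fit with an intercept the deviations sum to zero, so they scatter around zero instead of growing with the docID values.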

Embodiment 2

[0064] Referring to Figure 5, which shows the second embodiment of the preprocessing method for inverted index compression of the present invention, the specific steps are as follows:

[0065] Step S501: for each posting list, make a two-dimensional scatter plot with the index $x_i$ of each docID as the abscissa and its value $y_i$ as the ordinate, where $x_i$ and $y_i$ are non-negative integers, $i = 1, \ldots, n$, and $n$ is a positive integer. A linear regression line $y = f(x) = \alpha + \beta x$ is generated by the least squares method so that the sum of squares of the vertical deviations $y_i - f(x_i)$ from all points $(x_i, y_i)$ in the plot to the line, $\sum_{i=1}^{n} (y_i - f(x_i))^2$, is minimal, which yields a vertical deviation list equivalent to the posting list.

[0066] Step S502: for each vertical deviation list, round every vertical deviation $y_i - f(x_i)$ up to an integer, denoted $\lceil y_i - f(x_i) \rceil$, to obtain an integer deviation list equivalent to this vertical deviation list.
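Rounding up loses no information because every docID is an integer: the rounded-up deviation equals the docID minus the floor of the line value, so the docID can be recovered exactly. The following is a minimal sketch under the assumption that encoder and decoder evaluate the line with identical floating-point arithmetic. The function names and the line parameters are hypothetical; the parameters were chosen to be exactly representable and are not a fitted result.

```python
import math


def line(alpha, beta, x):
    """Evaluate the regression line f(x) = alpha + beta * x."""
    return alpha + beta * x


def integer_deviations(doc_ids, alpha, beta):
    """Rounded-up deviations ceil(y_i - f(x_i)) for indices x_i = 1..n.

    Because y_i is an integer, ceil(y_i - f(x_i)) equals y_i - floor(f(x_i));
    the second form avoids rounding error in the float subtraction.
    """
    return [y - math.floor(line(alpha, beta, x))
            for x, y in enumerate(doc_ids, start=1)]


def restore_doc_ids(int_devs, alpha, beta):
    """Invert integer_deviations: y_i = floor(f(x_i)) + deviation_i."""
    return [math.floor(line(alpha, beta, x)) + d
            for x, d in enumerate(int_devs, start=1)]


doc_ids = [3, 7, 11, 18, 21]   # hypothetical posting list
alpha, beta = -2.0, 4.5        # assumed line parameters for this sketch
int_devs = integer_deviations(doc_ids, alpha, beta)
restored = restore_doc_ids(int_devs, alpha, beta)
```

Here `restored` equals `doc_ids`, which illustrates why the integer deviation list can stand in for the original posting list.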

[0067] Step S503, for each integer dispersion list, if the integer disper...

Embodiment 3

[0076] Referring to Figure 6, which shows the third embodiment of the preprocessing method for inverted index compression of the present invention, the specific steps are as follows:

[0077] Step S601: for each posting list, make a two-dimensional scatter plot with the index $x_i$ of each docID as the abscissa and its value $y_i$ as the ordinate, where $x_i$ and $y_i$ are non-negative integers, $i = 1, \ldots, n$, and $n$ is a positive integer. A linear regression line $y = f(x) = \alpha + \beta x$ is generated by the least squares method so that the sum of squares of the vertical deviations $y_i - f(x_i)$ from all points $(x_i, y_i)$ in the plot to the line, $\sum_{i=1}^{n} (y_i - f(x_i))^2$, is minimal, which yields a vertical deviation list equivalent to the posting list.

[0078] Step S602: divide each vertical deviation list into segments of equal length. To achieve a better compression ratio, the segment length $s$ is generally taken as a power of 2, such as 128 or 256.
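The segmentation in step S602 can be sketched as plain list slicing. The function name `split_into_segments` and the sample data are hypothetical, and the treatment of a shorter final segment is an assumption, since the visible text does not say how a list whose length is not a multiple of s is handled.

```python
def split_into_segments(values, segment_length=128):
    """Split a list into consecutive segments of segment_length items.

    The final segment may be shorter when len(values) is not a multiple
    of segment_length (an assumption of this sketch).
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [values[i:i + segment_length]
            for i in range(0, len(values), segment_length)]


# Hypothetical deviation list of 300 entries, split with s = 128.
segments = split_into_segments(list(range(300)), segment_length=128)
segment_sizes = [len(segment) for segment in segments]
```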

[0079] Step S603, for each segment of each vertical deviation lis...

Abstract

The invention relates to a preprocessing method for inverted index compression, which comprises the following steps: for each posting list, drawing a two-dimensional scatter plot with the docID indices as the abscissa and the docID values as the ordinate, and generating a linear regression line by the least squares method so that the sum of squares of the vertical deviations from each point in the plot to the line is minimal, thereby obtaining a vertical deviation list equivalent to the posting list; for each vertical deviation list, rounding up all the vertical deviations to obtain an integer deviation list equivalent to the vertical deviation list; and for each integer deviation list, calculating the minimum value and subtracting it from all the integer deviations to obtain a non-negative integer deviation list equivalent to the integer deviation list. Compression algorithms based on the preprocessing method provided by the invention achieve a higher compression ratio, improve parallel decompression efficiency, and can be combined better with set merging methods.
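The final step of the abstract, subtracting the minimum so that every entry becomes non-negative, can be sketched as follows. The function names and sample values are hypothetical. Keeping the minimum alongside the list so that the shift can be undone, and the fixed-width bit count at the end, are assumptions of this sketch rather than details stated in the abstract.

```python
def shift_to_nonnegative(int_devs):
    """Subtract the list minimum from every entry.

    Returns (minimum, shifted); every value in shifted is >= 0.
    """
    minimum = min(int_devs)
    return minimum, [d - minimum for d in int_devs]


def unshift(minimum, shifted):
    """Undo shift_to_nonnegative by adding the minimum back."""
    return [d + minimum for d in shifted]


int_devs = [1, -3, 0, 2, -1]   # hypothetical integer deviations
minimum, shifted = shift_to_nonnegative(int_devs)
# Bits per entry if each shifted value were stored at one fixed width
# (an assumed storage scheme, used only to illustrate the value range).
bits_per_entry = max(shifted).bit_length()
```

Non-negative values can be passed to integer coders that accept only unsigned input, and adding the stored minimum back restores the integer deviation list.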

Description

【Technical field】

[0001] The invention relates to the field of inverted index compression, in particular to a preprocessing method for inverted index compression.

【Background technique】

[0002] The most widely used data structure in full-text search engines is the inverted index. An inverted index consists of two main parts: a dictionary and posting lists (inverted lists). The dictionary establishes a one-to-one correspondence between keywords and posting lists, and the posting list of a keyword is composed of a series of basic units called postings. Given a keyword, each of its postings may contain information such as the document identifier (called docID), the frequency, and the positions for a webpage in which the keyword appears, or may contain only the docID of that webpage. In this invention, we assume that each posting list consists of a series of docIDs.

[0003] The full-text search engine continuously receives user query requests, performs word segmentation on query requests to obtain sev...

Application Information

IPC(8): G06F17/30
Inventor: 敖耐勇, 吴迪, 张帆, 刘晓光, 王刚
Owner: NANKAI UNIV