Random sampling as a built-in function for database administration and replication

A database and built-in function technology, applied in relational databases, data processing applications, instruments, and similar systems. It addresses the difficulty of analyzing very large databases and the impracticality of exact analysis, so as to reduce the number of system calls, the analysis time, and the strain on the computer system.

Inactive Publication Date: 2006-04-11
IBM CORP

AI Technical Summary

Benefits of technology

[0028]Raw partition analysis, without random sampling analysis, places a heavy strain on a computer system in terms of memory usage and typically requires multiple dataspaces. Random sampling relieves the strain on the computer system in terms of processing and memory requirements. Much less memory is required to analyze 20,000 sampled records using the random sampling approach than to analyze 2,000,000,000 records without sampling. However, in order to maintain consistency with an unsampled approach which may be desirable under some circumstances, the preferred method using random sampling analysis utilizes one or more of each of the following types of dataspaces: index, key and statistics.
[0029]One benefit obtained from the present invention as a result of providing a built-in sampling facility is the reduction in the number of system calls required to perform an approximation partition analysis.
[0030]Another benefit obtained from the present invention is the reduction in time required to perform an approximation partition analysis compared to the time required for an exact partition analysis.
[0031]Still another benefit obtained from the present invention is that approximation partition analyses can be performed frequently without straining or otherwise compromising computer system resources.
[0032]Yet another benefit obtained from the present invention is an improved accuracy of the analyses, particularly for homogeneous database populations.
[0033]Yet another benefit obtained from the present invention is that a random sample of predetermined size is obtained without prior knowledge of the number of records in the sampled database.

Problems solved by technology

Analysis of these large databases for administration and replication purposes typically involves processes which are very input/output intensive, as numerous queries must be performed by an analysis program across a vast number of records.
It is typically not possible to provide an exact analysis without first taking a database offline for an extended period of time.
The method and system provided are unique in that they select a random sample of predetermined size, uniformly distributed across the entire database, from a database of known or unknown size. Only a fraction of the records in the database are read, and the entire database need not be indexed, which, as indicated above, is time consuming and provides results having an unnecessary degree of precision.
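As an illustration of how a fixed-size, uniformly distributed sample can be drawn without knowing the record count in advance, the following is a minimal sketch of classic reservoir sampling (Algorithm R). It is not taken from the patent text and is not claimed to be the patented method. In particular, this simple variant visits every record, whereas the text above describes reading only a fraction of them. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    The total number of records does not need to be known beforehand.
    Hypothetical helper for illustration only (Algorithm R).
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Draw 5 record keys from a stream whose length is not inspected up front.
    print(reservoir_sample(iter(range(1_000_000)), 5))
```

Memory use is proportional to the sample size `k`, not to the size of the database, which matches the motivation given in paragraph [0028].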

Embodiment Construction

[0038]The capacity of DL/I databases is limited by the maximum size of a data set that can be addressed by a four-byte relative byte address (RBA). Many other databases in use presently suffer from similar size limitations. In current full function databases managed by database management systems such as IMS, multiple data sets are supported. This helps to increase the capacity of the database. One requirement, however, is that all segments of the same type must be in the same data set. As a result, when one data set is full, the database is deemed to be essentially full even if empty space exists in the remaining data sets. As a consequence, methods have been developed to extend the capacity of such databases.
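To make the four-byte RBA limit concrete, the short sketch below computes the largest byte range a fixed-width address can cover. It assumes the address is treated as an unsigned integer with one address per byte. That assumption is not stated in the text, and the actual data set limits of a given access method may differ. The function name `max_addressable_bytes` is hypothetical.

```python
def max_addressable_bytes(address_width_bytes: int) -> int:
    """Number of distinct byte offsets an unsigned address of this width can name.

    Assumption for illustration: unsigned address, one address per byte.
    """
    return 2 ** (8 * address_width_bytes)


if __name__ == "__main__":
    limit = max_addressable_bytes(4)  # four-byte relative byte address
    print(f"{limit} bytes = {limit / 2**30:.0f} GiB")
```

Under that assumption a single data set tops out at 2^32 bytes, which is why spreading segments of the same type across several data sets raises the overall capacity.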

[0039]As shown in FIG. 1, partitioning removes the data set limitation by relieving the restriction that all occurrences of the same segment type must be in the same data set. Partitioning database 10 groups database records into sets of partitions 12 that are treated as a sin...

Abstract

A database management system and method for administration and replication having a built-in random sampling facility for approximation partition analysis on very large databases. The method utilizes a random sampling algorithm that provides results accurate to within a few percentage points for large homogeneous databases. The accuracy is not affected by the size of the database and is determined primarily by the size of the sample. The system and method for approximate partition analysis reduces the time required for an analysis to a fraction of the time required for an exact analysis. The database management system is configured with the random sampling facility built-in thereby enabling even greater efficiency by reducing communication overhead between an analysis program and the database management system to a fraction of the overhead required when sampling is performed by a separate analysis program. The reduction in time thereby permits frequent and timely analyses for replication and administration of database partitions.
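The statement that accuracy depends mainly on sample size rather than database size can be illustrated with the textbook margin of error for an estimated proportion. The sketch below is not from the patent. It assumes simple random sampling, a sample that is small relative to the population, a normal approximation, and a 95% confidence level (z = 1.96). The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate half-width of a confidence interval for a sampled proportion.

    Assumptions for illustration: simple random sampling, normal
    approximation, population much larger than the sample. The default
    proportion of 0.5 gives the worst case (widest interval).
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


if __name__ == "__main__":
    # The population size never appears in the formula; only the sample size does.
    for n in (1_000, 20_000, 100_000):
        print(f"n={n:>7}: +/- {100 * margin_of_error(n):.2f} percentage points")
```

Because the population size is absent from the formula, the same 20,000-record sample gives about the same margin whether it is drawn from two million or two billion records, under the stated assumptions.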

Description

CROSS-REFERENCE TO RELATED APPLICATION

[0001]This application is related to U.S. application Ser. No. 09/897,853, filed together with this application, entitled Partition Boundary Determination Using Random Sampling on Very Large Databases.

BACKGROUND OF THE INVENTION

[0002]The invention pertains to partition size analysis for very large databases having multiple partitions and, more particularly, to accurate, fast, and scalable characterization and estimation of large populations using a random sampling function that is integrated directly into a database engine.

[0003]Databases provide a means to conveniently store and retrieve a wealth of information such as, in the business setting, individual and corporate accounts and, in the business example provide a means to analyze business trends and make other business, educational, and scientific decisions. Accordingly, over the years, typical database populations reach upward of a billion rows and records.

[0004]Analysis of these large data...

Application Information

Patent Type & Authority: Patents (United States)
IPC(8): G06F17/30, G06F12/00
CPC: G06F17/30595, Y10S707/99953, G06F16/284
Inventor: HARPER, JOHN WILLIAM; SLISHMAN, GORDON ROBERT
Owner: IBM CORP