A method and apparatus for sampling a data table
By performing initial and subsequent partitioning of the primary key index structure of the database table to exclude null value ranges, uniform sampling of the data table is achieved, solving the problems of uneven sampling and high null value rate in existing technologies, and improving the accuracy and efficiency of data table management.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CORP LTD
- Filing Date
- 2022-08-10
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, the sampling methods for database tables cannot be evenly distributed, easily resulting in the collection of a large number of null values. Furthermore, it is difficult to accurately identify the business to which the data table belongs and to screen sensitive information, leading to difficulties in data asset management.
By obtaining the primary key index structure of the target data table, initial partitioning and further partitioning are performed, null value intervals are excluded, and primary key IDs are sampled until the target number is reached, ensuring the uniformity and accuracy of the sampling results.
It enables uniform and rapid sampling of database tables, reduces the impact of null values, ensures the representativeness of sampling results, and improves the efficiency of previewing and managing data table content.
Smart Images

Figure CN115438042B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of databases, and in particular to a method and apparatus for sampling data tables. Background Technology
[0002] Currently, databases such as MySQL are widely used for persistent storage in numerous enterprise-level internet projects, becoming core data assets for enterprises. Therefore, the amount of data stored in these databases is often very large. However, due to the diverse nature of business processes and frequent iterations of requirements, it's rare to perform statistical maintenance on these data assets during database use. This leads to the following problems:
[0003] Because the data table content is not visible, it is impossible to determine whether it is a junk table, which business it belongs to, and it is difficult to investigate and classify sensitive information.
[0004] Therefore, it is necessary to develop a fast and accurate data table scanning scheme to facilitate the identification of tables in the database and provide preview and sampling analysis capabilities.
[0005] In existing technologies, one way to sample a data table is to take the first few records in ascending or descending order of the primary key ID. However, the data sampled using this method is likely to be limited to the earliest and latest inserted data, and cannot be evenly distributed throughout the table.
[0006] Another approach to sampling a data table is to use a random primary key ID to retrieve data from the table. However, this approach inevitably encounters the problem of uneven distribution of primary keys. Because deletion and updates cause primary key IDs to not increment at equal intervals, there are many gaps in primary key IDs, which may result in the collection of a large number of null values. In other words, the data portion corresponding to the sampled primary key ID may be null, which may not be representative of the entire storage space. Summary of the Invention
[0007] In view of the above problems, a data table sampling method and apparatus are proposed to overcome or at least partially solve the above problems, comprising:
[0008] A method for sampling a data table, the method comprising:
[0009] Obtain the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID;
[0010] The target data table is initially partitioned to obtain multiple initial data table intervals, and samples are taken for each initial data table interval to obtain multiple primary key IDs;
[0011] Repeatedly execute the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and then sampling the surviving data table intervals after the re-segmentation operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number.
[0012] Optionally, determining one or more live data table ranges based on multiple primary key IDs obtained from the current sampling includes:
[0013] Determine the range of multiple candidate data tables containing the multiple primary key IDs obtained from the current sampling;
[0014] For each candidate data table interval, the non-surviving data table intervals between the primary key ID located at the boundary and the sampled primary key ID are determined, and the data table intervals other than the non-surviving data table intervals in the candidate data table intervals are determined as surviving data table intervals.
[0015] Optionally, the sampling for each initial data table interval yields multiple primary key IDs, including:
[0016] For each initial data table range, determine the primary key ID located at the boundary;
[0017] Within each initial data table interval, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0018] Optionally, the initial partitioning operation is an equally spaced partitioning operation, wherein the initial partitioning operation on the target data table to obtain multiple initial data table intervals includes:
[0019] The target data table is divided into multiple initial data table intervals by performing an equally spaced partitioning operation.
[0020] Optionally, the sampling of the surviving data table intervals after the re-division operation yields multiple primary key IDs, including:
[0021] For the surviving data table range after the re-division operation, determine the primary key ID located at the boundary;
[0022] Within the surviving data table interval after the re-segmentation operation, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0023] Optionally, the re-partitioning operation is a binary search operation. Before sampling the surviving data table intervals after the re-partitioning operation to obtain multiple primary key IDs, the method further includes:
[0024] Perform a binary search operation on each interval of the survival data table;
[0025] Sample the primary key ID located in the middle and use the primary key ID located in the middle as the primary key ID located at the boundary to obtain the surviving data table interval after the binary search operation.
[0026] Optionally, the target data table is a data table in a MySQL database based on the InnoDB storage engine.
[0027] A sampling device for a data table, the device comprising:
[0028] The target data acquisition module is used to acquire the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID;
[0029] The initial partitioning module is used to perform initial partitioning operations on the target data table to obtain multiple initial data table intervals, and to sample each initial data table interval to obtain multiple primary key IDs;
[0030] The repetitive execution module is used to repeatedly execute the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and sampling the surviving data table intervals after the re-segmentation operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number.
[0031] An electronic device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the sampling method of the data table described above.
[0032] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the sampling method for the data table described above.
[0033] The embodiments of the present invention have the following advantages:
[0034] In this embodiment of the invention, a target data table is obtained. The primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID. Then, an initial partitioning operation is performed on the target data table to obtain multiple initial data table intervals. For each initial data table interval, sampling is performed to obtain multiple primary key IDs. The process is repeated to determine one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling. The surviving data table intervals after the second partitioning operation are then sampled to obtain multiple primary key IDs until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number. This achieves uniform and fast sampling of the data table, solves the problem of uneven distribution of primary key IDs, ensures that the data table sampled based on the primary key ID is non-nullable, and reduces the impact of nullables in the data table on the sampling results. Attached Figure Description
[0035] To more clearly illustrate the technical solution of the present invention, the accompanying drawings used in the description of the present invention will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0036] Figure 1 This is a flowchart of the steps of a data table sampling method provided in an embodiment of the present invention;
[0037] Figure 2 This is a flowchart of another data table sampling method provided in an embodiment of the present invention;
[0038] Figure 3a This is a schematic diagram of a data table sampling method provided in an embodiment of the present invention;
[0039] Figure 3b This is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0040] Figure 3c This is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0041] Figure 4a This is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0042] Figure 4b This is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0043] Figure 4c This is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0044] Figure 4dThis is a schematic diagram of another data table sampling method provided in an embodiment of the present invention;
[0045] Figure 5 This is a structural block diagram of a data table sampling device provided in an embodiment of the present invention. Detailed Implementation
[0046] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0047] In this embodiment of the invention, based on the underlying principle of InnoDB, the default storage engine of MySQL, by utilizing the InnoDB primary key index and dividing the table data range, the null value range corresponding to the primary key ID (i.e., the data table area where the data part is all null values) is cleared one by one. More data is collected in the area where the primary key ID is densely distributed and less data is collected in the sparse area to achieve the set sampling rate, so as to achieve uniform and fast sampling of the data table, thereby enabling the content preview of the data table and more accurately describing the data storage content over the entire time range.
[0048] Moreover, the time cost of taking the "upper bound" of a certain value X (i.e., the minimum value greater than or equal to X) is the same as the time cost of taking a row of data based on the primary key ID. It only adds one step of scanning one position backward according to the primary key order, which has a time cost that is several orders of magnitude smaller. In other words, the time complexity of database operations in the embodiments of the present invention is the same as the time complexity of taking data based on a single primary key ID.
[0049] The following provides further explanation:
[0050] Reference Figure 1 The diagram illustrates a structural schematic of a data table sampling method according to an embodiment of the present invention, which may specifically include the following steps:
[0051] Step 101: Obtain the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID.
[0052] In practical applications, the target data table to be sampled can be obtained. Specifically, the target data table can be queried from the database through the client, and the target data table can be stored in the database.
[0053] In this embodiment of the invention, the operation of querying and sampling the target data table is performed by the database, while other processes such as dividing the data table into intervals are performed by the client. The database only performs a few primary key ID searches, which puts little pressure on the database.
[0054] The target data table can be a data table in a MySQL database based on the InnoDB storage engine. Since the target data table can be a data table in a MySQL database based on the InnoDB storage engine, the InnoDB storage engine stores the primary key index and all data of the MySQL data table in a primary key index structure, and the data in the primary key index is stored in order of size of the primary key ID.
[0055] Step 102: Perform an initial partitioning operation on the target data table to obtain multiple initial data table intervals, and sample each initial data table interval to obtain multiple primary key IDs.
[0056] The primary key ID obtained from the sampling is the primary key ID corresponding to a non-null value in the data part.
[0057] In the specific implementation, an initial partitioning operation can be performed on all data table intervals in the target data table to obtain multiple initial data table intervals. Then, each initial data table interval can be sampled to obtain multiple primary key IDs, with one primary key ID obtained from sampling each initial data table interval.
[0058] In one embodiment of the present invention, the initial partitioning operation can be an equally spaced partitioning operation. The initial partitioning of the target data table to obtain multiple initial data table intervals can include:
[0059] The target data table is divided into multiple initial data table intervals by performing an equally spaced partitioning operation.
[0060] In practice, the target data table can be divided into multiple initial data table regions by equal intervals. In other words, the width of each initial data table region is the same.
[0061] In one embodiment of the present invention, the step of sampling for each initial data table interval to obtain multiple primary key IDs may include:
[0062] For each initial data table interval, determine the primary key ID located at the boundary; within each initial data table interval, sample the smallest primary key ID from primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0063] For each data table interval, there are two boundaries: a right boundary (larger) and a left boundary (smaller). The primary key ID located at the left boundary can be determined, which is the smallest primary key ID.
[0064] After determining the primary key ID located at the left boundary, sampling can be performed on other primary key IDs that are greater than or equal to the primary key ID located at the left boundary. The sampled primary key IDs must simultaneously meet two conditions:
[0065] 1. The data portion corresponding to the sampled primary key ID is not null;
[0066] 2. The primary key ID of the sample is the smallest among the non-null primary key IDs in the data part.
[0067] Step 103: Repeat the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and sampling the surviving data table intervals after the re-division operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number.
[0068] The target number can be the product of the minimum sampling rate and the span of the primary key ID in the table.
[0069] After obtaining multiple primary key IDs through sampling, one or more surviving data table intervals can be determined based on the multiple primary key IDs obtained through sampling. A surviving data table interval is one in which there is definitely a non-nullable data part corresponding to the primary key ID. This initially eliminates non-surviving data table areas where all data parts are nullable, thus narrowing down the sampling area.
[0070] After determining the surviving data table interval, the surviving data table interval can be further divided. Then, the surviving data table interval after the further division can be sampled to obtain multiple primary key IDs. The primary key IDs obtained by sampling are the primary key IDs corresponding to non-null values in the data part.
[0071] After multiple samplings, the cumulative number of primary key IDs obtained from sampling can be calculated. If the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number, sampling can be stopped. If the cumulative number of primary key IDs obtained from sampling is less than the target number, step 103 needs to be repeated, that is, based on the multiple primary key IDs obtained from sampling the surviving data table intervals after the re-partitioning operation, one or more surviving data table intervals are re-determined, and then the re-partitioning operation and sampling are repeated.
[0072] After obtaining a number of primary key IDs greater than or equal to the target number, the data table content can be previewed based on the collected primary key IDs, providing a more accurate description of the data storage content over the entire time range.
[0073] In one embodiment of the present invention, determining one or more live data table ranges based on multiple primary key IDs obtained from current sampling may include:
[0074] Determine multiple candidate data table intervals where the multiple primary key IDs obtained from the current sampling reside; for each candidate data table interval, determine the non-surviving data table intervals from the primary key ID located at the boundary to the sampled primary key ID, and determine the data table intervals in the candidate data table intervals other than the non-surviving data table intervals as the surviving data table intervals.
[0075] Specifically, since the sampled primary key ID is the primary key ID whose corresponding data part is a non-null value, if there is a sampled primary key ID in a certain data table interval, it indicates that there must be a primary key ID in that data table interval whose data part is a non-null value. This eliminates data table areas where all data parts are null values, and preliminarily determines multiple candidate data table intervals.
[0076] After determining the candidate data table intervals, since the sampled primary key ID is the smallest among the non-null primary key IDs in the data portion, if it is not a primary key ID located at the boundary (the smaller left boundary), it indicates that the data portion between the primary key ID located at the boundary and the sampled primary key ID is null, i.e., a non-surviving data table interval. After determining the non-surviving data table intervals, the data table intervals other than the non-surviving data table intervals in the candidate data table intervals can be determined as surviving data table intervals.
[0077] In one embodiment of the present invention, the re-division operation can be a binary search operation. Before sampling the surviving data table interval after the re-division operation to obtain multiple primary key IDs, the following may also be included:
[0078] Perform a binary search operation on each surviving data table interval; sample the primary key ID located in the middle, and use the primary key ID located in the middle as the primary key ID located at the boundary, to obtain the surviving data table interval after the binary search operation.
[0079] In a specific implementation, a binary search operation can be performed on the surviving data table interval, that is, it can be divided into two equal parts. The primary key ID located in the middle can be used as the primary key ID located at the boundary. That is, it can be used as the left boundary (smaller) of the surviving data table interval after the binary search operation, while its right boundary (larger) is still the right boundary of the surviving data table interval.
[0080] In one embodiment of the present invention, the sampling of the surviving data table intervals after the re-division operation to obtain multiple primary key IDs includes:
[0081] For the surviving data table interval after the re-division operation, determine the primary key ID located at the boundary; within the surviving data table interval after the re-division operation, sample the smallest primary key ID from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0082] For each data table interval, there are two boundaries: a right boundary (larger) and a left boundary (smaller). The primary key ID located at the left boundary can be determined, which is the smallest primary key ID.
[0083] After determining the primary key ID located at the boundary, sampling can be performed on primary key IDs that are greater than or equal to the primary key ID located at the boundary, and the sampled primary key IDs must simultaneously meet two conditions:
[0084] 1. The data portion corresponding to the sampled primary key ID is not empty;
[0085] 2. The primary key ID sampled is the smallest among the non-null primary key IDs in the data portion.
[0086] In this embodiment of the invention, a target data table is obtained. The primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID. Then, an initial partitioning operation is performed on the target data table to obtain multiple initial data table intervals. For each initial data table interval, sampling is performed to obtain multiple primary key IDs. The process is repeated to determine one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling. The surviving data table intervals after the second partitioning operation are then sampled to obtain multiple primary key IDs until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number. This achieves uniform and fast sampling of the data table, solves the problem of uneven distribution of primary key IDs, ensures that the data table sampled based on the primary key ID is non-nullable, and reduces the impact of nullables in the data table on the sampling results.
[0087] Specifically, compared with the prior art, the embodiments of the present invention have the following beneficial effects:
[0088] 1. Since the primary key ID in the data table is not strictly auto-incrementing at equal intervals, the sampling scheme using random primary key IDs is likely to obtain a large number of null values. However, the embodiments of the present invention solve the problem that the sampling scheme using random primary key IDs cannot handle the sparse distribution of primary key IDs by eliminating null values.
[0089] 2. By making full use of the primary key index, leveraging its sequential nature and InnoDB's physical storage method, it achieves a relatively fast execution speed.
[0090] 3. Since many calculations are performed by the client program after being retrieved from the database, the database only performs a few primary key ID searches, which reduces the pressure on the database.
[0091] 4. It can obtain the approximate range of primary key ID distribution, and can intuitively show the deletion and modification status of data since the table was created. After excluding the obvious empty range of primary key IDs, it can combine the maximum primary key ID to obtain the upper bound of the number of table rows, which is much more efficient than executing count.
[0092] Reference Figure 2 The diagram illustrates a structural schematic of another data table sampling method provided by an embodiment of the present invention, which may specifically include the following steps:
[0093] Step 201: Obtain the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID.
[0094] Step 202: Perform an equal-interval partitioning operation on the target data table to obtain multiple initial data table intervals.
[0095] Step 203: For each initial data table range, determine the primary key ID located at the boundary.
[0096] Step 204: Within each initial data table interval, sample the smallest primary key ID from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0097] Step 205: Determine the range of candidate data tables containing the multiple primary key IDs obtained from the current sampling.
[0098] Step 206: For each candidate data table interval, determine the non-surviving data table intervals from the primary key ID located at the boundary to the sampled primary key ID, and determine the data table intervals other than the non-surviving data table intervals in the candidate data table intervals as surviving data table intervals.
[0099] Step 207: Perform a binary search operation on each interval of the surviving data table.
[0100] Step 208: Sample the primary key ID located in the middle and use the primary key ID located in the middle as the primary key ID located at the boundary to obtain the surviving data table interval after the binary search operation.
[0101] Step 209: For the surviving data table interval after the binary search operation, determine the primary key ID located at the boundary.
[0102] Step 210: Within the surviving data table interval after the binary search operation, sample the smallest primary key ID from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0103] Step 211: Repeat steps 205 to 210 until the number of primary key IDs obtained by cumulative sampling is greater than or equal to the target number.
[0104] The following combination Figures 3a to 3c For example:
[0105] In this example, the target data table is `my_table`, and its primary key index is `ID`, which is of integer type. The minimum primary key ID value in the table is found using `select min(ID) from my_table` and set as `minID`. Then, the maximum primary key ID value is found using `select max(ID) from my_table` and set as `maxID`. The sampling process selects the primary key ID between `minID` and `maxID`, directly locating and storing the data through the primary key index table without performing a table lookup operation.
[0106] For ease of description, several variables are defined: the primary key ID span range = maxID - minID + 1, the minimum sampling rate p (p << 1) for a table, and the maximum number of samples m. Setting the maximum number of samples can be used to limit the number of returned results for very large tables.
[0107] In this example, range is 64, p is 12.5%, and xm is 100 (xm > p * range).
[0108] exist Figures 3a to 3c In the table, each grid corresponds to a primary key ID. The primary key IDs increase by 1 from left to right. The white grid area represents that the data part corresponding to the primary key ID is null, and the gray grid area represents that the data at the corresponding primary key ID is not null. In other words, the white and gray of the grid represent the correspondence between whether the data stored in the actual data table is null and its primary key ID.
[0109] in, Figure 3a This is the data table before sampling. This table effectively simulates a production table that has undergone multiple modifications and deletions since its creation. The primary key ID has many gaps. Figures 3b to 3c The data tables for the first and second sampling are respectively, and the initial partitioning operation (equal interval partitioning operation) and the subsequent partitioning operation (binary search operation) are performed through... Figure 3b , Figure 3c The vertical lines are used to indicate that the data is divided into multiple data table intervals.
[0110] 1. Divide the data table my_table into equal intervals to obtain multiple initial data table intervals.
[0111] If p*range<=xm, divide the primary key ID in the table into p*range (rounded up) left-closed and right-open data table intervals (i.e., the initial data table intervals). That is, the data table interval includes the left boundary but does not include the right boundary. The starting ID of the left boundary of each interval is minID+n*(1 / p).
[0112] If p*range > xm, then the primary key ID in the table is divided into xm data table intervals at equal intervals (i.e., the initial data table intervals). The starting ID of the left boundary of each data table interval is minID + n*range / xm. Here, n is the interval number starting from zero, and 0 <= n <= p*range - 1.
[0113] Since the subsequent sampling process is the same for both, we assume that p*range<=xm and divide the primary key ID in the table into p*range (rounded up) left-closed and right-open data table intervals at equal intervals.
[0114] After performing an equal-interval partitioning operation on the data table, the data table is divided into p*range = 8 intervals, such as... Figure 3b The vertical lines divide the data table into 8 areas.
[0115] 2. For the data table my_table, perform the first sampling to obtain multiple primary key IDs.
[0116] When sampling each initial data table interval, the smallest ID value in the current data table interval that is greater than or equal to the primary key ID located at the left boundary is taken (that is, in each initial data table interval, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data part is not empty). This ensures that a value is taken in each interval (if the interval is not completely empty) and that a value can be obtained.
[0117] Specifically, the SQL statement can be used as follows:
[0118] select ID from my_table where ID in((select min(ID) from Table whereID>= minID ),( select min(ID) from Table where ID>= (minID+1 / p)),...(selectmin(ID) from Table where ID>= (p*range-1 / p)).
[0119] like Figure 3b The primary key ID obtained from the first sampling is identified. Based on the six different IDs returned from the first sampling, the approximate range of non-empty data corresponding to the primary key ID is calculated.
[0120] If all data in a certain data table interval is null, the primary key ID of the sampled data will be set to be the same as the primary key ID of the first data table interval that follows, in order to determine whether all data in a certain data table interval is null.
[0121] If the primary key IDs in the table are relatively evenly distributed, and the data portion corresponding to each primary key ID in each partitioned data table interval is not entirely empty, then the number of distinct primary key IDs returned is equal to p*range, which can be directly used as the return result. If some data table intervals contain all empty values, then the number of distinct IDs returned is less than p*range, and in this case, that interval needs to be excluded from further sampling.
[0122] 3. Based on the multiple primary key IDs obtained from the first sampling, determine one or more surviving data table intervals from multiple initial data table intervals.
[0123] After obtaining the multiple primary key IDs returned, iterate through the returned primary key IDs.
[0124] If it falls within the left-closed, right-open interval [minID+n*(1 / p), minID+(n+1)*(1 / p)) divided before the first sampling (i.e., the initial data table interval), then the data table interval may have "surviving" primary key IDs starting from this primary key ID. This data table interval can be a candidate data table interval (i.e., determine the multiple candidate data table intervals where the multiple primary key IDs obtained from the current sampling are located). Surviving means that there is any data in any data table corresponding to any ID within a range of IDs.
[0125] If no primary key ID is found within a certain data table range after traversal, then the data portion of this data table range must be completely empty, meaning that the entire data table range is a non-living data table range.
[0126] For candidate data table intervals, since the minimum ID value of the primary key ID in the entire interval that is greater than or equal to the primary key ID located on the left boundary of the interval is taken, if it is not the primary key ID located on the boundary (the smaller left boundary), it indicates that the data part between the primary key ID located on the boundary and the sampled primary key ID is null, that is, a non-surviving data table interval (that is, for each candidate data table interval, the non-surviving data table interval between the primary key ID located on the boundary and the sampled primary key ID is determined).
[0127] It is obvious that after the first sampling, it can be determined that the data portion within the interval consists entirely of non-surviving control regions. Figure 3b As shown by the double arrow in the middle.
[0128] After determining the non-surviving data table intervals, the data table intervals other than the non-surviving data table intervals in the candidate data table intervals can be identified as the surviving data table intervals.
[0129] For ease of description, let the first six primary key IDs be a1nx, where a1 is the code of the first sampling result, n is the interval number starting from zero, and x is the left-to-right sequence number of the "survival" interval. In this example, the six ID codes are a101, a112, a133, a144, a165, and a176.
[0130] like Figure 3b In the sampling of the seventh interval, if a1nx is exactly equal to minID+(n+1)*(1 / p)-1, it is excluded from the survival interval because subsequent binary search cannot be performed. The survival intervals after filtering are [a101, minID+(0+1)*(1 / p)=9), [a112, minID+(1+1)*(1 / p)=17)……[a176, minID+(7+1)*(1 / p)=65), which is... Figure 3b The area not covered by the double arrows.
[0131] 4. Sample the surviving data table intervals after the re-division operation to obtain multiple primary key IDs.
[0132] After performing a bisection operation on the survivor data table intervals, the midpoint value can be sampled to bisect each survivor data table interval (i.e., perform a bisection operation on each survivor data table interval; sample the primary key ID located in the middle, and use the primary key ID located in the middle as the primary key ID located on the boundary, thus obtaining the survivor data table interval after the bisection operation), such as... Figure 3c The data table after the binary search operation is shown below. The surviving data table intervals are then divided into two halves using vertical lines. The second sampling result is as follows: Figure 3c As indicated by the label, we can see several new empty intervals obtained after the second sampling, which are regions where the data portion is entirely empty.
[0133] For ease of description, let the midpoint of the interval, *ex*, be (minID + (n + 1) * (1 / p) - a1nx) / 2 rounded up, where *x* is the left-to-right index of the survival interval. Whether to take the value at the midpoint of the survival data table interval, or to take the minimum value greater than or equal to the midpoint ID (i.e., within the survival data table interval after the re-division operation, sample the minimum primary key ID from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty), the following SQL statement can be used:
[0134] select ID from Table where ID in((select min(ID) from Table where ID>= e1),(select min(ID) from Table where ID>= e2),...(select min(ID) fromTable where ID>= ex).
[0135] 5. Determine whether the number of primary key IDs obtained from the cumulative sampling is greater than or equal to the target number. If yes, stop sampling; otherwise, repeat sampling.
[0136] At this point, calculate the total number of different IDs collected. If the total number reaches p*range (the target number), there is no need to continue dividing the data into intervals. If the target number has not been reached, repeat the judgment of the survival interval and continue to collect the midpoint value of the bisection in the survival data table interval.
[0137] In this example, such as Figure 3c The target number is p*range = 8. The number of distinct IDs after two samplings is 10, reaching the sampling rate, and the sampling ends. If the number of samples is less than the target (greater than or equal to p*range), the bisection and sampling process is repeated.
[0138] pass Figures 3a to 3c As can be seen intuitively, if there are a large number of empty intervals in the table, the sampling position will eventually be located in the part with the densest data until the final collection result is obtained.
[0139] The following combination Figures 4a to 4d An example illustration of sampling under extreme conditions:
[0140] In this example, range=64, p=1.5, xm=8.
[0141] exist Figures 4a to 4d In the table, each grid corresponds to a primary key ID. The primary key IDs increase by 1 from left to right. The white grid area represents that the data part corresponding to the primary key ID is null, and the gray grid area represents that the data at the corresponding primary key ID is not null. In other words, the white and gray of the grid represent the correspondence between whether the data stored in the actual data table is null and its primary key ID.
[0142] in, Figure 4a This is the data table before sampling. Figures 4b to 4d The data tables represent the data from the first, second, and third sampling processes, respectively. The results of the fourth sampling are omitted because the process is the same. The initial partitioning operation (equal-interval partitioning operation) and the subsequent partitioning operation (binary partitioning operation) are performed using... Figures 4b to 4dThe vertical lines are used to indicate that the data is divided into multiple data table intervals.
[0143] After the first sampling, such as Figure 4b Only two data table intervals are non-surviving data table intervals (the rightmost boundary of the interval does not belong to the interval), and the rest of the data table intervals are surviving data table intervals.
[0144] Perform a binary search on the survival data table interval, and then perform a second sampling (e.g.) Figure 4c ), third sampling (e.g. Figure 4d By continuously eliminating new non-surviving data table intervals, the size of the data table interval is eventually reduced to 1, which is the smallest grid cell.
[0145] During the fourth sampling, since the data table interval has been divided into the smallest grid unit, a new "live" ID is obtained from the newly divided interval, achieving the goal of sampling 8 records and returning.
[0146] As can be seen, since the primary key IDs with non-null values in the data portion are concentrated at the leftmost (i.e., smallest) of each data table interval, in the second sampling (such as... Figure 4c ), third sampling (e.g. Figure 4d If no new primary key ID is sampled during the initial sampling, and empty intervals cannot be completely eliminated after the binary search operation, the data table interval is divided into the smallest grid unit, i.e., the interval length is 1. Then, a fourth sampling is performed, and a new primary key ID can be sampled. Of course, if a new primary key ID is still not obtained after reducing the interval to 1, it means that there are fewer than 8 records in the table, and the program will return.
[0147] At this point, the total number of samples is Log2(range / xm)*(xm-2)+1=19. The maximum number of interactions with the database is 1 sample of the maximum and minimum IDs + 1 sample of the first equally spaced intervals + 5 samples of the intervals until enough values are sampled (log2(range / xm) times). In other words, the worst-case scenario only occurs when all the data in the table exists exactly at the left endpoint of the interval after dividing the maximum and minimum primary key IDs into xm equal parts (i.e., the minimum value of each table interval).
[0148] Therefore, in this embodiment of the invention, since even in extreme cases, the number of samplings and the number of interactions with the database only increase logarithmically, this embodiment of the invention does not impose any additional restrictions, so that each table will be sampled uniformly and completely.
[0149] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0150] Reference Figure 5 The diagram shows a structural schematic of a data table sampling device according to an embodiment of the present invention, which may specifically include the following modules:
[0151] The target data acquisition module 501 is used to acquire the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID.
[0152] The initial partitioning module 502 is used to perform an initial partitioning operation on the target data table to obtain multiple initial data table intervals, and to sample each initial data table interval to obtain multiple primary key IDs.
[0153] The repeat execution module 503 is used to repeatedly execute the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and sampling the surviving data table intervals after the re-division operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number.
[0154] In one embodiment of the present invention, determining one or more live data table ranges based on multiple primary key IDs obtained from current sampling includes:
[0155] Determine the range of multiple candidate data tables containing the multiple primary key IDs obtained from the current sampling;
[0156] For each candidate data table interval, the non-surviving data table intervals between the primary key ID located at the boundary and the sampled primary key ID are determined, and the data table intervals other than the non-surviving data table intervals in the candidate data table intervals are determined as surviving data table intervals.
[0157] In one embodiment of the present invention, the sampling for each initial data table interval to obtain multiple primary key IDs includes:
[0158] For each initial data table range, determine the primary key ID located at the boundary;
[0159] Within each initial data table interval, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0160] In one embodiment of the present invention, the initial partitioning operation is an equally spaced partitioning operation, wherein the initial partitioning operation on the target data table to obtain multiple initial data table intervals includes:
[0161] The target data table is divided into multiple initial data table intervals by performing an equally spaced partitioning operation.
[0162] In one embodiment of the present invention, the sampling of the surviving data table intervals after the re-division operation to obtain multiple primary key IDs includes:
[0163] For the surviving data table range after the re-division operation, determine the primary key ID located at the boundary;
[0164] Within the surviving data table interval after the re-segmentation operation, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
[0165] In one embodiment of the present invention, the device is further configured to:
[0166] Perform a binary search operation on each interval of the survival data table;
[0167] Sample the primary key ID located in the middle and use the primary key ID located in the middle as the primary key ID located at the boundary to obtain the surviving data table interval after the binary search operation.
[0168] In one embodiment of the present invention, the target data table is a data table in a MySQL database based on the InnoDB storage engine.
[0169] In this embodiment of the invention, a target data table is obtained. The primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID. Then, an initial partitioning operation is performed on the target data table to obtain multiple initial data table intervals. For each initial data table interval, sampling is performed to obtain multiple primary key IDs. The process is repeated to determine one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling. The surviving data table intervals after the second partitioning operation are then sampled to obtain multiple primary key IDs until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number. This achieves uniform and fast sampling of the data table, solves the problem of uneven distribution of primary key IDs, ensures that the data table sampled based on the primary key ID is non-nullable, and reduces the impact of nullables in the data table on the sampling results.
[0170] An embodiment of the present invention also provides an electronic device, which may include a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When the computer program is executed by the processor, it implements the sampling method of the data table as described above.
[0171] An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, it implements the sampling method of the data table as described above.
[0172] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.
[0173] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0174] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0175] Embodiments of the present invention are described with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, generate instructions for implementing the flowchart illustrations and / or block diagrams. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0176] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1 One or more processes and / or boxes Figure 1 The function specified in one or more boxes.
[0177] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, thereby providing instructions that execute on the computer or other programmable terminal equipment for implementing the process. Figure 1 One or more processes and / or boxes Figure 1 The steps of the function specified in one or more boxes.
[0178] Although preferred embodiments of the present invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present invention.
[0179] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0180] The above provides a detailed description of a data table sampling method and apparatus. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, those skilled in the art will recognize that there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A sampling method for a data table, characterized in that, The method includes: Obtain the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID; The target data table is initially partitioned to obtain multiple initial data table intervals, and samples are taken for each initial data table interval to obtain multiple primary key IDs; the target data table is a data table in a MySQL database. Repeatedly execute the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and sampling the surviving data table intervals after the re-segmentation operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number. The target number is the product of the minimum sampling rate and the span of the primary key ID in the table. The sampling of the surviving data table intervals after the re-division operation yields multiple primary key IDs, including: For the surviving data table range after the re-division operation, determine the primary key ID located at the boundary; Within the surviving data table interval after the re-division operation, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key IDs located at the boundary and whose corresponding data portions are not empty; The initial partitioning operation is an equally spaced partitioning operation, and the subsequent partitioning operation is a binary partitioning operation.
2. The method according to claim 1, characterized in that, The step of determining one or more live data table ranges based on multiple primary key IDs obtained from the current sampling includes: Determine the range of multiple candidate data tables containing the multiple primary key IDs obtained from the current sampling; For each candidate data table interval, the non-surviving data table intervals between the primary key ID located at the boundary and the sampled primary key ID are determined, and the data table intervals other than the non-surviving data table intervals in the candidate data table intervals are determined as surviving data table intervals.
3. The method according to claim 1 or 2, characterized in that, The sampling process for each initial data table range yields multiple primary key IDs, including: For each initial data table range, determine the primary key ID located at the boundary; Within each initial data table interval, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key ID located at the boundary and whose corresponding data portion is not empty.
4. The method according to claim 3, characterized in that, The initial partitioning operation on the target data table yields multiple initial data table intervals, including: The target data table is divided into multiple initial data table intervals by performing an equally spaced partitioning operation.
5. The method according to claim 1, characterized in that, Before sampling the surviving data table intervals after the re-segmentation operation to obtain multiple primary key IDs, the process also includes: Perform a binary search operation on each interval of the survival data table; Sample the primary key ID located in the middle and use the primary key ID located in the middle as the primary key ID located at the boundary to obtain the surviving data table interval after the binary search operation.
6. The method according to claim 1, characterized in that, The target data table is a data table in a MySQL database based on the InnoDB storage engine.
7. A sampling device for a data table, characterized in that, The device includes: The target data acquisition module is used to acquire the target data table; wherein the primary key index and data in the target data table are stored in a primary key index structure, and the data in the primary key index are arranged and stored in order of the size of the corresponding primary key ID; The initial partitioning module is used to perform initial partitioning operations on the target data table to obtain multiple initial data table intervals, and to sample each initial data table interval to obtain multiple primary key IDs; the target data table is a data table in a MySQL database. The repetitive execution module is used to repeatedly execute the process of determining one or more surviving data table intervals based on the multiple primary key IDs obtained from the current sampling, and sampling the surviving data table intervals after the re-division operation to obtain multiple primary key IDs, until the cumulative number of primary key IDs obtained from sampling is greater than or equal to the target number; wherein, the target number is the product of the minimum sampling rate and the size of the table primary key ID span; The device is also used to determine the primary key ID located at the boundary for the surviving data table interval after the re-division operation; Within the surviving data table interval after the re-division operation, the smallest primary key ID is sampled from the primary key IDs that are greater than or equal to the primary key IDs located at the boundary and whose corresponding data portions are not empty; The device is further configured such that the initial division operation is an equally spaced division operation, and the subsequent division operation is a binary division operation.
8. An electronic device, characterized in that, It includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the sampling method of the data table as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, A computer program is stored on the computer-readable storage medium, which, when executed by a processor, implements the sampling method for the data table as described in any one of claims 1 to 6.