[0083] Refer to Figure 3, which shows a flowchart of Embodiment 2 of a data processing method provided by the present invention. Method Embodiment 2 is based on Method Embodiment 1 and further includes the following steps:
[0084] Step 301: Receive a data query request, where the data query request includes query conditions.
[0085] After the data query request input by the user is received, the query conditions included in the data query request are parsed.
[0086] Step 302: Apply the hash algorithm to the query condition to obtain the query key value.
[0087] Note that the hash algorithm applied to the query condition must be the hash algorithm described in Method Embodiment 1 of the present invention.
[0088] Step 303: Search the key values in the data set for a key value that matches the query key value. If one is found, go to step 304; otherwise, go to step 305.
[0089] The result of applying the hash algorithm to the query condition is used as the query key value, and the query key value is matched against the data set. That is, it is determined whether the data set holding the key values of the stored data contains a key value that matches the query key value, and thereby whether data satisfying the data query request has already been stored.
[0090] Step 304: Find the column data corresponding to the query key value in the stored column data, and return the found column data together with the row data corresponding to that column data.
[0091] When the data set contains a key value that matches the query key value of the query condition, the data to be queried has been stored. The column data corresponding to the query key value is therefore looked up in the stored column data, and the found column data is returned together with the row data corresponding to it.
[0092] Step 305: End this data query.
[0093] When the data set does not contain a key value that matches the query key value of the query condition, the data to be queried has not been stored, and the data query is ended.
[0094] In summary, the query condition is hashed with the hash algorithm described in Method Embodiment 1 of the present invention, the result is used as the query key value, and data matching is performed in the data set according to the query key value. When the data set contains a key value that matches the query key value, data satisfying the data query request has been stored in a columnar storage method; the stored column data is then searched for the column data corresponding to the query key value, and the found column data is returned together with the row data corresponding to it. When the data set does not contain a key value that matches the query key value, the data to be queried has not been stored, and the data query is ended.
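Steps 301 to 305 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: SHA-256 stands in for the unspecified hash algorithm, each stored cell value is hashed individually, and the names `hash_key`, `key_index`, and `query` are hypothetical.

```python
import hashlib

def hash_key(value: str) -> str:
    """Hash a value to a key. SHA-256 is an assumed choice of hash algorithm."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Hypothetical stored rows; in the method these would be held in columnar form.
rows = [
    {"name": "alice", "city": "Paris"},
    {"name": "bob", "city": "Oslo"},
]

# Data set of keys: key value -> list of (column name, row number).
key_index = {}
for row_no, row in enumerate(rows):
    for column, value in row.items():
        key_index.setdefault(hash_key(value), []).append((column, row_no))

def query(condition: str):
    query_key = hash_key(condition)      # step 302: same hash as used for storage
    hits = key_index.get(query_key)      # step 303: look for a matching key value
    if hits is None:
        return None                      # step 305: nothing stored, end the query
    # step 304: return the column data and the row data it belongs to
    return [(column, rows[row_no][column], rows[row_no]) for column, row_no in hits]
```

Because the lookup in step 303 is a key comparison, a query for data that was never stored ends without reading any column data.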
[0095] It can be seen from the above that Embodiment 2 of the data processing method provided by the present invention is based on Method Embodiment 1. The data is deduplicated and then stored in a columnar storage method, which avoids data redundancy and saves storage space. Further, when data query statistics are performed, the query key value is first calculated with the hash algorithm, and it is then determined whether the query key value corresponding to the query condition is in the data set, so as to determine whether the data to be queried exists in the stored data. When the stored data does not contain the data to be queried, the query is ended immediately, which improves query efficiency. When the data to be queried is stored, the stored column data is read and queried, which reduces the distance the disk head has to move and improves the performance of data query statistics.
[0096] Refer to Figure 4, which shows a flowchart of Embodiment 3 of a data processing method provided by the present invention. Based on Method Embodiment 1 of the present invention, the method may further include the following steps:
[0097] Step 401: Analyze the data attributes of the stored data, and set data nodes corresponding to the stored data according to the data attributes.
[0098] The data nodes correspond one-to-one with the stored data, and each data node records statistical information about the stored data, such as a start identifier, an end identifier, a maximum value, and/or a minimum value.
[0099] Step 402: Receive a data query request, where the data query request includes query conditions.
[0100] Step 403: According to the query conditions, the data nodes are divided into related nodes, suspicious nodes, and irrelevant nodes.
[0101] The query conditions can carry information related to the statistical information in the data nodes, and the data nodes are divided according to this information into three types:
[0102] Data nodes that meet the query conditions are regarded as related nodes;
[0103] Data nodes that partially meet the query conditions are regarded as suspicious nodes;
[0104] Data nodes that do not meet the query conditions are regarded as irrelevant nodes.
[0105] Step 404: Query the suspicious nodes for the data nodes corresponding to the query condition and return the stored data corresponding to those data nodes; at the same time, return the stored data corresponding to the related nodes.
[0106] As follows from step 403, suspicious nodes only partially satisfy the query condition, so the suspicious nodes are queried for the data nodes corresponding to the query condition, and the stored data corresponding to those data nodes is returned. All data nodes among the related nodes satisfy the query condition, so the stored data corresponding to the related nodes is returned directly.
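The node division of steps 401 to 404 can be illustrated with a short Python sketch. It assumes, for illustration only, that each data node covers a fixed-size block of numeric values, that the start and end identifiers are list indexes, and that the query condition is a closed range `[low, high]`; `DataNode`, `build_nodes`, `classify`, and `range_query` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class DataNode:
    start: int     # start identifier: index of the first value in the block
    end: int       # end identifier: index one past the last value in the block
    minimum: int   # smallest value in the block
    maximum: int   # largest value in the block

def build_nodes(values, block_size):
    """Step 401: one data node per block, recording its statistics."""
    nodes = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        nodes.append(DataNode(start, start + len(block), min(block), max(block)))
    return nodes

def classify(node, low, high):
    """Step 403: divide nodes using only the recorded minimum and maximum."""
    if node.maximum < low or node.minimum > high:
        return "irrelevant"    # no value in the block can match
    if low <= node.minimum and node.maximum <= high:
        return "related"       # every value in the block matches
    return "suspicious"        # only some values may match

def range_query(values, nodes, low, high):
    """Step 404: return related blocks whole, filter suspicious blocks."""
    result = []
    for node in nodes:
        kind = classify(node, low, high)
        block = values[node.start:node.end]
        if kind == "related":
            result.extend(block)
        elif kind == "suspicious":
            result.extend(v for v in block if low <= v <= high)
    return result
```

Only suspicious blocks are examined value by value; irrelevant blocks are never read, which is where the reduction in the number of data queries comes from.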
[0107] It can be seen from the above that Embodiment 3 of the data processing method provided by the present invention divides the stored data into blocks according to the data query request, that is, the data query is performed only on the data blocks that meet the data query request. This reduces the number of data queries and thereby improves the performance of data query statistics.
[0108] It should be noted that, based on the above-mentioned method embodiment 2 or method embodiment 3 of the present invention, the method may further include: storing the returned data in a preset cache data set.
[0109] After a successful data query is completed, the returned data is stored as cache data in the cache data set. When a data query request is received again, it is first determined whether the cache data set contains data corresponding to the data query request. If so, data meeting the data query request has already been queried, and the data query is performed directly in the cache data set. If not, data meeting the data query request has not yet been queried, and the data query is performed in the stored data.
[0110] With the above technical solution, the queried data is cached, and the next query first checks whether the cached data contains data that meets the data query request. Only when the cached data does not contain such data is the data query performed on the stored data. This avoids repeatedly querying the same data and thereby improves the efficiency of data query.
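The cache lookup described above might look like the following Python sketch. It assumes the query condition itself can serve as the cache key and that the stored-data query is available as a callable; `CachedQuery`, `backend`, and `backend_calls` are hypothetical names introduced for illustration.

```python
class CachedQuery:
    def __init__(self, backend):
        self.backend = backend    # callable that queries the stored data
        self.cache = {}           # the preset cache data set
        self.backend_calls = 0    # counts queries that reached the stored data

    def query(self, condition):
        # First check whether the cache data set already holds the answer.
        if condition in self.cache:
            return self.cache[condition]
        # Otherwise query the stored data.
        self.backend_calls += 1
        result = self.backend(condition)
        # Only data returned by a successful query is cached.
        if result is not None:
            self.cache[condition] = result
        return result
```

In this sketch a failed query is not cached, so a repeated query for missing data reaches the stored data each time; whether to cache such misses as well is a design choice the text leaves open.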
[0111] Refer to Figure 5, which shows a flowchart of Embodiment 4 of a data processing method provided by the present invention. Based on the foregoing method embodiment of the present invention, the following steps may further be included before the hash algorithm calculation is performed on the data to be stored:
[0112] Step 501: Store the data to be stored in a preset memory.
[0113] During data storage, the data to be stored is first placed in the memory, and the data in the memory is then deduplicated and stored.
[0114] Steps 502 to 505 are the same as steps 101 to 104 of the method embodiment of the present invention, respectively, and are not described again.
[0115] It can be seen from the above that Embodiment 4 of the data processing method provided by the present invention first stores the data to be stored in a memory with a faster read and write speed, and then performs deduplication and storage on the data in the memory. The high read and write performance of the memory thereby greatly improves data throughput and data processing speed during data storage, especially for mass data storage.
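The staging of step 501 can be sketched in Python as a small in-memory buffer that hands its contents to the deduplicating store in batches. The flush threshold and the names `MemoryBuffer`, `sink`, and `capacity` are assumptions made for illustration; `sink` stands in for steps 502 to 505.

```python
class MemoryBuffer:
    def __init__(self, sink, capacity):
        self.sink = sink            # callable standing in for steps 502 to 505
        self.capacity = capacity    # assumed number of records that triggers a flush
        self.buffer = []            # the preset memory of step 501

    def put(self, record):
        """Step 501: accept the record into memory first."""
        self.buffer.append(record)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        """Pass every buffered record on for deduplication and storage."""
        for record in self.buffer:
            self.sink(record)
        self.buffer.clear()
```

Writers only touch the in-memory list, so the cost of hashing and columnar storage is paid in batches rather than on every incoming record.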
[0116] Refer to Figure 6, which shows a schematic structural diagram of Embodiment 1 of a data processing device provided by the present invention, which is used to implement Method Embodiment 1 of the present invention described above. The device may include:
[0117] The data calculation unit 601 is configured to perform hash algorithm calculation on the data to be stored to obtain the key value of the data to be stored.
[0118] When performing the hash algorithm calculation, the data calculation unit 601 may select one hash function to calculate on the data to be stored. To improve the accuracy of data deduplication and of data query, multiple hash functions may instead be selected, each applied separately to the data to be stored, with the calculation results used as the key value of the data to be stored. It should be noted that the data to be stored may be a single piece of data or a data block composed of multiple pieces of data.
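One way to combine several hash functions into a single key value is sketched below in Python. The choice of SHA-256 and BLAKE2b and the name `composite_key` are assumptions for illustration; the text does not name specific hash functions.

```python
import hashlib

def composite_key(data: bytes, hash_functions=(hashlib.sha256, hashlib.blake2b)) -> tuple:
    """Apply each hash function separately and use the tuple of digests as the key."""
    return tuple(fn(data).hexdigest() for fn in hash_functions)
```

Two different inputs would have to collide under every selected hash function at once to produce the same key, which is why using several functions lowers the chance of wrongly treating distinct data as duplicates.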
[0119] The data search unit 602 is configured to receive the key value sent by the data calculation unit 601 and to search a preset data set for a key value matching it.
[0120] The data deduplication unit 603 is configured to discard the data to be stored when the data search unit 602 finds a key value matching the key value in the preset data set.
[0121] The data storage unit 604 is configured to, when the data search unit 602 does not find a key value matching the key value in the preset data set, store the data to be stored in a columnar storage method and store the key value of the data to be stored in the data set.
[0122] The data search unit 602 uses the key value calculated by the data calculation unit 601 to perform data matching in the data set, that is, to determine whether the data set holding the key values of the stored data contains a key value that matches the key value of the data to be stored, and thereby whether the data to be stored has already been stored.
[0123] When the data search unit 602 finds in the data set a key value matching the key value of the data to be stored, the data to be stored is judged to have been stored already, and the data deduplication unit 603 discards it. When the data search unit 602 finds no matching key value in the data set, the data to be stored is judged not to have been stored, and the data storage unit 604 stores it in a columnar storage method and stores its key value in the data set.
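The cooperation of units 601 to 604 can be sketched as a small Python class. This is an illustration under assumptions: SHA-256 stands in for the hash algorithm, a plain list stands in for the columnar store, and the names `DedupStore`, `calculate`, and `put` are hypothetical.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.data_set = set()   # preset data set holding key values of stored data
        self.stored = []        # stand-in for the columnar store

    def calculate(self, data: str) -> str:
        """Role of the data calculation unit 601 (SHA-256 is an assumed choice)."""
        return hashlib.sha256(data.encode("utf-8")).hexdigest()

    def put(self, data: str) -> bool:
        key = self.calculate(data)
        if key in self.data_set:     # data search unit 602 finds a matching key
            return False             # data deduplication unit 603 discards the data
        self.stored.append(data)     # data storage unit 604 stores the data
        self.data_set.add(key)       # and records its key value in the data set
        return True
```

`put` returns `True` when the data was stored and `False` when it was discarded as a duplicate.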
[0124] The data storage unit 604 is specifically configured to convert the data structure of the data to be stored into the data structure of a two-dimensional data table, obtain the non-empty column data of each column in the two-dimensional data table, and store the obtained non-empty column data according to the data address order of the columns in the two-dimensional data table.
[0125] It should be noted that data is stored logically in the data structure of a two-dimensional data table, whereas on disk it is stored as a one-dimensional sequence of bytes. The row storage method commonly used in the prior art concatenates the data values of each row of the two-dimensional data table and writes them to disk, then writes the next row, and so on. When the two-dimensional data table contains empty data, however, the empty data is also written to disk.
[0126] In the device embodiment of the present invention, the data storage unit 604 adopts a columnar storage method: the non-empty column data of each column in the two-dimensional data table is obtained, and the obtained non-empty column data is stored sequentially according to the data addresses of the column data in the two-dimensional data table.
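The difference between the two layouts can be illustrated with a short Python sketch. It assumes that empty cells are represented by `None` and that each stored value keeps its row number so that rows can be reassembled later; both are assumptions made for illustration, as are the names `row_layout` and `column_layout`.

```python
def row_layout(table):
    """Row storage: concatenate the values of each row, empty cells included."""
    return [value for row in table for value in row]

def column_layout(table):
    """Columnar storage: walk the table column by column, keeping non-empty cells."""
    out = []
    for column in zip(*table):
        out.extend((row_no, value) for row_no, value in enumerate(column)
                   if value is not None)
    return out

# Hypothetical two-dimensional data table with two empty cells.
table = [
    ["alice", None, 30],
    ["bob", "Oslo", None],
]
```

For this table the row layout writes six entries, two of them empty, while the columnar layout writes only the four non-empty values, grouped by column.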
[0127] From the above, it can be seen that Embodiment 1 of the data processing device provided by the present invention performs a hash calculation on the data to be stored before storing it in columnar form, and deduplicates the data to be stored according to the calculated key value. This avoids data redundancy and wasted storage space when massive data is processed. In addition, the deduplicated data is stored in a columnar storage method: the non-empty column data in the two-dimensional data table is obtained and stored according to the data address order of the column data in the two-dimensional data table, which saves further storage space.
[0128] Refer to Figure 7, which shows a schematic structural diagram of Embodiment 2 of a data processing apparatus provided by the present invention. Based on Embodiment 1 of the apparatus of the present invention, the apparatus further includes a receiving unit 605, wherein:
[0129] The receiving unit 605 is configured to receive a data query request, where the data query request includes a query condition, and to trigger the data calculation unit 601.
[0130] After receiving the data query request input by the user, the receiving unit 605 parses the query conditions included in the data query request.
[0131] Further, the data calculation unit 601 is further configured to perform a hash algorithm calculation on the query condition to obtain the query key value.
[0132] The data search unit 602 is further configured to search the key values of the data set for a key value that matches the query key value and, if one is found, to look up the column data corresponding to the query key value in the stored column data and return the found column data together with the row data corresponding to it.
[0133] It should be noted that the data calculation unit 601 performs a hash algorithm calculation on the query conditions and uses the result as the query key value, and the data search unit 602 performs data matching in the data set according to the query key value. When the data set contains a key value that matches the query key value, data satisfying the data query request has been stored in a columnar storage method; the column data corresponding to the query key value is then looked up in the stored column data, and the found column data is returned together with the row data corresponding to it. When the data set does not contain a key value that matches the query key value, data satisfying the data query request has not been stored, and the query can be ended.
[0134] From the foregoing, it can be seen that Embodiment 2 of the data processing device provided by the present invention is based on Device Embodiment 1. The data is deduplicated and then stored in a columnar storage method, which avoids data redundancy and saves storage space. Further, when data query statistics are performed, the data calculation unit 601 first calculates the query key value with the hash algorithm, and the data search unit 602 then determines whether the query key value corresponding to the query condition is in the data set, so as to determine whether the data to be queried exists in the stored data. When the stored data does not contain the data to be queried, the query is ended, which improves query efficiency. When the data to be queried is stored, the stored column data is read and queried, which reduces the distance the disk head has to move and improves the performance of data query statistics.
[0135] Refer to Figure 8, which shows a schematic structural diagram of Embodiment 3 of a data processing apparatus provided by the present invention. Based on Embodiment 1 of the apparatus of the present invention, the apparatus may further include:
[0136] The node setting unit 606 is configured to analyze the data attributes of the stored data, and set data nodes corresponding to the stored data according to the data attributes.
[0137] The data nodes set by the node setting unit 606 correspond one-to-one with the stored data, and each data node records statistical information about the stored data, such as a start identifier, an end identifier, a maximum value, and/or a minimum value.
[0138] The request receiving unit 607 is configured to receive a data query request, where the data query request includes query conditions.
[0139] The node dividing unit 608 is configured to divide the data nodes into related nodes, suspicious nodes, and irrelevant nodes according to the query conditions, and to trigger the data search unit 602. The data search unit 602 queries the suspicious nodes for the data nodes corresponding to the query condition and returns the stored data corresponding to those data nodes, and at the same time returns the stored data corresponding to the related nodes.
[0140] The query conditions can carry information related to the statistical information in the data nodes, and the data nodes are divided according to this information into three types:
[0141] Data nodes that meet the query conditions are regarded as related nodes;
[0142] Data nodes that partially meet the query conditions are regarded as suspicious nodes;
[0143] Data nodes that do not meet the query conditions are regarded as irrelevant nodes.
[0144] It can be seen that the suspicious nodes divided out by the node dividing unit 608 only partially satisfy the query conditions, so the suspicious nodes are queried for the data nodes corresponding to the query conditions, and the stored data corresponding to those data nodes is returned. All data nodes among the related nodes satisfy the query conditions, so the stored data corresponding to the related nodes is returned directly.
[0145] It can be seen from the above that Embodiment 3 of the data processing device provided by the present invention divides the stored data into blocks according to the data query request, that is, the data query is performed only on the data blocks that meet the data query request. This reduces the number of data queries and thereby improves the performance of data query statistics.