Data processing method, device and system

A data processing technology, applied in the field of data processing, that addresses problems such as wasted storage space and data redundancy, and achieves the effects of avoiding waste, reducing the moving distance of the magnetic head, and improving the performance of query statistics.

Status: Inactive
Publication Date: 2013-09-11
上海淼云文化传播有限公司
0 Cites 26 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0005] The technical problem to be solved by the present invention is to provide a data processing method, device and system to solve the technical p...

Method used

By applying the above technical solution, the queried data is cached. On the next query, the cached data is checked first for data that satisfies the data query request; only when the cached data does not contain such data is the query performed against the stored data. This avoids repeated queries for the same data and improves the efficiency of data query.
As can be seen from the above, Embodiment 4 of the data processing method provided by the present invention first stores the data to be stored in memory, which has a faster read and write speed, and then performs deduplicated storage on the data in the memory. The high read and write performance of the memory greatly improves the data throughput rate and the data processing speed when storing data, especially for mass data storage.
As can be seen from the above, Embodiment 1 of the data processing device provided by the present invention performs a hash calculation on the data to be stored before storing it in columnar form, and deduplicates the data to be stored according to the calculated key value, thereby avoiding data redundancy and waste of storage space when processing massive data. In addition, the deduplicated data is stored using a columnar storage method: the non-empty column data in the two-dimensional data table is obtained and stored in the order of the data addresses of the column data in the table, which further saves storage space.
As can be seen from the above, Embodiment 2 of the data processing device provided by the present invention builds on device Embodiment 1: data is deduplicated and then stored using a columnar storage method, which avoids data redundancy and saves storage space. Further, when data query statistics are performed, the data calculation unit 601 first calculates the query key value through the hash algorithm, and the data search unit 602 then judges whether the query key value corresponding to the query condition is in the data set, so as to determine whether the data to be queried exists in the stored data. When it does not, the query is ended, which improves the efficiency of data query; when it does, the stored column data is read and queried, which reduces the moving distance of the magnetic head in the disk and improves the performance of data query statistics.
As can be seen from the above, Embodiment 4 of the data processing device provided by the present invention uses the data caching unit to cache the data that is queried and returned. On the next query, the cached data is checked first for data that satisfies the data query request; only when the cached data does not contain such data is the query performed against the stored data. This avoids repeated queries on the same data and improves the efficiency of data query.
As can be seen from the above, in Embodiment 5 of the data processing device provided by the present invention, the data pre-storage unit stores the data to be stored in memory before deduplicated storage, and the high read-write speed of the memory improves the data throughput rate of the device; when processing massive d...

Abstract

The invention provides a data processing method, device and system. The data processing method comprises the following steps: a hash algorithm calculation is performed on the to-be-stored data to obtain its key value; a preset data set is searched for a key value matching that key value; if a match is found, the to-be-stored data is discarded; otherwise, the to-be-stored data is stored using a columnar storage method and its key value is stored in the data set. According to the data processing method, device and system provided by the embodiments of the invention, the hash calculation is performed before the to-be-stored data is stored in columnar form, and the to-be-stored data is deduplicated based on the obtained key value, so that data redundancy and waste of storage space are avoided when processing massive data.
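The store path described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `DedupColumnStore`, the use of SHA-256, and the canonical rendering of a record are hypothetical choices, since the source does not name a specific hash function or data layout.

```python
import hashlib


class DedupColumnStore:
    """Toy store: hash each record, skip duplicates, keep values per column."""

    def __init__(self):
        self.key_set = set()   # stands in for the "preset data set" of key values
        self.columns = {}      # column name -> list of (row_id, value)
        self.next_row_id = 0

    @staticmethod
    def key_of(record):
        # Assumption: the key is SHA-256 over a canonical rendering of the record,
        # so that field order does not change the key.
        canonical = repr(sorted(record.items())).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def store(self, record):
        key = self.key_of(record)
        if key in self.key_set:
            return False       # matching key found: discard the record
        row_id = self.next_row_id
        self.next_row_id += 1
        for name, value in record.items():
            # Columnar layout: each column keeps its own run of values.
            self.columns.setdefault(name, []).append((row_id, value))
        self.key_set.add(key)  # remember the key for later duplicate checks
        return True
```

Calling `store` twice with the same record returns `True` the first time and `False` the second, and the column lists grow only once.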

Examples

Example Embodiment

[0083] Referring to Figure 3, which shows a flowchart of Embodiment 2 of a data processing method provided by the present invention. Method Embodiment 2 builds on Method Embodiment 1 and further includes the following steps:
[0084] Step 301: Receive a data query request, where the data query request includes query conditions.
[0085] After the data query request input by the user is received, the query conditions included in the data query request are parsed.
[0086] Step 302: Perform hash algorithm calculation on the query condition to obtain the query key value.
[0087] It should be noted that the hash algorithm calculation on the query conditions must use the same hash algorithm as described in Method Embodiment 1 of the present invention.
[0088] Step 303: Search the key values of the data set for a key value that matches the query key value; if one is found, go to step 304, otherwise go to step 305.
[0089] The result of applying the hash algorithm to the query condition is used as the query key value, and this query key value is matched against the data set. That is, it is determined whether the data set holding the key values of the stored data contains a key value that matches the query key value, and thereby whether data satisfying the data query request has already been stored.
[0090] Step 304: Find the column data corresponding to the query key value in the stored column data, and return the found column data and the row data corresponding to the column data.
[0091] When the data set contains a key value that matches the query key value of the query condition, the data to be queried has been stored. The column data corresponding to the query key value is therefore looked up in the stored column data, and the found column data and the row data corresponding to it are returned.
[0092] Step 305: End this data query.
[0093] When the data set does not contain a key value that matches the query key value of the query condition, it means that the data to be queried is not stored. At this time, the data query is ended.
[0094] In summary, the query condition is processed with the hash algorithm described in Method Embodiment 1, the result is used as the query key value, and data matching is performed in the data set according to this query key value. When the data set contains a key value that matches the query key value, data satisfying the data query request has been stored using the columnar storage method; the column data corresponding to the query key value is then looked up in the stored column data, and the found column data and its corresponding row data are returned. When the data set does not contain a key value that matches the query key value, the data to be queried is not stored, and the data query is ended.
[0095] It can be seen from the above that Method Embodiment 2 provided by the present invention builds on Method Embodiment 1: data is deduplicated and then stored using a columnar storage method, which avoids data redundancy and saves storage space. Further, when data query statistics are performed, the query key value is first calculated through the hash algorithm, and it is then determined whether the query key value corresponding to the query condition is in the data set, so as to determine whether the data to be queried exists in the stored data. When it does not, the query is ended, which improves the efficiency of data query. When it does, the stored column data is read and queried, which reduces the distance the head moves in the disk and improves the performance of data query statistics.
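Steps 301 to 305 can be illustrated with a small sketch. It rests on assumptions that the source leaves open: the key is computed over one designated field (`key_column`, a hypothetical parameter) so that a query condition can be hashed to the same key, the data set also maps each key to a row identifier, and SHA-256 stands in for the unspecified hash algorithm.

```python
import hashlib


def key_of(value):
    # Assumption: the same hash is used for storing and for querying.
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


class KeyedColumnStore:
    def __init__(self):
        self.key_to_row = {}  # key value -> row id (the "data set" plus a locator)
        self.columns = {}     # column name -> {row id: value}, non-empty cells only

    def store(self, record, key_column):
        key = key_of(record[key_column])
        if key in self.key_to_row:
            return False                      # duplicate: discard
        row_id = len(self.key_to_row)
        self.key_to_row[key] = row_id
        for name, value in record.items():
            if value is not None:
                self.columns.setdefault(name, {})[row_id] = value
        return True

    def query(self, condition):
        key = key_of(condition)               # step 302: hash the query condition
        row_id = self.key_to_row.get(key)     # step 303: look for a matching key
        if row_id is None:
            return None                       # step 305: nothing stored, end query
        # Step 304: gather the column values that belong to the matching row.
        return {name: col[row_id]
                for name, col in self.columns.items() if row_id in col}
```

A query for a condition whose key is absent returns immediately without touching any column data, which is the early exit the paragraph above describes.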
[0096] Referring to Figure 4, which shows a flowchart of Embodiment 3 of a data processing method provided by the present invention. Based on Embodiment 1 of the method of the present invention, the method may further include the following steps:
[0097] Step 401: Analyze the data attributes of the stored data, and set data nodes corresponding to the stored data according to the data attributes.
[0098] The data nodes correspond one-to-one with the stored data, and each data node records statistical information about the stored data, such as a start identifier, an end identifier, a maximum value, and/or a minimum value.
[0099] Step 402: Receive a data query request, where the data query request includes query conditions.
[0100] Step 403: According to the query conditions, the data nodes are divided into related nodes, suspicious nodes, and irrelevant nodes.
[0101] Wherein, the query conditions can carry information related to the statistical information in the above data nodes, and the data nodes are divided according to this information to obtain three types of data nodes, respectively:
[0102] Data nodes that meet the query conditions are regarded as related nodes;
[0103] Data nodes that partially meet the query conditions are regarded as suspicious nodes;
[0104] Data nodes that do not meet the query conditions are regarded as irrelevant nodes.
[0105] Step 404: Query the suspicious nodes for the data nodes corresponding to the query condition and return the stored data corresponding to those data nodes; at the same time, return the stored data corresponding to the related nodes.
[0106] As follows from step 403, the suspicious nodes only partially satisfy the query condition, so the data nodes among them are queried for those that correspond to the query condition, and the stored data corresponding to the data nodes found is returned. All data nodes among the related nodes satisfy the query condition, so the stored data corresponding to those data nodes is returned.
[0107] It can be seen from the above that Embodiment 3 of the data processing method provided by the present invention divides the stored data into blocks according to the data query request and performs the data query only on the data blocks that meet the request, which reduces the number of data queries and thereby improves data query statistics performance.
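The node classification of steps 401 to 404 can be sketched for a range condition as follows. The block size, the `DataNode` fields, and the restriction to a numeric range `low..high` are assumptions made for illustration; the source only says that nodes record statistics such as start and end identifiers and maximum and minimum values.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    start: int     # start identifier: index of the first value in the block
    end: int       # end identifier: index one past the last value
    minimum: int
    maximum: int


def build_nodes(values, block_size):
    """Step 401: one node of statistics per block of stored values."""
    nodes = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        nodes.append(DataNode(start, start + len(block), min(block), max(block)))
    return nodes


def classify(node, low, high):
    """Step 403: compare the node statistics with the range low..high."""
    if node.maximum < low or node.minimum > high:
        return "irrelevant"   # no value in the block can match
    if low <= node.minimum and node.maximum <= high:
        return "related"      # every value in the block matches
    return "suspicious"       # only some values may match


def range_query(values, nodes, low, high):
    """Step 404: return related blocks whole, scan only suspicious blocks."""
    result = []
    for node in nodes:
        kind = classify(node, low, high)
        block = values[node.start:node.end]
        if kind == "related":
            result.extend(block)
        elif kind == "suspicious":
            result.extend(v for v in block if low <= v <= high)
    return result
```

Irrelevant blocks are never read, and related blocks are returned without per-value checks, so only suspicious blocks cost a scan.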
[0108] It should be noted that, based on the above-mentioned method embodiment 2 or method embodiment 3 of the present invention, the method may further include: storing the returned data in a preset cache data set.
[0109] After a successful data query is completed, the returned data is stored as cache data in the cache data set. When a data query request is received again, it is first determined whether the cached data set contains data corresponding to the data query request. If so, data meeting the request has been queried before, and the data query is performed directly in the cached data set. If not, such data has not been queried before, and the data query is performed in the stored data.
[0110] By applying the above technical solution, the queried data is cached. On the next query, the cached data is checked first for data that meets the data query request; only when the cached data does not contain such data is the query performed against the stored data. This avoids repeated querying of the same data and improves the efficiency of data query.
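The cache-first lookup can be sketched in a few lines. The `CachedQuery` class and the `backend` callable are hypothetical names; the sketch assumes the query condition is hashable and that only successful queries are cached, as paragraph [0109] suggests.

```python
class CachedQuery:
    """Check the cache data set first; fall back to the stored data on a miss."""

    def __init__(self, backend):
        self.backend = backend   # callable: condition -> result, or None if absent
        self.cache = {}          # the "preset cache data set"
        self.backend_calls = 0   # counts how often the stored data was queried

    def query(self, condition):
        if condition in self.cache:
            return self.cache[condition]   # served from cache, no backend access
        self.backend_calls += 1
        result = self.backend(condition)
        if result is not None:             # cache only successful queries
            self.cache[condition] = result
        return result
```

Repeating a successful query touches the backend once; repeating an unsuccessful one touches it every time, because nothing was cached.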
[0111] Referring to Figure 5, which shows a flowchart of Embodiment 4 of a data processing method provided by the present invention. Based on the foregoing method embodiment of the present invention, the following steps may further be included before the hash algorithm calculation is performed on the data to be stored:
[0112] Step 501: Store the data to be stored in a preset memory.
[0113] Wherein, during data storage, the data to be stored is first stored in the memory, and then the data in the memory is deduplicated and stored.
[0114] Steps 502 to 505 are the same as steps 101 to 104 of the foregoing method embodiment of the present invention, respectively, and are not described again.
[0115] It can be seen from the above that Embodiment 4 of the data processing method provided by the present invention first stores the data to be stored in a memory with a faster read and write speed, and then performs deduplicated storage on the data in the memory. The high read and write performance of the memory greatly improves the data throughput rate and the data processing speed during data storage, especially mass data storage.
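A minimal sketch of staging data in memory before deduplicated storage is given below. The `StagedWriter` class is hypothetical, items are assumed to be strings, SHA-256 is an assumed hash, and a plain list stands in for the columnar store.

```python
import hashlib
from collections import deque


class StagedWriter:
    """Accept writes into an in-memory buffer, then deduplicate on flush."""

    def __init__(self):
        self.buffer = deque()  # step 501: fast in-memory staging area
        self.keys = set()      # key values of data already stored
        self.stored = []       # stand-in for the columnar store

    def accept(self, item):
        # Accepting is cheap: no hashing or lookup happens on this path.
        self.buffer.append(item)

    def flush(self):
        """Steps 502 to 505: hash, look up, discard duplicates, store the rest."""
        written = 0
        while self.buffer:
            item = self.buffer.popleft()
            key = hashlib.sha256(item.encode("utf-8")).hexdigest()
            if key in self.keys:
                continue       # duplicate: discard
            self.keys.add(key)
            self.stored.append(item)
            written += 1
        return written
```

Separating `accept` from `flush` lets producers write at memory speed while the deduplication work runs afterwards on the buffered data.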
[0116] Referring to Figure 6, which shows a schematic structural diagram of Embodiment 1 of a data processing device provided by the present invention, which is used to implement Embodiment 1 of the above-mentioned method of the present invention. The device may include:
[0117] The data calculation unit 601 is configured to perform hash algorithm calculation on the data to be stored to obtain the key value of the data to be stored.
[0118] When performing the hash algorithm calculation, the data calculation unit 601 may select one hash function algorithm to apply to the data to be stored. To improve the accuracy of data deduplication and of data query, multiple hash function algorithms may also be selected, each applied separately to the data to be stored, with the calculation results used as the key value of the data to be stored. It should be noted that the data to be stored may be a single piece of data or a data block composed of multiple pieces of data.
[0119] The data search unit 602 is configured to receive the key value sent by the data calculation unit 601, and search for whether there is a key value matching the key value in a preset data set.
[0120] The data deduplication unit 603 is configured to discard the data to be stored when the data search unit finds a key value matching the key value in a preset data set.
[0121] The data storage unit 604 is configured to store the data to be stored using a columnar storage method when the data search unit 602 does not find a key value matching the key value in the preset data set, and to store the key value of the data to be stored in the data set.
[0122] The data search unit 602 uses the key value calculated by the data calculation unit 601 to perform data matching in the data set. That is, it determines whether the data set holding the key values of the stored data contains a key value that matches the key value of the data to be stored, and thereby whether the data to be stored has already been stored.
[0123] When the data search unit 602 finds in the data set a key value matching the key value of the data to be stored, the data to be stored is judged to have been stored already, and the data deduplication unit 603 discards it. When the data search unit 602 finds no such key value, the data to be stored is judged not to have been stored; the data storage unit 604 then stores it using a columnar storage method and stores its key value in the data set.
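The option of using multiple hash function algorithms for the key value, mentioned for the data calculation unit 601, might look like the sketch below. Pairing SHA-256 with BLAKE2b is an assumption; the source does not say which functions are combined or how the results are joined.

```python
import hashlib


def composite_key(data: bytes):
    """Key value built from two independent hash functions.

    Two different inputs would have to collide under both functions at once
    to be mistaken for duplicates, which makes a false match less likely.
    """
    return (hashlib.sha256(data).hexdigest(),
            hashlib.blake2b(data).hexdigest())
```

The tuple can be placed in a set or used as a dictionary key exactly like a single digest.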
[0124] The data storage unit 604 is specifically configured to convert the data structure of the data to be stored into that of a two-dimensional data table, obtain the non-empty column data of each column in the two-dimensional data table, and store the obtained non-empty column data in the order of the data addresses of the columns in the two-dimensional data table.
[0125] It should be noted that data is stored logically in the data structure of a two-dimensional data table, whereas on disk it is stored as a one-dimensional sequence of bytes. The row storage method commonly used in the prior art concatenates the data values of each row in the two-dimensional data table and writes them to disk, then stores the next row, and so on. When the two-dimensional data table contains empty data, however, that empty data is also written to disk.
[0126] In the device embodiment of the present invention, the data storage unit 604 adopts a columnar storage method: the non-empty column data of each column in the two-dimensional data table is obtained and stored sequentially according to the data addresses of the column data in the two-dimensional data table.
[0127] From the above, it can be seen that Embodiment 1 of the data processing device provided by the present invention performs a hash calculation on the data to be stored before storing it in columnar form, and deduplicates the data to be stored according to the calculated key value, which avoids data redundancy and waste of storage space when processing massive data. In addition, the deduplicated data is stored using a columnar storage method: the non-empty column data in the two-dimensional data table is obtained and stored in the order of the data addresses of the column data in the table, which further saves storage space.
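The difference between row storage and columnar storage of non-empty cells can be illustrated as follows. Representing an empty cell as `None` and keeping a `(row_index, value)` pair per stored cell are assumptions made for this sketch; the source does not describe the on-disk encoding.

```python
def to_columns(table, column_names):
    """Convert a two-dimensional table into per-column lists of non-empty cells."""
    columns = {}
    for col_index, name in enumerate(column_names):
        columns[name] = [
            (row_index, row[col_index])
            for row_index, row in enumerate(table)
            if row[col_index] is not None   # empty cells are not stored
        ]
    return columns


def row_layout_cells(table):
    """Row storage writes every cell of every row, empty or not."""
    return sum(len(row) for row in table)


def column_layout_cells(columns):
    """Columnar storage, as sketched here, writes only the non-empty cells."""
    return sum(len(cells) for cells in columns.values())
```

For a sparse table the second count is smaller than the first, which is the storage saving the paragraphs above attribute to storing only non-empty column data.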
[0128] Referring to Figure 7, which shows a schematic structural diagram of Embodiment 2 of a data processing apparatus provided by the present invention. Based on Embodiment 1 of the apparatus of the present invention, the apparatus further includes a receiving unit 605, wherein:
[0129] The receiving unit 605 is configured to receive a data query request, where the data query request includes a query condition, and to trigger the data calculation unit 601.
[0130] Wherein, after receiving the data query request input by the user, the receiving unit 605 parses the query conditions included in the data query request.
[0131] Further, the data calculation unit 601 is further configured to perform a hash algorithm calculation on the query condition to obtain the query key value.
[0132] The data search unit 602 is further configured to search the key values of the data set for a key value that matches the query key value and, if one is found, to look up the column data corresponding to the query key value in the stored column data and return the found column data and the row data corresponding to it.
[0133] It should be noted that the data calculation unit 601 performs a hash algorithm calculation on the query conditions and the result is used as the query key value; the data search unit 602 then performs data matching in the data set according to the query key value. When the data set contains a key value that matches the query key value, data satisfying the data query request has been stored using a columnar storage method; the column data corresponding to the query key value is then looked up in the stored column data, and the found column data and its corresponding row data are returned. When the data set does not contain a key value that matches the query key value, data satisfying the data query request has not been stored, and the query can be ended.
[0134] From the foregoing, it can be seen that Embodiment 2 of the data processing device provided by the present invention builds on device Embodiment 1: data is deduplicated and then stored using a columnar storage method, which avoids data redundancy and saves storage space. Further, when data query statistics are performed, the data calculation unit 601 first calculates the query key value through the hash algorithm, and the data search unit 602 then determines whether the query key value corresponding to the query condition is in the data set, so as to determine whether the data to be queried exists in the stored data. When it does not, the query is ended, which improves the efficiency of data query. When it does, the stored column data is read and queried, which reduces the moving distance of the magnetic head in the disk and improves the performance of data query statistics.
[0135] Referring to Figure 8, which shows a schematic structural diagram of Embodiment 3 of a data processing apparatus provided by the present invention. Based on Embodiment 1 of the apparatus of the present invention, the apparatus may further include:
[0136] The node setting unit 606 is configured to analyze the data attributes of the stored data, and set data nodes corresponding to the stored data according to the data attributes.
[0137] The data nodes set by the node setting unit 606 correspond one-to-one with the stored data, and each data node records statistical information about the stored data, such as a start identifier, an end identifier, a maximum value, and/or a minimum value.
[0138] The request receiving unit 607 is configured to receive a data query request, where the data query request includes query conditions.
[0139] The node dividing unit 608 is configured to divide the data nodes into related nodes, suspicious nodes, and irrelevant nodes according to the query conditions, and to trigger the data search unit 602. The data search unit 602 queries the suspicious nodes for the data nodes corresponding to the query condition and returns the stored data corresponding to those data nodes, and at the same time returns the stored data corresponding to the related nodes.
[0140] Wherein, the query conditions can carry information related to some statistical information in the above data nodes, and the data nodes can be classified according to this information to obtain three types of data nodes, respectively:
[0141] Data nodes that meet the query conditions are regarded as related nodes;
[0142] Data nodes that partially meet the query conditions are regarded as suspicious nodes;
[0143] Data nodes that do not meet the query conditions are regarded as irrelevant nodes.
[0144] The suspicious nodes divided by the node dividing unit 608 only partially satisfy the query conditions, so the data nodes among them are queried for those that correspond to the query conditions, and the stored data corresponding to the data nodes found is returned. All data nodes among the related nodes satisfy the query condition, so the stored data corresponding to those data nodes is returned.
[0145] It can be seen from the above that Embodiment 3 of the data processing device provided by the present invention divides the stored data into blocks according to the data query request and performs the data query only on the data blocks that meet the request, which reduces the number of data queries and thereby improves data query statistics performance.

[0146] Referring to Figure 9, which shows a schematic structural diagram of Embodiment 4 of a data processing apparatus provided by the present invention. Based on Embodiment 2 of the apparatus of the present invention, the apparatus may further include:
[0147] The data cache unit 609 is configured to store the data returned by the data search unit 602 into a preset cache data set.
[0148] After the data search unit 602 finishes a successful data query, the returned data is stored as cache data in the cache data set. When the receiving unit 605 receives a data query request again, the data caching unit 609 first determines whether the cached data set contains data corresponding to the data query request. If so, the requested data has been queried before, and the data search unit 602 can query the data directly in the cached data set. If not, the requested data has not been queried before, and the data search unit 602 searches the stored data.
[0149] Referring to Figure 10, the present invention also provides another schematic structural diagram of Embodiment 4 of a data processing apparatus. Based on Embodiment 3 of the apparatus of the present invention, the apparatus may further include the data caching unit 609.
[0150] After the data search unit 602 ends a data query, the data caching unit 609 stores the returned data as cache data in the cache data set. When the request receiving unit 607 receives a data query request again, the data caching unit 609 first determines whether the cached data set contains data corresponding to the data query request. If so, the requested data has been queried before, and the data search unit 602 can perform the data query in the cache data set. If not, the requested data has not been queried before, and the node dividing unit 608 triggers the data search unit 602 to search the stored data.
[0151] It can be seen from the above that Embodiment 4 of the data processing device provided by the present invention caches the queried and returned data through the data caching unit. On the next query, the cached data is checked first for data that meets the data query request; only when the cached data does not contain such data is the query performed against the stored data. This avoids repeated querying of the same data and improves data query efficiency.
[0152] Referring to Figure 11, which shows a schematic structural diagram of Embodiment 5 of a data processing apparatus provided by the present invention. Based on the above-mentioned Embodiment 1 of the apparatus of the present invention, the apparatus may further include a data pre-storage unit 610 for storing the data to be stored in a preset memory.
[0153] Before the data calculation unit 601 performs the calculation on the data to be stored, the data pre-storage unit 610 first stores the data to be stored in the memory. The data calculation unit 601, the data search unit 602, the data deduplication unit 603, and the data storage unit 604 then deduplicate and store the data held in the memory.
[0154] It can be seen from the above that Embodiment 5 of the data processing device provided by the present invention uses a data pre-storage unit to store the data to be stored in the memory before deduplication and storage. The high read and write speed of the memory improves the data throughput rate of the device and significantly improves the efficiency of data processing when handling massive data.
[0155] Corresponding to the above-mentioned device of the present invention, the present invention also provides a data processing system. Figure 12 shows an architecture diagram of an embodiment of the system of the present invention. The system includes a data processing device 1201, a memory storage device 1202, a cache storage device 1203, and a data storage device 1204, wherein:
[0156] For the structure and function of the data processing device 1201, refer to the description in the device embodiments above; it is not repeated in this embodiment. The data storage device 1204 includes a database cluster, which may be a relational database cluster or a database cluster built on the HADOOP distributed system infrastructure. The relational database cluster may adopt a cluster mode with read-write separation, and a backup database may also be provided.
[0157] The memory storage device 1202 is configured to store the data to be stored before the hash algorithm calculation is performed on it. The memory storage device 1202 may include an in-memory database cluster, which can read and write efficiently in a high-concurrency environment, strikes a balance between reliable persistence and high access performance, and performs far better than disk storage. The data processing device 1201 stores the data to be stored in the in-memory database cluster before deduplicating and storing it.
[0158] The cache storage device 1203 is used to hold the returned data that the data processing device 1201 stores as cache data. The cache storage device 1203 may include a cache database cluster.
[0159] The data storage device 1204 is used to store data to be stored in a column storage method.
