A data filtering method and apparatus

By dividing and storing data in batches and filtering and deduplicating comparative data before storage, the problem of low data filtering efficiency and insufficient accuracy in existing technologies is solved, and a highly efficient and accurate data filtering process is achieved.

CN114880531B · Active · Publication Date: 2026-06-30 · BAIRONG ZHIXIN (BEIJING) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BAIRONG ZHIXIN (BEIJING) TECH CO LTD
Filing Date: 2022-05-25
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, data filtering is inefficient and inaccurate. In particular, in multi-channel data processing systems, only one channel can be filtered at a time, which leads to data being repeatedly written to the database, affecting the efficiency of data interception and filtering.

Method used

The data to be stored is divided into batches, and each batch is stored in parallel through multiple transmission channels. Before storage, comparison data is obtained from the target database. The data of each batch is filtered and deduplicated through multiple transmission channels to ensure that the data participates in filtering before being written to the target cache and database.

Benefits of technology

It improves data filtering efficiency and accuracy, avoids data duplication, ensures that all data participates in the filtering process, and enhances the overall efficiency and accuracy of data transmission.

Abstract

This invention provides a data filtering method and apparatus. The method includes: dividing the data to be stored into batches, with each batch of data stored in parallel through multiple transmission channels; before storing the same batch of data, acquiring comparison data in a target database; filtering the data to be written to a target cache in each transmission channel based on the comparison data to obtain multiple first data to be written; deduplicating the multiple first data to be written based on the data in the target cache, and writing the deduplicated data into the target cache to obtain second data to be written; and transferring the second data to be written to the target database through multiple transmission channels. This invention can effectively improve the data filtering speed and ensure the accuracy of data filtering.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a data filtering method and apparatus.

Background Technology

[0002] With the rapid development of the internet industry, big data and high concurrency have become extremely common. When large amounts of data are stored, the quantity and variety of the data lead to significant redundancy. Data interception and verification are therefore often necessary to filter and clean the data, and only filtered and cleaned data can be persistently stored in the database. Current data processing systems process large amounts of data through multiple channels. If a global lock is set, only one channel can clean and filter the data at any given time, resulting in low data filtering efficiency. If no global lock is set for the channels, one data channel may be querying and filtering data in the database while another data channel is writing data to the database. In this case, the newly written data cannot participate in the interception and filtering, so duplicate data is written to the database, which makes data interception and filtering both inaccurate and inefficient.

Summary of the Invention

[0003] In view of this, the present invention provides a data filtering method and apparatus that can effectively improve data filtering efficiency while ensuring data filtering accuracy.

[0004] To achieve the above objectives, the present invention mainly provides the following technical solutions:

[0005] In a first aspect, the present invention provides a data filtering method, the method comprising:

[0006] The data to be stored is divided into batches, and each batch of data is stored in parallel by multiple transmission channels.

[0007] Before storing the same batch of data to be stored, obtain comparison data from the target database;

[0008] Based on the comparison data, the data to be written to the target cache of each transmission channel is filtered to obtain multiple first data to be written.

[0009] Based on the data in the target cache, multiple first data to be written are deduplicated, and the deduplicated data is written into the target cache to obtain second data to be written.

[0010] The second data to be written is transferred to the target database through multiple transmission channels.

[0011] In a second aspect, the present invention provides a data query device, the device comprising:

[0012] The determination module is used to determine the query parameters and query data source based on the received query request;

[0013] The acquisition module is used to acquire the data to be queried from the query data source using the query parameters;

[0014] The construction module is used to construct a query logic expression corresponding to the query parameters based on the data to be queried;

[0015] The query module is used to obtain query results from the data to be queried based on the query logic expression.

[0016] By employing the above technical solution, this invention provides a data filtering method and apparatus. Dividing the data to be stored into batches and transmitting each batch through multiple transmission channels effectively increases transmission efficiency. Furthermore, after the data to be stored is divided into batches and before a batch is transmitted through the multiple transmission channels, the comparison data corresponding to the current batch is obtained from the target database, and the data in the target cache is also obtained. The data to be written to the target cache by each transmission channel is then filtered twice, once against the comparison data and once against the data in the target cache, which effectively ensures the accuracy of data filtering. In the embodiments of this invention, determining the comparison data separately for each batch ensures that all data already written to the target database participates in the filtering of the data to be stored. This avoids two problems of the prior art. With a global lock, only one channel can clean and filter data at any given time. Without a global lock, one data channel may retrieve comparison data from the database and perform filtering while another data channel is simultaneously writing data to the database, so the newly written data cannot participate in interception and filtering and duplicate data is written to the database.

[0017] The above description is merely an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are described below.

Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a flowchart of a data filtering method disclosed in this invention;

[0020] Figure 2 is a flowchart of a batch division method disclosed in this invention;

[0021] Figure 3 is a flowchart of a method for determining whether second data to be written exists in a target cache, as disclosed in this invention;

[0022] Figure 4 is a flowchart of a method for determining whether all data corresponding to group identifier information has been written into the target database, as disclosed in this invention;

[0023] Figure 5 is a flowchart of a method for determining and obtaining comparison data disclosed in this invention;

[0024] Figure 6 is a flowchart of another data filtering method disclosed in this invention;

[0025] Figure 7 is a flowchart of a method for obtaining first data to be written, as disclosed in this invention;

[0026] Figure 8 is a schematic diagram of a data filtering device disclosed in this invention;

[0027] Figure 9 is a schematic diagram of another data filtering device disclosed in this invention.

Detailed Implementation

[0028] Exemplary embodiments of the invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this invention will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

[0029] With the rapid development of the internet industry, big data and high concurrency have become extremely common. When storing large amounts of data, it is often necessary to intercept and verify the data to filter and clean it. Only filtered and cleaned data can be persistently stored in the database. Current data processing systems process large amounts of data through multiple channels. If a global lock is set, only one channel can clean and filter the data at any given time. If a global lock is not set for the channels, it is easy for a situation to occur where one data channel is querying and filtering data in the database, while another data channel is writing data to the database. In this case, the newly written data cannot participate in the interception and filtering, resulting in duplicate data written to the database. This leads to inaccurate data interception and filtering, and low interception and filtering efficiency.

[0030] To address the above problems, embodiments of the present invention provide a data filtering method. As shown in Figure 1, the method includes:

[0031] Step 101: Divide the data to be stored into batches, and store each batch of data in parallel through multiple transmission channels.

[0032] Specifically, in the steps of this embodiment, after receiving the user-uploaded data to be stored, the data is divided into batches. Each batch contains several data groups, and each transmission channel transmits data within a single data group at a time. Multiple transmission channels are used simultaneously to transmit data from different data groups. Specifically, the number of data items in each batch can be set by the user or automatically generated. The number of transmission channels is also configured by the user. For example, if the user uploads 10,000 data items to be stored, this embodiment divides the data into two batches: the first batch and the second batch. Each batch contains 5,000 data items. Furthermore, each batch of 5,000 data items is divided into five data groups, each containing 1,000 data items. The number of transmission channels configured by the user is used to simultaneously store the data in the five data groups within each batch. Storing data simultaneously through multiple transmission channels effectively improves data storage efficiency.
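The batch and group division in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `chunk` and `divide_into_batches` are hypothetical, and the sizes are the ones used in the example above (10,000 items, batches of 5,000, groups of 1,000).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], size: int) -> List[List[T]]:
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def divide_into_batches(items: Sequence[T], batch_size: int,
                        group_size: int) -> List[List[List[T]]]:
    """Return a list of batches; each batch is a list of data groups."""
    return [chunk(batch, group_size) for batch in chunk(items, batch_size)]


# 10,000 uploaded items -> 2 batches of 5,000 -> 5 groups of 1,000 per batch.
batches = divide_into_batches(range(10_000), batch_size=5_000, group_size=1_000)
```

Each inner list is one data group, which a single transmission channel would handle at a time.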

[0033] Step 102: Before storing the same batch of data to be stored, obtain the comparison data in the target database.

[0034] Specifically, after executing step 101 and dividing the data into batches, different comparison data corresponding to different batches are read from the target database. Specifically, in the above embodiment, before storing the first batch of data, comparison data corresponding to the first batch of data is first read from the target database; similarly, before storing the second batch of data, comparison data corresponding to the second batch of data is first read from the target database. By obtaining comparison data from different batches, the efficiency of data filtering can be effectively improved while ensuring the accuracy of data filtering.

[0035] Step 103: Filter the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written.

[0036] Specifically, after executing step 102, the comparison data of each batch obtained in step 102 is used to filter the data of that batch, and the data that is duplicated with the comparison data in that batch is deleted to obtain multiple first data to be written. In the specific filtering process, each transmission channel compares each group of data to be transmitted with the comparison data to delete the data that is duplicated with the comparison data in each data group, thereby obtaining the first data to be written corresponding to each group of data.
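A minimal sketch of this first filtering, assuming records are hashable values and the comparison data has been loaded into a Python set. The function name `first_filter` and the sample records are hypothetical.

```python
from typing import Hashable, Iterable, List, Set


def first_filter(group: Iterable[Hashable], comparison: Set[Hashable]) -> List[Hashable]:
    """Drop every record of one data group that already appears in the comparison data."""
    return [record for record in group if record not in comparison]


# Snapshot of the target database taken before the batch is stored (sample values).
comparison_data = {"a", "b"}
# One data group per transmission channel (sample values).
groups = [["a", "c", "d"], ["b", "d", "e"]]
first_data = [first_filter(group, comparison_data) for group in groups]
```

Note that `"d"` survives in both groups: the first filtering only removes records duplicated with the comparison data, and duplicates between channels are left for the second filtering against the target cache.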

[0037] Step 104: Based on the data in the target cache, deduplicate the multiple first data to be written, and write the deduplicated data into the target cache to obtain the second data to be written.

[0038] Specifically, after step 103 is executed and each transmission channel has performed the first filtering on the data group it transmits, each transmission channel accesses the data stored in the target cache in turn and performs a second filtering on its first data to be written. This removes data that is duplicated among the first data to be written of different transmission channels, ensuring that the data transmitted by different transmission channels contains no duplicates. At the same time, the first data to be written that has passed the second filtering is written to the target cache, which yields the second data to be written. During the second filtering, the transmission channels access the target cache one after another, in the order determined by their transmission speed. Because the channels access the target cache sequentially, the data written to the target cache contains no duplicates, which guarantees the accuracy of data filtering and thus improves data transmission efficiency.
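The second filtering can be illustrated with an in-memory stand-in for the target cache. This is a sketch under assumptions: the text only says that channels access the cache sequentially, and a `threading.Lock` is used here as one possible way to get that behavior. The class name `TargetCache` and its methods are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Hashable, List, Set


class TargetCache:
    """In-memory stand-in for the target cache; a lock serializes channel access."""

    def __init__(self) -> None:
        self._records: Set[Hashable] = set()
        self._lock = threading.Lock()

    def write_new(self, first_data: List[Hashable]) -> List[Hashable]:
        """Write records not yet cached and return them (the second data to be written)."""
        written = []
        with self._lock:
            for record in first_data:
                if record not in self._records:
                    self._records.add(record)
                    written.append(record)
        return written

    def snapshot(self) -> Set[Hashable]:
        """Return a copy of the cached records."""
        with self._lock:
            return set(self._records)


cache = TargetCache()
first_data = [["c", "d"], ["d", "e"], ["e", "f"]]  # one list per channel (sample values)
with ThreadPoolExecutor(max_workers=3) as pool:
    second_data = list(pool.map(cache.write_new, first_data))
```

Which channel ends up keeping a shared record such as `"d"` depends on thread scheduling, but each record is written to the cache exactly once. The lock covers only the short cache update, not the database query, so it is not the global lock the background section describes.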

[0039] Step 105: Transfer the second data to be written to the target database through multiple transmission channels.

[0040] Specifically, after step 104 is executed and all the second data to be written has been written to the target cache, the second data to be written stored in the target cache is written to the target database through the multiple transmission channels. After all the second data to be written has been written to the target database, the second data to be written stored in the target cache is deleted.
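The ordering in step 105 (write everything to the database first, clear the cache only afterwards) can be sketched as follows. The `TargetDatabase` class, the `flush` function, and the dictionary layout of the cache (group identifier mapped to records) are assumptions made for illustration, not the patent's data model.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Set


class TargetDatabase:
    """In-memory stand-in for the target database."""

    def __init__(self) -> None:
        self.rows: Set[str] = set()
        self._lock = threading.Lock()

    def insert_many(self, records: Iterable[str]) -> None:
        with self._lock:
            self.rows.update(records)


def flush(cache: Dict[int, List[str]], db: TargetDatabase, channels: int) -> None:
    """Write each cached data group to the database in parallel, then clear the cache."""
    with ThreadPoolExecutor(max_workers=channels) as pool:
        # list() waits for every write and re-raises any worker exception.
        list(pool.map(db.insert_many, cache.values()))
    # Reached only after all groups were written successfully.
    cache.clear()


cache = {1: ["c", "d"], 2: ["e"]}  # second data to be written, keyed by group id
db = TargetDatabase()
flush(cache, db, channels=2)
```

If any write fails, the exception propagates before `cache.clear()` runs, so the cached second data is still available for a retry.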

[0041] Furthermore, embodiments of the present invention also provide a batch partitioning method, which details the steps performed before "obtaining comparison data from the target database" in step 102 of the embodiment shown in Figure 1. As shown in Figure 2, the method further includes:

[0042] Step 201: Before any batch of data has been completely stored, determine whether there is second data to be written in the target cache.

[0043] Specifically, before executing step 102 and before obtaining the comparison data from the target database, it is first determined whether the data in the currently transmitted batch has been completely stored. If it is determined that the data in the current batch has not been completely stored in the target database, the target cache is checked to determine whether there is any second data to be written in the target cache. Specifically, in this embodiment, for example, each batch of data contains 5000 data entries. When transmitting each batch of data, the remaining data in each batch is counted. When the remaining data in the current transmitted batch of data is not 0, it is determined that the current batch of data has not been completely stored. When the data in the current transmitted batch of data has not been completely stored, it is checked in real time whether there is any second data to be written in the target cache. By determining whether there is any second data to be written in the target cache when any batch of data has not been completely stored, the data transmission status can be accurately grasped, thereby improving data transmission efficiency.

[0044] Step 202: Determine the remaining data to be stored in this batch as a new batch and store it.

[0045] Specifically, after executing step 201, if it is determined in step 201 that there is no second data to be written in the target cache, it is determined that the current batch of data transmission has been interrupted. Therefore, when the current batch of data transmission is interrupted, the remaining data to be stored in that batch is determined as a new batch and stored. Specifically, if the current batch of data contains 5000 data entries, and the remaining data to be stored in that batch is 2000, and there is no second data to be written in the target cache, it is determined that the current batch of data transmission has been interrupted. Simultaneously, the data in the target database has changed. Therefore, the remaining data to be transmitted in that batch is determined as a new batch and stored. The comparison data for the newly divided batch is retrieved again. This ensures that comparison data is retrieved before any batch is written to the target database, thus guaranteeing the accuracy of data filtering and further improving data filtering efficiency.

[0046] Step 203: Do not re-divide the remaining data to be stored in this batch.

[0047] Specifically, after executing step 201, if it is determined in step 201 that data continues to exist in the target cache, it is determined that the current batch of data has not been interrupted. During the filtering process of the current data, the comparison data in the target database will not change, so there is no need to re-divide the remaining data to be stored in this batch.
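The decision made in steps 201 to 203 can be written as a small function. This is a hedged sketch: the function name `plan_remaining` and the action labels are hypothetical, and the check of the target cache is reduced to a boolean argument.

```python
from typing import List, Tuple


def plan_remaining(remaining: List[int], cache_has_second_data: bool) -> Tuple[str, List[int]]:
    """Decide what to do with the not-yet-stored data of the current batch."""
    if not remaining:
        return ("done", [])                # the batch has been completely stored
    if cache_has_second_data:
        return ("continue", remaining)     # step 203: not interrupted, keep the batch
    # Step 202: transmission was interrupted, so the remainder becomes a new
    # batch and its comparison data must be read again before it is stored.
    return ("new_batch", remaining)


# Example from the text: a batch of 5000 entries with 2000 still to store
# and no second data to be written left in the target cache.
remaining = list(range(3000, 5000))
action, new_batch = plan_remaining(remaining, cache_has_second_data=False)
```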

[0048] Furthermore, embodiments of the present invention also provide a method for determining whether second data to be written exists in the target cache. This method details step 201 of the embodiment shown in Figure 2, "determining whether there is second data to be written in the target cache before any batch of data is stored." As shown in Figure 3, the method includes:

[0049] Step 301: Obtain the group identifier information of the data group transmitted in each of the transmission channels, and store the group identifier information in the identifier information table.

[0050] The identification information table is pre-stored in the target cache, and the group identification information is not present in the pre-stored identification information table. The group identification information is used to identify different data groups in the same batch of data.

[0051] Specifically, when multiple transmission channels transmit a batch of data in parallel, the group identifier information of the data group transmitted by each transmission channel is obtained and temporarily stored in a pre-built identifier information table. After the data to be stored is divided into batches, each batch is further divided into several data groups, and each data group is marked with different group identifier information. When the transmission channels transmit the data groups that make up a batch, the group identifier information of each data group is read to determine which data group each transmission channel is transmitting, and the obtained group identifier information is stored in the identifier information table. Taking the above embodiment as an example, the group identifier information of the five data groups in the first batch is 1, 2, 3, 4 and 5 respectively. If the data groups currently transmitted by the parallel transmission channels have group identifier information 1, 2 and 3, then the identifier information table contains 1, 2 and 3. If they have group identifier information 2, 3 and 5, then the identifier information table contains 2, 3 and 5, and so on.

[0052] Step 302: Determine whether all the data corresponding to each group of identification information in the identification information table has been written into the target database.

[0053] Specifically, after executing step 301, the group identification information transmitted through each transmission channel is stored in the identification information table. It is then determined whether all data corresponding to each group identification information has been written to the target database. If it is determined that all data corresponding to any group identification information has been written to the target database, the corresponding group identification information is deleted from the identification information table. Specifically, in this embodiment, the second data to be written in the target cache contains five group identification information entries: 1, 2, 3, 4, and 5. For example, if it is determined that all data with group identification information 1 has been written to the target database, then the group identification information in the identification information table will be 2, 3, 4, and 5 respectively. Simultaneously, the group identification information for data transmitted through the idle transmission channel is re-acquired, and the re-acquired group identification information is stored in the identification information table.

[0054] Step 303: Delete the corresponding group identification information from the identification information table.

[0055] Specifically, after executing step 302, if it is determined in step 302 that all data corresponding to any group identification information has been written to the target database, that group identification information is deleted from the identification information table. For example, when three transmission channels (a, b, and c) are used for parallel data transmission and the identification information table contains three pieces of group identification information (1, 2, and 3), once it is determined in step 302 that all data corresponding to group identification information 1 has been written to the target database, group identification information 1 is deleted from the identification information table. The resulting identification information table contains entries 2 and 3. Through the identification information table, it is possible to accurately determine whether there is second data to be written in the target cache, and thereby whether transmission of the current batch of data has been interrupted. Furthermore, comparison data can be obtained separately for different batches of data, ensuring the accuracy of data filtering and improving data filtering efficiency.

[0056] Step 304: When it is detected that the group identification information does not exist in the identification information table, it is determined that there is no second data to be written in the target cache.

[0057] Specifically, after executing step 303, if any set of identifier information is deleted from the identifier information table in step 303, the system continuously checks whether the identifier information table still contains group identifier information. If any set of identifier information exists in the identifier information table, it is determined that there is second data to be written in the target cache, thus indicating that the current batch data transmission has not been interrupted. If there is no group identifier information in the identifier information table, it is determined that there is no second data to be written in the target cache, thus indicating that the current batch data transmission has been interrupted. By continuously checking whether there is group identifier information in the identifier information table, it is possible to accurately determine whether there is second data to be written in the target cache, and thus accurately determine whether the current batch data transmission has been interrupted. When a transmission interruption occurs, the remaining data to be stored in the current batch data is identified as a new batch, so that comparison data can be re-acquired for the new batch data, thereby improving the accuracy and efficiency of data filtering.

[0058] Step 305: Do not delete the group identification information in the identification information table.

[0059] Specifically, after executing step 302, if it is determined in step 302 that not all data corresponding to the group identifier information has been written to the target database, then the group identifier information in the identifier information table is not deleted.
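Steps 301 to 305 amount to maintaining a set of in-flight group identifiers. A minimal sketch, assuming the identifier information table can be modeled as a Python set; the class `IdentifierTable` and its method names are hypothetical.

```python
from typing import Set


class IdentifierTable:
    """Tracks the group identifiers of data groups currently being transmitted."""

    def __init__(self) -> None:
        self._ids: Set[int] = set()

    def register(self, group_id: int) -> None:
        """Step 301: record the group a transmission channel has started transmitting."""
        self._ids.add(group_id)

    def mark_written(self, group_id: int) -> None:
        """Step 303: remove a group once all its data is in the target database."""
        self._ids.discard(group_id)

    def has_second_data(self) -> bool:
        """Step 304: an empty table means no second data remains in the target cache."""
        return bool(self._ids)

    def entries(self) -> Set[int]:
        return set(self._ids)


table = IdentifierTable()
for group_id in (1, 2, 3):      # channels a, b and c transmit groups 1, 2 and 3
    table.register(group_id)
table.mark_written(1)           # all data of group 1 has reached the database
after_first_delete = table.entries()
table.mark_written(2)
table.mark_written(3)
```

A group whose data has not been fully written is simply never passed to `mark_written`, which corresponds to step 305.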

[0060] Furthermore, the present invention provides a method for determining whether all data corresponding to group identifier information has been written to the target database. This method details step 302 of the embodiment shown in Figure 3, "determining whether all the data corresponding to each group of identifier information in the identifier information table has been written into the target database." As shown in Figure 4, the method includes:

[0061] Step 401: When writing the second data to be written from the target cache to the target database using each of the transmission channels, obtain the group identifier information of the data group transmitted in the transmission channel.

[0062] Specifically, in step 302, the group identifier information of the data groups being transmitted by each transmission channel is obtained, and the group identifier information of the data groups transmitted by each transmission channel at the same time is unique. Specifically, in this embodiment, for example, when using three transmission channels (a, b, and c) to transmit data groups in parallel, the group identifier information of the data groups being transmitted by transmission channels a, b, and c is obtained respectively. If the group identifier information of the data groups being transmitted by transmission channels a, b, and c is 1, 2, and 3 respectively, then the obtained group identifier information of the data groups being transmitted by transmission channels a, b, and c are 1, 2, and 3 respectively. By obtaining the group identifier information of the data groups transmitted by each transmission channel, it is possible to accurately determine whether the data transmission of each data group is complete, thereby accurately determining whether there is second data to be written in the target cache, thus ensuring the accuracy of data filtering and further improving the efficiency of data filtering.

[0063] Step 402: Determine whether the group identifier information of the data group transmitted in the transmission channel has changed.

[0064] Specifically, after executing step 401, based on the group identifier information of the data groups currently being transmitted by each transmission channel obtained in step 401, it is determined whether the group identifier information of the data groups transmitted by each transmission channel has changed. Specifically, taking the above embodiment as an example, when the group identifier information of the data groups being transmitted by transmission channels a, b, and c are 1, 2, and 3 respectively, it is determined whether the group identifier information of the data groups being transmitted by transmission channels a, b, and c has changed. When the obtained group identifier information of the data group being transmitted by transmission channel a is 4, or when there is no group identifier information in transmission channel a, it is determined that the group identifier information of the data group being transmitted by transmission channel a has changed. When the obtained group identifier information of the data group being transmitted by transmission channel a is 1, it is determined that the group identifier information of the data group being transmitted by transmission channel a has not changed. This process is repeated for the other two transmission channels, determining whether the group identifier information of the data groups being transmitted has changed. By judging whether the group identifier information of the data groups transmitted by each transmission channel has changed, it is possible to accurately determine whether the data of each data group has been transmitted. This allows for accurate determination of whether there is second data to be written in the target cache, thereby ensuring the accuracy of data filtering and further improving the efficiency of data filtering.

[0065] Step 403: Determine that all data corresponding to the group identification information mentioned above has been written into the target database.

[0066] Specifically, after executing step 402, if the determination result in step 402 is that the group identifier information of the data group transmitted by any transmission channel has changed, then it is determined that all the data of the data group corresponding to the previous group identifier information has been written to the target database. Taking the above embodiment as an example, when the group identifier information of the data group being transmitted by transmission channel a changes from 1 to 4, or from 1 to no group identifier information, it is determined that all the data of the data group corresponding to group identifier information 1 has been written to the target database.

[0067] Step 404: Determine that not all the data corresponding to the current group identifier information has been written to the target database.

[0068] Specifically, after executing step 402, if the determination result in step 402 is that the group identifier information of the data group transmitted by any transmission channel has not changed, then it is determined that the data of the data group corresponding to the group identifier information has not been completely written to the target database. Specifically, taking the above embodiment as an example, if the group identifier information of the data group being transmitted by transmission channel a is still 1, then it is determined that the data of the data group corresponding to group identifier information 1 has not been completely written to the target database.
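The change detection of steps 401 to 404 can be sketched by comparing two observations of what each channel is transmitting. The function name `completed_groups` and the dictionary representation (channel name mapped to group identifier, `None` when the channel carries no group) are assumptions for illustration.

```python
from typing import Dict, List, Optional


def completed_groups(previous: Dict[str, Optional[int]],
                     current: Dict[str, Optional[int]]) -> List[int]:
    """Return the group ids whose channel now reports a different id or none at all."""
    done = []
    for channel, old_id in previous.items():
        # Step 402: has the group id seen on this channel changed?
        if old_id is not None and current.get(channel) != old_id:
            done.append(old_id)   # step 403: the old group is fully written
        # Otherwise step 404 applies: the group is still being written.
    return done


previous = {"a": 1, "b": 2, "c": 3}
current = {"a": 4, "b": 2, "c": None}   # a moved on to group 4, c became idle
finished = completed_groups(previous, current)
```

Here groups 1 and 3 are reported as fully written, while group 2 is still in progress on channel b.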

[0069] Furthermore, embodiments of the present invention also provide a method for determining and acquiring comparison data. This method details step 102 of the embodiment shown in Figure 1, "obtaining comparison data from the target database before storing the same batch of data to be stored." As shown in Figure 5, the method includes:

[0070] Step 501: Determine whether there is second data to be written in the target cache.

[0071] Specifically, in step 102, before storing the same batch of data, it first checks whether there is second data to be written in the target cache. If data is detected in the target cache, it is determined that there is second data to be written, and it is also determined that the previous batch of data has not been completely stored. If no data is detected in the target cache, it is determined that there is no second data to be written, and it is also determined that the previous batch of data has been completely stored. By determining whether there is data in the target cache, comparison data for each batch of data can be accurately obtained, thereby improving data filtering efficiency and further ensuring the accuracy of data filtering.

[0072] Step 502: Use the data in the target database as the comparison data for the batch of data to be written to the target cache.

[0073] Specifically, after step 501 is executed, if step 501 determines that there is no second data to be written in the target cache, the data in the target database is updated, and the updated data in the target database is used as the comparison data for the batch of data to be written. This ensures that all data in any given batch is filtered against the same comparison data, so that the data of each batch participates in the filtering, which improves the efficiency of data filtering and ensures its accuracy.

[0074] Step 503: After all the second data to be written in the target cache has been written to the target database, the data in the target database is used as comparison data.

[0075] Specifically, after step 501 is executed, if step 501 determines that there is second data to be written in the target cache and that the previous batch of data has not been completely stored, the process waits until transmission of the previous batch is complete, i.e., until there is no second data to be written in the target cache. The data in the target database is then used as the comparison data for the batch to be written. This ensures that all data in any given batch is filtered against the same comparison data, so that the data of each batch participates in the filtering, which improves the efficiency of data filtering and ensures its accuracy.
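
Steps 501 to 503 can be sketched in Python as follows. This is only an illustration under stated assumptions: the target cache and target database are modelled as an in-memory list and set, and the names `acquire_comparison_data`, `cache_has_pending`, `read_database`, and `wait` are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def acquire_comparison_data(
    cache_has_pending: Callable[[], bool],
    read_database: Callable[[], Iterable[str]],
    wait: Callable[[], None],
) -> Set[str]:
    """Return a snapshot of the target database to use as comparison data."""
    # Steps 501 and 503: while second data to be written is still in the
    # target cache, the previous batch is unfinished, so keep waiting.
    while cache_has_pending():
        wait()
    # Step 502: the cache is empty, so the database reflects the previous batch.
    return set(read_database())


# Toy stand-ins for the target database and target cache (assumptions).
database: Set[str] = {"a"}
cache: List[str] = ["b", "c"]


def drain_one() -> None:
    # Simulates a transmission channel moving one record from cache to database.
    database.add(cache.pop(0))


comparison = acquire_comparison_data(lambda: bool(cache), lambda: database, drain_one)
```

Because the snapshot is taken only after the cache is empty, every record of the next batch is compared against the same data.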

[0076] Furthermore, embodiments of the present invention provide yet another data filtering method, which details step 103 of the embodiment shown in Figure 1, "filtering the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written." As shown in Figure 6, the method further includes:

[0077] Step 601: Determine if there are any filtering conditions.

[0078] Specifically, in step 103, a user's request to store data is received, and it is determined whether the received storage request contains filtering conditions. In this embodiment of the invention, checking for filtering conditions allows the data to be stored to be filtered according to the user's actual needs, which effectively ensures the accuracy of data filtering.

[0079] Step 602: Filter the multiple first data to be written again according to the filtering conditions.

[0080] Specifically, after executing step 601, the judgment result of step 601 is obtained. When the judgment result in step 601 indicates the existence of filtering conditions, multiple first data to be written are filtered again according to the user-configured filtering conditions to obtain data that meets the user's requirements. By using filtering conditions to filter multiple first data to be written again, the data can be accurately filtered, ensuring the accuracy of data filtering.

[0081] Step 603: Perform deduplication on multiple first data to be written based on the data in the target cache, and write the deduplicated data into the target cache to obtain second data to be written.

[0082] Specifically, this step is the same as step 104, and will not be repeated here.
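
A minimal Python sketch of steps 601 to 603 follows. It assumes that records are hashable values, that a filtering condition is a predicate supplied with the storage request, and that the target cache behaves like a set; the function name `refilter_and_deduplicate` is hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set


def refilter_and_deduplicate(
    first_data: Iterable[str],
    target_cache: Set[str],
    condition: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Apply an optional user condition, then deduplicate against the cache.

    Returns the records newly added to `target_cache`.
    """
    records = list(first_data)
    # Steps 601 and 602: filter again only if a filtering condition exists.
    if condition is not None:
        records = [r for r in records if condition(r)]
    # Step 603: write only records not already present in the target cache.
    newly_written: List[str] = []
    for record in records:
        if record not in target_cache:
            target_cache.add(record)
            newly_written.append(record)
    return newly_written


# First data to be written gathered from two channels; "x1" appears twice.
cache_with_condition: Set[str] = {"y2"}
with_condition = refilter_and_deduplicate(
    ["x1", "y2", "x1", "z3"], cache_with_condition, lambda r: not r.startswith("z")
)

cache_without_condition: Set[str] = set()
without_condition = refilter_and_deduplicate(["x1", "y2", "x1", "z3"], cache_without_condition)
```

In the first call the condition drops "z3" and the cache check drops "y2" and the repeated "x1"; in the second call only the repeated "x1" is dropped.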

[0083] Furthermore, embodiments of the present invention also provide a method for obtaining first data to be written, which details step 103 of the embodiment shown in Figure 1, "filtering the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written." As shown in Figure 7, the method includes:

[0084] Step 701: Determine whether data exists in the target database.

[0085] Specifically, after executing step 102, the target database is checked to determine whether data exists in it. If the target database is empty, it is determined that no data exists in the target database; if the target database is not empty, it is determined that data exists in the target database. Through this method, the comparison data can be accurately identified, thus enabling accurate data filtering and ensuring high accuracy in data filtering.

[0086] Step 702: Do not filter the data to be written to the target cache for each transmission channel.

[0087] Specifically, after step 701 is executed, if it is determined in step 701 that no data exists in the target database, there is no comparison data, so there is no need to filter the batch of data to be written. By checking whether data exists in the target database and identifying the case where it is empty, the steps of obtaining comparison data and filtering the batch of data to be written based on the comparison data can be omitted, thereby effectively improving filtering efficiency.

[0088] Step 703: Use the data in the target database as the comparison data, and filter the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written.

[0089] Specifically, this step is the same as step 103, and will not be repeated here.
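
Steps 701 to 703 can be sketched as below. The sketch assumes that "filtering" means removing records already present in the target database, which the text implies but does not state in this passage; the name `filter_against_database` and the dictionary-of-channels layout are hypothetical.

```python
from typing import Dict, List, Set


def filter_against_database(
    channel_data: Dict[str, List[str]], database: Set[str]
) -> Dict[str, List[str]]:
    """Return the first data to be written for each transmission channel."""
    # Steps 701 and 702: an empty database means there is no comparison data,
    # so the per-channel data is passed through unfiltered.
    if not database:
        return {channel: list(records) for channel, records in channel_data.items()}
    # Step 703: use the database contents as comparison data (assumed here to
    # mean that records already stored are dropped).
    comparison = set(database)
    return {
        channel: [r for r in records if r not in comparison]
        for channel, records in channel_data.items()
    }


channels = {"a": ["r1", "r2"], "b": ["r2", "r3"]}
unfiltered = filter_against_database(channels, set())
filtered = filter_against_database(channels, {"r2"})
```

Skipping the comparison when the database is empty avoids a per-record lookup that could never remove anything.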

[0090] Furthermore, as an implementation of the method embodiments shown in Figures 1-7 above, an embodiment of the present invention provides a data filtering device that can effectively improve data filtering efficiency and accuracy. This device embodiment corresponds to the foregoing method embodiments. For ease of reading, this embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be clear that the device in this embodiment can implement all the contents of the foregoing method embodiments. Specifically, as shown in Figure 8, the device includes:

[0091] The first partitioning module 10 is used to partition the data to be stored into batches, and the data in each batch is stored in parallel by multiple transmission channels.

[0092] The acquisition module 20 is used to acquire comparison data in the target database, according to the batches divided by the first partitioning module 10, before storing the data to be stored in the same batch.

[0093] The filtering module 30 is used to filter the data to be written to the target cache of each transmission channel based on the comparison data obtained by the acquisition module 20, so as to obtain multiple first data to be written.

[0094] The deduplication module 40 is used to deduplicate the multiple first data to be written obtained by the filtering module 30 based on the data in the target cache, and write the deduplicated data into the target cache to obtain the second data to be written.

[0095] The writing module 50 is used to transfer the second data to be written obtained by the deduplication module 40 to the target database through multiple transmission channels.
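
How modules 10 to 50 might cooperate can be sketched in Python. This is a toy model under several assumptions: records are strings, the target database and target cache are in-memory sets, each transmission channel is a worker thread handling one data group, and filtering means dropping records already in the database. The names `split_into_batches` and `store_batch` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Set


def split_into_batches(records: List[str], batch_size: int) -> List[List[str]]:
    # First partitioning module 10: divide the data to be stored into batches.
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def store_batch(groups: List[List[str]], database: Set[str]) -> int:
    """Store one batch, one data group per channel; return the records written."""
    # Acquisition module 20: one comparison snapshot for the whole batch.
    comparison = frozenset(database)
    target_cache: Set[str] = set()
    cache_lock = threading.Lock()

    def run_channel(group: List[str]) -> None:
        # Filtering module 30: drop records already in the comparison data.
        first_data = [r for r in group if r not in comparison]
        # Deduplication module 40: only one channel may add a given record.
        with cache_lock:
            for record in first_data:
                if record not in target_cache:
                    target_cache.add(record)

    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        list(pool.map(run_channel, groups))

    # Writing module 50: move the second data to be written into the database.
    written = len(target_cache)
    database.update(target_cache)
    target_cache.clear()
    return written


database: Set[str] = {"r1"}
batches = split_into_batches(["r1", "r2", "r2", "r3", "r4", "r1"], batch_size=4)
# Each batch is split into two data groups here purely for illustration.
written_counts = [store_batch([batch[:2], batch[2:]], database) for batch in batches]
```

The lock around the cache is a design choice of this sketch: it lets the channels filter in parallel while keeping the deduplication check and the cache write atomic.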

[0096] Furthermore, as shown in Figure 9, the device also includes a second partitioning module 60, which includes:

[0097] The first judgment unit 610 is used to determine whether there is second data to be written in the target cache before any batch of data is stored.

[0098] The partitioning unit 620 is used to determine the remaining data to be stored in the batch as a new batch when the first judgment unit 610 determines that there is no second data to be written in the target cache.

[0099] Furthermore, as shown in Figure 9, the first judgment unit 610 is further configured to: obtain group identification information of the data groups transmitted in each of the transmission channels, and store each group identification information in an identification information table, wherein the identification information table is pre-stored in the target cache, the group identification information is not present in the pre-stored identification information table, and the group identification information is used to identify different data groups in the same batch of data; determine whether all the data corresponding to each group identification information in the identification information table has been written to the target database; when it is determined that all the data corresponding to a group identification information has been written to the target database, delete the corresponding group identification information from the identification information table; and when it is detected that no group identification information is present in the identification information table, determine that there is no second data to be written in the target cache.

[0100] Furthermore, as shown in Figure 9, the first judgment unit 610 is further configured to: when writing the second data to be written from the target cache to the target database using each of the transmission channels, obtain the group identification information of the data group transmitted in the transmission channel; determine whether the group identification information of the data group transmitted in the transmission channel has changed; and if it has changed, determine that all the data corresponding to the previous group identification information has been written to the target database.
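
The identification information table used by the first judgment unit 610 can be modelled with a small Python class. The class name `IdentifierTable` and its method names are hypothetical, and the table is held in memory here rather than in a real cache.

```python
from typing import Set


class IdentifierTable:
    """Toy model of the identification information table kept in the target cache."""

    def __init__(self) -> None:
        # The pre-stored table starts without any group identification information.
        self._group_ids: Set[int] = set()

    def register(self, group_id: int) -> None:
        # Record the identifier of a data group that a channel has started to transmit.
        self._group_ids.add(group_id)

    def mark_written(self, group_id: int) -> None:
        # Delete the identifier once all data of that group is in the target database.
        self._group_ids.discard(group_id)

    def has_second_data_to_write(self) -> bool:
        # An empty table means there is no second data to be written in the cache.
        return bool(self._group_ids)


table = IdentifierTable()
for group_id in (1, 2, 3):
    table.register(group_id)

table.mark_written(1)
pending_after_one = table.has_second_data_to_write()

table.mark_written(2)
table.mark_written(3)
pending_after_all = table.has_second_data_to_write()
```

With this model, checking whether the previous batch is finished reduces to testing whether the table is empty.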

[0101] Furthermore, as shown in Figure 9, the acquisition module 20 also includes:

[0102] The second judgment unit 210 is used to determine whether there is second data to be written in the target cache;

[0103] The first acquisition unit 220 is used to use the data in the target database as the comparison data for the batch of data to be written to the target cache when the second judgment unit 210 determines that there is no second data to be written in the target cache.

[0104] The second acquisition unit 230 is used to, when the second judgment unit 210 determines that there is second data to be written in the target cache, use the data in the target database as comparison data after all the second data to be written in the target cache has been written to the target database.

[0105] Furthermore, as shown in Figure 9, the device also includes a re-filtering module 70, which includes:

[0106] The third judgment unit 710 is used to determine whether a filtering condition exists;

[0107] The re-filtering unit 720 is used to re-filter multiple first data to be written according to the filtering conditions when the third judgment unit 710 determines that there are filtering conditions.

[0108] Furthermore, as shown in Figure 9, the filtering module 30 also includes:

[0109] The fourth judgment unit 310 is used to determine whether data exists in the target database;

[0110] The first filtering unit 320 is used to use the data in the target database as the comparison data when the fourth judgment unit 310 determines that data exists in the target database, and to filter the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written.

[0111] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0112] It is understood that the relevant features in the above methods and apparatus can be referenced interchangeably. Furthermore, the terms "first," "second," etc., in the above embodiments are used to distinguish between embodiments and do not represent the superiority or inferiority of any particular embodiment.

[0113] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0114] The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used in conjunction with the teachings herein. The structure required to construct such systems is apparent from the above description. Furthermore, this invention is not directed to any particular programming language. It should be understood that the contents of the invention described herein can be implemented using various programming languages, and the above description of specific languages is for the purpose of disclosing the best mode of implementation of the invention.

[0115] In addition, the memory may include non-permanent memory in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM, and the memory includes at least one memory chip.

[0116] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0117] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0118] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0119] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0120] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0121] Memory may include non-persistent memory in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0122] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0123] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0124] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0125] The above are merely embodiments of the present invention and are not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the claims of the present invention.

Claims

1. A data filtering method, characterized in that the method includes: dividing the data to be stored into batches, wherein each batch of data is stored in parallel by multiple transmission channels, each batch of data is divided into at least one data group, each transmission channel transmits the data of a single data group at a time, and when multiple transmission channels are used at the same time, the data groups transmitted by different channels are different; before storing the same batch of data to be stored, obtaining comparison data in the target database, wherein, when the transmission of the current batch of data is interrupted, the remaining data to be stored in the current batch of data is determined as a new batch and comparison data is re-obtained for the new batch of data, so as to obtain comparison data for different batches of data, and the comparison data is obtained from the target database after all the second data to be written in the target cache has been written to the target database; filtering, based on the comparison data, the data to be written to the target cache of each transmission channel to obtain multiple first data to be written; deduplicating the multiple first data to be written based on the data in the target cache, and writing the deduplicated data into the target cache to obtain second data to be written; and transferring the second data to be written to the target database through multiple transmission channels.

2. The method according to claim 1, characterized in that, before obtaining comparison data from the target database, the method further includes: before any batch of data is stored, determining whether there is second data to be written in the target cache; and if it does not exist, determining the remaining data to be stored in that batch as a new batch and storing it.

3. The method according to claim 2, characterized in that determining, before any batch of data is stored, whether there is second data to be written in the target cache includes: obtaining group identifier information of the data groups transmitted in each of the transmission channels, and storing each group identifier information in an identifier information table, wherein the identifier information table is pre-stored in the target cache, the group identifier information is not present in the pre-stored identifier information table, and the group identifier information is used to identify different data groups in the same batch of data; determining whether all the data corresponding to each group identifier information in the identifier information table has been written into the target database; if so, deleting the corresponding group identifier information from the identifier information table; and when no group identifier information is found in the identifier information table, determining that there is no second data to be written in the target cache.

4. The method according to claim 3, characterized in that determining whether all the data corresponding to each group identifier information in the identifier information table has been written into the target database includes: when writing the second data to be written from the target cache to the target database using each of the transmission channels, obtaining the group identifier information of the data group transmitted in the transmission channel; determining whether the group identifier information of the data group transmitted in the transmission channel has changed; and if a change occurs, determining that all data corresponding to the previous group identifier information has been written into the target database.

5. The method according to claim 1, characterized in that obtaining comparison data from the target database before storing the same batch of data to be stored includes: determining whether there is second data to be written in the target cache; if it does not exist, using the data in the target database as the comparison data for the batch of data to be written to the target cache; and if it exists, using the data in the target database as the comparison data after all the second data to be written in the target cache has been written to the target database.

6. The method according to claim 1, characterized in that, after filtering the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written, the method further includes: determining whether a filtering condition exists; if it exists, filtering the multiple first data to be written again according to the filtering condition; and if it does not exist, deduplicating the multiple first data to be written according to the data in the target cache, and writing the deduplicated data to the target cache to obtain the second data to be written.

7. The method according to claim 1, characterized in that filtering the data to be written to the target cache of each transmission channel based on the comparison data to obtain multiple first data to be written includes: determining whether data exists in the target database; if it does not exist, not filtering the data to be written to the target cache of each transmission channel; and if it exists, using the data in the target database as the comparison data, and filtering the data to be written to the target cache of each transmission channel according to the comparison data to obtain multiple first data to be written.

8. A data filtering device, characterized in that the device includes: a partitioning module, used to divide the data to be stored into batches, wherein each batch of data is stored in parallel by multiple transmission channels, each batch of data is divided into at least one data group, each transmission channel transmits the data of a single data group at a time, and when multiple transmission channels are used at the same time, different channels transmit different data groups; an acquisition module, used to acquire comparison data in the target database before storing the same batch of data to be stored, wherein, when the transmission of the current batch of data is interrupted, the remaining data to be stored in the current batch of data is determined as a new batch and comparison data is reacquired for the new batch of data, so as to acquire comparison data for different batches of data, and the comparison data is acquired from the target database after all the second data to be written in the target cache has been written to the target database; a filtering module, used to filter the data to be written to the target cache of each transmission channel according to the comparison data, so as to obtain multiple first data to be written; a deduplication module, used to deduplicate the multiple first data to be written based on the data in the target cache, and write the deduplicated data into the target cache to obtain second data to be written; and a writing module, used to transfer the second data to be written to the target database through multiple transmission channels.

9. A terminal, characterized in that the terminal is used to run a program, wherein the program, when running, executes the data filtering method according to any one of claims 1-7.

10. A storage medium, characterized in that the storage medium is used to store a computer program, wherein the computer program, when running, controls the device where the storage medium is located to execute the data filtering method according to any one of claims 1-7.