Data cleaning method and apparatus
By employing data sharding and multi-threaded execution for data cleanup, this approach solves the problems of table locking and master-slave latency in Java systems, achieving an efficient and personalized data cleanup solution suitable for various application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING WODONG TIANJUN INFORMATION TECH CO LTD
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-19
AI Technical Summary
In Java systems, database cleanup tasks can easily lead to table locking and master-slave latency. Furthermore, each application scenario requires the independent development of a data cleanup program, resulting in poor reusability and an inability to support personalized cleanup configurations.
A data sharding mechanism is adopted, and a personalized solution is formed based on data cleaning components and configuration information. Data is split into multiple sub-tasks through custom cleaning methods, and a multi-threaded executor is used for cleaning to avoid table locking and master-slave latency.
It improves data cleaning efficiency in multiple application scenarios, reduces development workload, avoids table locking and master-slave latency, and supports personalized configuration.
Figure CN122240593A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a data cleaning method and apparatus.
Background Technology
[0002] In current Java systems, cleaning up massive amounts of data in databases is a common task. As business grows, the amount of data in a database can increase rapidly, leading to a decline in query and storage performance. Therefore, it is necessary to clean up relatively unimportant data promptly. Current technologies typically use a single SQL delete statement to perform centralized deletion of large amounts of data. This method is prone to table locking, causing service unavailability, and can also lead to long master-slave latency when the database is deployed in a master-slave architecture. Furthermore, current technologies require each application scenario to develop its own data cleaning program to perform data cleaning tasks. These programs have poor reusability and do not support customized cleaning configurations for different scenarios.
Summary of the Invention
[0003] In view of this, embodiments of the present invention provide a data cleaning method and apparatus that can form a personalized data cleaning scheme based on data cleaning components adapted to multiple application scenarios and configuration information for specific tasks, and that use a data sharding mechanism so that each cleaning operation touches only a small amount of data, avoiding table locking and master-slave latency.
[0004] To achieve the above objectives, according to one aspect of the present invention, a data cleaning method is provided.
[0005] The data cleaning method of this invention includes: determining the target strategy of the current data cleaning task from a variety of data cleaning strategies included in a pre-introduced data cleaning component; obtaining a custom cleaning method for the current data cleaning task formed by a code template of the target strategy provided by the data cleaning component; acquiring the configuration information and data cleaning scope of the current data cleaning task; determining the data to be cleaned for the current data cleaning task according to the configuration information and the data cleaning scope; splitting the data to be cleaned into fragmented data belonging to multiple subtasks according to the configuration information; and executing the custom cleaning method to clean the fragmented data of each subtask according to the configuration information.
[0006] Optionally, the data cleanup method is applied in the Spring framework; and the configuration information of the current data cleanup task is obtained, including: specifying prefix data in the configuration property annotation of the Spring framework, adding prefix data to the configuration information of the preset configuration file, so as to load the configuration information into the Spring framework.
[0007] Optionally, the configuration information includes: the data generation time interval for performing cleanup; the data cleanup scope includes: the value range of a specific field preset in the database table; and determining the data to be cleaned for the current data cleanup task based on the configuration information and the data cleanup scope, including: determining the initial data of the current data cleanup task from the database table based on the data generation time interval; and determining the data to be cleaned from the initial data based on the value range.
[0008] Optionally, the configuration information includes: the maximum amount of data to be cleaned in a single subtask; and splitting the data to be cleaned into fragments belonging to multiple subtasks according to the configuration information, including: forming multiple subtasks based on the data to be cleaned, wherein the amount of fragmented data in a single subtask does not exceed the maximum amount of data to be cleaned.
[0009] Optionally, the configuration information includes: the maximum amount of data to be cleaned in a single database table; and, based on the configuration information, executing a custom cleanup method to clean up the sharded data of each subtask, including: executing a custom cleanup method to clean up the sharded data of each subtask while ensuring that the amount of data in a single cleanup operation for a single database table does not exceed the maximum amount of data.
[0010] Optionally, the configuration information includes: the core number of threads in the thread pool and the maximum number of threads in the thread pool; and the method further includes: after obtaining the preset configuration information of the current data cleaning task, constructing a multi-threaded executor based on the core number of threads in the thread pool and the maximum number of threads in the thread pool in the configuration information, and using the multi-threaded executor to perform cleaning on the sharded data.
[0011] To achieve the above objectives, according to another aspect of the present invention, a data cleaning apparatus is provided.
[0012] The data cleaning apparatus of this invention includes: an entry unit, configured to determine the target strategy of the current data cleaning task from a variety of data cleaning strategies included in a pre-introduced data cleaning component, and obtain a custom cleaning method for the current data cleaning task formed by a code template of the target strategy provided by the data cleaning component; a sharding unit, configured to obtain the configuration information and data cleaning scope of the current data cleaning task, determine the data to be cleaned for the current data cleaning task according to the configuration information and the data cleaning scope, and split the data to be cleaned into shards belonging to multiple subtasks according to the configuration information; and a cleaning unit, configured to execute the custom cleaning method to clean the shards of each subtask according to the configuration information.
[0013] Optionally, the data cleanup method is applied in the Spring framework; and the sharding unit can be further used to: specify prefix data in the configuration property annotation of the Spring framework, and add prefix data to the configuration information of the preset configuration file to load the configuration information into the Spring framework.
[0014] Optionally, the configuration information includes: the data generation time interval for performing cleanup, the data cleanup scope including: the value range of a specific field preset in the database table; and the sharding unit can be further used to: determine the initial data of the current data cleanup task from the database table according to the data generation time interval; and determine the data to be cleaned from the initial data according to the value range.
[0015] Optionally, the configuration information includes: the maximum amount of data to be cleaned in a single subtask; and the sharding unit can be further used to: form multiple subtasks based on the data to be cleaned, wherein the amount of data in a single subtask does not exceed the maximum amount of data to be cleaned.
[0016] Optionally, the configuration information includes: the maximum amount of data to be cleaned in a single database table; and the cleanup unit can be further used to: clean up the sharded data of each subtask by executing a custom cleanup method, provided that the amount of data in a single cleanup operation for a single database table does not exceed the maximum amount of data.
[0017] Optionally, the configuration information includes: the core number of threads in the thread pool and the maximum number of threads in the thread pool; and the cleanup unit can be further used to: after obtaining the configuration information of the current data cleanup task, construct a multi-threaded executor based on the core number of threads in the thread pool and the maximum number of threads in the thread pool in the configuration information, and use the multi-threaded executor to perform cleanup on the sharded data.
[0018] To achieve the above objectives, according to another aspect of the present invention, an electronic device is provided.
[0019] An electronic device according to the present invention includes: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the data cleaning method provided by the present invention.
[0020] To achieve the above objectives, according to another aspect of the present invention, a computer-readable storage medium is provided.
[0021] The present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the data cleaning method provided by the present invention.
[0022] To achieve the above objectives, according to another aspect of the present invention, a computer program product is provided.
[0023] One computer program product of the present invention includes a computer program that, when executed by a processor, implements the data cleaning method provided by the present invention.
[0024] According to the technical solution of the present invention, the embodiments described above have the following advantages or beneficial effects:
[0025] First, the target strategy for the current data cleaning task is determined from the various data cleaning strategies included in the pre-introduced data cleaning component. A custom cleaning method for the current data cleaning task is then obtained, formed from the code template of the target strategy provided by the data cleaning component. Next, the pre-defined configuration information and data cleaning scope of the current data cleaning task are acquired. Based on the configuration information and data cleaning scope, the data to be cleaned for the current data cleaning task is determined, and the data to be cleaned is split into fragments belonging to multiple subtasks according to the configuration information. Finally, based on the configuration information, the custom cleaning method is executed to clean the fragments of data in each subtask. In this way, a personalized data cleaning solution is formed based on a data cleaning component adapted to multiple application scenarios and configuration information for specific tasks. Users can create personalized data cleaning solutions based on configuration information such as the maximum data volume for a single cleanup, the data generation time interval for cleanup, and the maximum data volume for a single subtask. They can also make appropriate adjustments to the code template provided by the general data cleaning component to form custom data cleaning logic. This achieves a reusable data cleaning solution applicable to multiple application scenarios, reducing the workload of data cleaning program development and improving data cleaning efficiency. In addition, in this embodiment of the invention, the data cleaning scheme based on the data cleaning component can perform sharding of the data to be cleaned according to the pre-configured maximum amount of data to be cleaned in a single subtask. When cleaning the sharded data, the amount of data in a single cleaning operation of a single database table is limited to no more than the pre-configured maximum amount of data to be cleaned in a single operation, thereby avoiding table locking and master-slave latency.
[0026] The further effects of the optional implementations described above will be explained below in conjunction with specific embodiments.
Attached Figure Description
[0027] The accompanying drawings are provided to better understand the invention and are not intended to unduly limit the scope of the invention. Wherein:
Figure 1 is a schematic diagram of the main steps of the data cleaning method in an embodiment of the present invention;
Figure 2 is a schematic diagram illustrating the specific execution steps of the data cleaning method in an embodiment of the present invention;
Figure 3 is a flowchart illustrating the data cleaning method in an embodiment of the present invention;
Figure 4 is a schematic diagram of the components of the data cleaning apparatus in an embodiment of the present invention;
Figure 5 is an exemplary system architecture diagram to which embodiments of the present invention can be applied;
Figure 6 is a schematic diagram of the structure of an electronic device used to implement the data cleaning method in the embodiments of the present invention.
Detailed Implementation
[0028] The following description, in conjunction with the accompanying drawings, illustrates exemplary embodiments of the present invention, including various details to aid understanding. These details should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0029] It should be noted that, unless otherwise specified, the embodiments of the present invention and the technical features thereof can be combined with each other.
[0030] Figure 1 is a schematic diagram of the main steps of the data cleaning method in an embodiment of the present invention.
[0031] As shown in Figure 1, the data cleaning method of this embodiment of the invention can be executed by a server, and the specific execution steps are as follows: Step S101: Determine the target strategy for the current data cleaning task from the various data cleaning strategies included in the pre-introduced data cleaning component, and obtain a custom cleaning method for the current data cleaning task based on the code template of the target strategy provided by the data cleaning component. In this embodiment of the invention, the data cleaning component can provide data cleaning strategies applicable to various scenarios, and an applicable target strategy can be adopted when executing the current data cleaning task. Subsequently, the data cleaning component can provide a code template for the target strategy, and the user can adjust or write code based on the code template to form a custom cleaning method adapted to the current data cleaning task.
[0032] Step S102: Obtain the preset configuration information and data cleanup scope of the current data cleanup task. Based on the configuration information and data cleanup scope, determine the data to be cleaned for the current data cleanup task. Divide the data to be cleaned into data fragments belonging to multiple subtasks. In this step, the server can locate the data to be cleaned in the database table based on the user-configured configuration information of the current data cleanup task and the data cleanup scope determined by the user based on the target strategy's code template. Then, based on the configuration information, the server fragments the data to be cleaned to avoid impacting system performance by deleting a large amount of data at once.
[0033] Step S103: Based on the configuration information, execute a custom cleanup method to clean up the fragmented data of each subtask. In this step, the server can execute a user-generated custom cleanup method to clean up the fragmented data.
[0034] By following the steps above, a personalized data cleaning solution based on general-purpose data cleaning components and configuration information can be formed. These general-purpose components can be reused across different data cleaning scenarios and tasks, thereby reducing the workload of data cleaning program development and improving data cleaning efficiency. The fragmented processing described above avoids the concentrated deletion of large amounts of data to be cleaned, preventing system performance from being affected by data cleaning transactions.
[0035] Figure 2 is a schematic diagram illustrating the specific execution steps of the data cleaning method in an embodiment of the present invention. Refer to Figure 2.
[0036] Step S201: Introduce the data cleansing component. In this step, the server introduces a general-purpose data cleansing component. This component contains data cleansing strategies applicable to multiple data cleansing scenarios, as well as code templates for implementing the corresponding strategies. For example, the component might include strategies for order scenarios, warehouse scenarios, and coupon distribution scenarios. It's understood that the component can also build corresponding data cleansing strategies based on the range of data volume to be cleaned. After introducing the component, users can select a data cleansing strategy that matches the data volume of the current cleansing task. The code templates above contain general logic such as data sharding, log collection, and execution status reporting, but do not include specific scenario-based business logic. This business logic can be determined by users in actual scenarios by adjusting the data cleansing component or writing a small amount of code.
[0037] In one exemplary embodiment, the user can input the identifier of the current data cleanup task based on the getName method provided by the data cleanup component, and use the identifier to indicate the relevant data involved in the execution of the current data cleanup task. The above identifier can also be mapped to a trace identifier (TraceID) to locate relevant logs, so as to quickly know the execution status of the current data cleanup task.
[0038] Step S202: Determine the target strategy. In this step, the server determines the target strategy and code template applicable to the current data cleaning task based on the user's selection. The user can generate the custom cleaning method and the data cleaning scope based on the code template, thereby realizing a customized data cleaning solution.
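The strategy selection in Step S202 can be pictured as a lookup in a registry of strategies keyed by scenario name. The following is only a minimal sketch under that assumption: `CleanupStrategy`, `StrategyRegistry`, and the scenario keys are hypothetical names that do not appear in the patent. The only behavior taken from the text is that a missing target strategy results in an exception.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical strategy contract: each strategy supplies the statement it would run per batch. */
interface CleanupStrategy {
    String deleteStatement();
}

/** Hypothetical registry that maps a scenario name (e.g. "order") to its strategy. */
final class StrategyRegistry {
    private final Map<String, CleanupStrategy> strategies;

    StrategyRegistry(Map<String, CleanupStrategy> strategies) {
        // Defensive copy so later changes to the caller's map do not affect the registry.
        this.strategies = new HashMap<>(strategies);
    }

    /** Returns the target strategy, or throws if none is registered under that name. */
    CleanupStrategy select(String scenario) {
        CleanupStrategy strategy = strategies.get(scenario);
        if (strategy == null) {
            throw new IllegalArgumentException("No cleanup strategy registered for: " + scenario);
        }
        return strategy;
    }
}
```

Keeping the lookup separate from the strategies themselves is what would let new scenarios be added without touching the scheduling code.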
[0039] Step S203: Determine the custom cleanup method. In this step, the server determines the custom cleanup method formed by the user based on the code template of the target strategy. For example, the custom cleanup method can be an SQL (Structured Query Language) delete statement containing database table names and conditional clauses.
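As an illustration of a custom cleanup method expressed as an SQL delete statement with a table name and conditional clauses, the sketch below assembles a parameterized statement. The class name, the column roles, and the example table are assumptions for illustration; the `LIMIT` clause on `DELETE` is MySQL-style syntax and is not available in every database.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that assembles a parameterized DELETE for one cleanup batch. */
final class DeleteStatementBuilder {
    // Only plain identifiers are accepted, because table and column names cannot be bound as parameters.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private DeleteStatementBuilder() {
    }

    static String build(String table, String idColumn, String timeColumn) {
        requireIdentifier(table);
        requireIdentifier(idColumn);
        requireIdentifier(timeColumn);
        // Placeholders, in order: interval start, interval end, min id, max id, per-operation limit.
        return "DELETE FROM " + table
                + " WHERE " + timeColumn + " >= ? AND " + timeColumn + " <= ?"
                + " AND " + idColumn + " BETWEEN ? AND ?"
                + " LIMIT ?";
    }

    private static void requireIdentifier(String name) {
        if (name == null || !IDENTIFIER.matcher(name).matches()) {
            throw new IllegalArgumentException("Not a plain SQL identifier: " + name);
        }
    }
}
```

For example, `DeleteStatementBuilder.build("t_order", "id", "created")` yields a statement whose five placeholders would be bound through a `PreparedStatement` at execution time.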
[0040] Step S204: Obtain configuration information. In practical applications, the server obtains the configuration information written by the user to implement a personalized data cleanup scheme. In one embodiment, the server can work on the Spring framework (a known Java development framework). Prefix data can be specified in the Spring framework's configuration property annotation (ConfigurationProperties). This prefix data is added to the configuration information in the preset configuration file, thereby loading the configuration information into the Spring framework for retrieval. The above steps utilize a prefix-based Spring configuration retrieval method, which binds the configuration information to a Java Bean (a concept in Spring) for subsequent retrieval.
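To make the prefix mechanism concrete, here is a plain-Java approximation of what prefix-based binding does: keys in a configuration file share a prefix, and the values under that prefix are copied into one configuration object. In a Spring Boot application this role would typically be played by a class annotated with `@ConfigurationProperties(prefix = "...")`. The sketch avoids the Spring dependency so it can run on its own. The prefix `clear.order` is a made-up example, and the field names follow the parameter names listed in the specific embodiment later in the description.

```java
import java.util.Properties;

/** Hypothetical configuration holder bound from properties that share a common prefix. */
final class CleanupConfig {
    final int limit;        // max rows per single operation on one table
    final int splitLimit;   // max rows per subtask
    final int corePoolSize; // core threads in the pool
    final int maxPoolSize;  // maximum threads in the pool

    private CleanupConfig(int limit, int splitLimit, int corePoolSize, int maxPoolSize) {
        this.limit = limit;
        this.splitLimit = splitLimit;
        this.corePoolSize = corePoolSize;
        this.maxPoolSize = maxPoolSize;
    }

    /** Reads "<prefix>.limit", "<prefix>.splitLimit", and so on from the given properties. */
    static CleanupConfig bind(Properties props, String prefix) {
        return new CleanupConfig(
                required(props, prefix, "limit"),
                required(props, prefix, "splitLimit"),
                required(props, prefix, "corePoolSize"),
                required(props, prefix, "maxPoolSize"));
    }

    private static int required(Properties props, String prefix, String name) {
        String key = prefix + "." + name;
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("Missing configuration key: " + key);
        }
        return Integer.parseInt(raw.trim());
    }
}
```

Because each task reads its own prefix, several cleanup tasks could keep independent settings in the same configuration file.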
[0041] As a preferred option, the above configuration information can include at least one of the following data: the maximum amount of data to be cleaned in a single subtask, the maximum amount of data to be cleaned in a single database table in a single operation, the data generation time interval for performing cleanup, and thread pool configuration data. These data amounts can be the size of the data storage space or the number of related records. Specifically, the maximum amount of data to be cleaned in a single subtask limits the maximum amount of data contained in a single subtask, and the maximum amount of data to be cleaned in a single database table operation limits the maximum amount of data involved in a single database table operation. These two sets of data prevent system performance risks caused by directly performing centralized processing on excessive data. The data generation time interval for performing cleanup limits the generation time range of related data, thereby initially determining the data to be cleaned. For example, this time interval can include: the maximum data generation time and the time span; that is, the maximum data generation time is determined as the end point of the time interval, and the maximum data generation time minus the time span is determined as the start point of the time interval. The thread pool configuration data limits the relevant configuration parameters of the thread pool in multi-threaded scenarios, such as the core number of threads in the thread pool and the maximum number of threads in the thread pool.
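The interval arithmetic described above (end point = maximum data generation time, start point = end point minus the time span) can be sketched as follows. The class name `CleanupWindow` is hypothetical, and treating both end points as inclusive is an assumption, since the text does not say whether the bounds are open or closed.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Hypothetical value object for the data generation time interval of a cleanup task. */
final class CleanupWindow {
    final LocalDateTime start;
    final LocalDateTime end;

    private CleanupWindow(LocalDateTime start, LocalDateTime end) {
        this.start = start;
        this.end = end;
    }

    /** end = maxGenerationTime; start = maxGenerationTime - span. */
    static CleanupWindow of(LocalDateTime maxGenerationTime, Duration span) {
        if (span.isNegative() || span.isZero()) {
            throw new IllegalArgumentException("span must be positive");
        }
        return new CleanupWindow(maxGenerationTime.minus(span), maxGenerationTime);
    }

    /** Assumption: both end points are inclusive. */
    boolean contains(LocalDateTime generatedAt) {
        return !generatedAt.isBefore(start) && !generatedAt.isAfter(end);
    }
}
```

With a maximum generation time of 2024-12-01T00:00 and a span of 30 days, the window starts at 2024-11-01T00:00.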
[0042] Step S205: Obtain the data cleanup scope. In this step, the user can write relevant code based on the code template to determine the custom, precise data cleanup scope. The server can obtain the above data cleanup scope to perform subsequent processing. For example, the above data cleanup scope may include the value range of a specific field preset in the database table, such as the maximum and minimum values of a specific field. The above specific field can be a primary key or other fields.
[0043] Step S206: Determine the data to be cleaned. In this step, the server determines the data to be cleaned for the current data cleanup task based on the previously obtained configuration information and data cleanup scope. Specifically, the server determines the initial data for the current data cleanup task from the database table based on the data generation time interval, and then determines the data to be cleaned from the initial data based on the value range of a specific field.
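The two-stage selection in Step S206 can be illustrated with an in-memory stand-in for a database table: first narrow by generation time, then by the value range of the specific field. In a real system both conditions would more likely be expressed in a SQL `WHERE` clause. `CandidateRow` and `CleanupSelector` are hypothetical names, and the specific field is assumed here to be a numeric id.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical in-memory row: an id (the "specific field") and a generation time. */
final class CandidateRow {
    final long id;
    final LocalDateTime createdAt;

    CandidateRow(long id, LocalDateTime createdAt) {
        this.id = id;
        this.createdAt = createdAt;
    }
}

final class CleanupSelector {
    private CleanupSelector() {
    }

    static List<CandidateRow> select(List<CandidateRow> table,
                                     LocalDateTime start, LocalDateTime end,
                                     long minId, long maxId) {
        // Stage 1: initial data, chosen by the data generation time interval.
        List<CandidateRow> initial = table.stream()
                .filter(r -> !r.createdAt.isBefore(start) && !r.createdAt.isAfter(end))
                .collect(Collectors.toList());
        // Stage 2: data to be cleaned, chosen from the initial data by the field's value range.
        return initial.stream()
                .filter(r -> r.id >= minId && r.id <= maxId)
                .collect(Collectors.toList());
    }
}
```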
[0044] Step S207: Perform data sharding on the data to be cleaned. In this step, the server splits the data to be cleaned into shards belonging to multiple subtasks according to the configuration information. The amount of data in a shard in a single subtask does not exceed the maximum amount of data to be cleaned in a single subtask as specified in the configuration information.
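One way to realize the sharding in Step S207, assuming the specific field is a non-negative numeric key, is to cut the id range into consecutive sub-ranges no wider than the per-subtask maximum. This is a sketch, not the patent's stated algorithm. `IdRange` and `Sharder` are hypothetical names, and bounding the width of the id range only bounds the row count from above (gaps in the ids simply make a shard smaller).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical inclusive id range handled by one subtask. */
final class IdRange {
    final long min;
    final long max;

    IdRange(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

final class Sharder {
    private Sharder() {
    }

    /** Splits [minId, maxId] into ranges covering at most splitLimit ids each (ids assumed non-negative). */
    static List<IdRange> split(long minId, long maxId, long splitLimit) {
        if (splitLimit <= 0) {
            throw new IllegalArgumentException("splitLimit must be positive");
        }
        List<IdRange> shards = new ArrayList<>();
        if (minId > maxId) {
            return shards; // nothing to clean
        }
        long lo = minId;
        while (true) {
            // The last shard may be shorter than splitLimit.
            long hi = (maxId - lo < splitLimit) ? maxId : lo + splitLimit - 1;
            shards.add(new IdRange(lo, hi));
            if (hi == maxId) {
                break;
            }
            lo = hi + 1;
        }
        return shards;
    }
}
```

For instance, splitting ids 1 to 10 with a limit of 4 gives the shards 1-4, 5-8, and 9-10.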
[0045] Step S208: Clean up sharded data. In this step, the server executes a custom cleanup method to clean up the sharded data of each subtask based on the configuration information. Specifically, the server executes the custom cleanup method to clean up the sharded data of each subtask, provided that the data volume of a single cleanup operation on a single database table does not exceed the maximum data volume limit for a single cleanup operation on a single database table specified in the configuration information.
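The per-operation cap in Step S208 is commonly implemented as a loop that deletes at most a fixed number of rows per statement and repeats until the shard is empty. The sketch below shows that loop against an abstract deleter. `BatchDeleter` and `BatchedCleaner` are hypothetical names. With JDBC, the deleter would typically wrap `PreparedStatement.executeUpdate()` on a limited delete statement and return its row count.

```java
/** Hypothetical callback: deletes at most `limit` rows of the current shard and returns how many it deleted. */
interface BatchDeleter {
    int deleteUpTo(int limit);
}

final class BatchedCleaner {
    private BatchedCleaner() {
    }

    /** Repeats small deletes until a batch comes back short, which signals the shard is exhausted. */
    static long cleanShard(BatchDeleter deleter, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        long total = 0;
        int deleted;
        do {
            deleted = deleter.deleteUpTo(limit);
            total += deleted;
        } while (deleted == limit);
        return total;
    }
}
```

Each statement then holds its locks only briefly and produces a small replication event, which is the property the text relies on to avoid table locking and master-slave latency.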
[0046] In one embodiment, the server can use multithreading to perform data cleanup. Specifically, after obtaining the configuration information for the current data cleanup task, the server can construct a multithreaded executor based on the core thread count and maximum thread count of the thread pool in the configuration information, and use the multithreaded executor to perform cleanup on the sharded data. In practical applications, the threads can be named after the target strategy, the thread pool can be built from general settings such as the maximum task waiting time and the maximum waiting queue length, and custom rejection mechanisms such as abort and discard are supported.
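A thread pool of this kind can be built with the standard `java.util.concurrent.ThreadPoolExecutor`. The sketch below is one possible construction: the class name, the `clear-` thread-name prefix, and the mapping of the "waiting queue" to a bounded `ArrayBlockingQueue` are assumptions. The keep-alive parameter is the idle-thread timeout of `ThreadPoolExecutor` and is not necessarily the same thing as the "maximum waiting time" mentioned in the text.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical factory for the multi-threaded executor used by a cleanup task. */
final class CleanupExecutors {
    private CleanupExecutors() {
    }

    static ThreadPoolExecutor create(String strategyName, int corePoolSize, int maxPoolSize,
                                     int queueCapacity, long keepAliveSeconds,
                                     RejectedExecutionHandler onReject) {
        AtomicInteger sequence = new AtomicInteger();
        // Threads are named after the target strategy so log lines can be traced back to it.
        ThreadFactory factory = runnable ->
                new Thread(runnable, "clear-" + strategyName + "-" + sequence.incrementAndGet());
        return new ThreadPoolExecutor(
                corePoolSize, maxPoolSize,
                keepAliveSeconds, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded waiting queue
                factory,
                onReject); // e.g. ThreadPoolExecutor.AbortPolicy or ThreadPoolExecutor.DiscardPolicy
    }
}
```

Passing `new ThreadPoolExecutor.AbortPolicy()` makes a rejected subtask throw, while `new ThreadPoolExecutor.DiscardPolicy()` silently drops it; which one fits depends on whether a skipped shard is acceptable.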
[0047] During data cleanup using multithreading, if a subtask executes successfully, the server automatically prints the number of data entries cleaned. If a subtask fails, the server automatically prints an execution error log, cancels tasks waiting in the current thread pool, displays a failure message, and automatically shuts down the thread pool. The main thread can block to wait for all subtasks to execute successfully or fail, and then returns the execution result.
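The success and failure handling described above can be sketched with an `ExecutorCompletionService`: the main thread blocks on completed subtasks, logs the count for each success, and on the first failure logs the error and shuts the pool down so queued subtasks are cancelled. `SubtaskRunner` is a hypothetical name, returning `-1` as the failure result is an assumption, and shutting the pool down after a fully successful run is also an assumption (the text only mentions shutdown on failure).

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;

/** Hypothetical runner that blocks until all subtasks succeed or one of them fails. */
final class SubtaskRunner {
    private SubtaskRunner() {
    }

    /** Returns the total number of rows cleaned, or -1 if any subtask failed. */
    static long runAll(ExecutorService pool, List<Callable<Long>> subtasks) throws InterruptedException {
        CompletionService<Long> completion = new ExecutorCompletionService<>(pool);
        for (Callable<Long> subtask : subtasks) {
            completion.submit(subtask);
        }
        long total = 0;
        try {
            for (int i = 0; i < subtasks.size(); i++) {
                long cleaned = completion.take().get(); // blocks until the next subtask finishes
                System.out.println("subtask cleaned " + cleaned + " rows");
                total += cleaned;
            }
            return total;
        } catch (ExecutionException e) {
            System.err.println("subtask failed: " + e.getCause());
            pool.shutdownNow(); // drops queued subtasks and interrupts running ones
            return -1;
        } finally {
            pool.shutdown(); // no effect if shutdownNow() already ran
        }
    }
}
```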
[0048] In the technical solution of this invention, a personalized data cleaning scheme is formed based on a data cleaning component adapted to multiple application scenarios and configuration information for specific tasks. Users can form a personalized data cleaning scheme according to configuration information such as the maximum data volume for a single cleaning, the data generation time interval for cleaning, and the maximum data volume for a single subtask. They can also make appropriate adjustments to the code template provided by the general data cleaning component to form custom data cleaning logic. This achieves a reusable data cleaning solution applicable to multiple application scenarios, reducing the workload of data cleaning program development and improving data cleaning efficiency. Furthermore, in this invention, the data cleaning scheme based on the data cleaning component can perform data sharding on the data to be cleaned according to the pre-configured maximum data volume for a single subtask. When cleaning sharded data, the data volume of a single cleaning operation on a single database table is limited to no more than the pre-configured maximum data volume for a single cleaning, thereby avoiding table locking and master-slave latency.
[0049] The following describes a specific embodiment of the present invention; see Figure 3.
[0050] In current Java systems, cleaning up massive amounts of data in databases is a common task. As business grows, the amount of data in a database can increase rapidly, leading to a decline in query and storage performance. Therefore, it's necessary to clean up relatively unimportant data promptly. Current technologies typically use a single SQL delete statement to perform centralized deletion of large amounts of data. This method is prone to table locking, causing service unavailability, and can also lead to long master-slave latency when the database is deployed in a master-slave architecture. Furthermore, current technologies require each application scenario to develop its own data cleaning program to perform data cleaning tasks. These programs have poor reusability and do not support customized cleaning configurations for different scenarios.
[0051] This embodiment uses Spring SPI (Service Provider Interface) for decoupled access, builds a standardized data cleanup template (i.e., code template), adopts the strategy pattern to customize data cleanup methods, and uses Spring prefix configuration to simply and efficiently import data cleanup configuration information. Finally, it completes a standardized execution process with multiple scheduling entry points and personalized cleanup tasks. The development and execution process is as follows.
[0052] The first step is to construct the scheduling task entry point. In this step, the data cleanup component JAR file (Java archive) is imported, and a class is created that inherits from the abstract scheduling task class `AbstractScheduleJob` and implements the abstract methods such as `getName`. For example, during task execution, a `TraceId` can be automatically added to the logs and the task execution time can be printed. If a business exception occurs during task execution, the business exception error code and error message will be printed; if an exception not recognized by the system occurs, stack trace information will be printed. Each log entry carries the return value of `getName` (the task identifier) as its prefix, so that the task execution status can be understood quickly.
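The patent names `AbstractScheduleJob` and `getName` but does not show their code, so the following is only a guess at how such a base class might look. Everything other than those two names is hypothetical: the `doExecute` hook, the `BusinessException` type, the use of a random UUID as trace identifier, and the log format.

```java
import java.util.UUID;

/** Hypothetical business exception carrying an error code. */
class BusinessException extends RuntimeException {
    final String code;

    BusinessException(String code, String message) {
        super(message);
        this.code = code;
    }
}

/** Sketch of a scheduling entry point; only the class name and getName come from the text. */
abstract class AbstractScheduleJob {
    /** Task identifier, used as the prefix of every log line. */
    public abstract String getName();

    /** Hypothetical hook holding the task body. */
    public abstract void doExecute() throws Exception;

    /** Runs the task with trace id, timing, and error logging; returns true on success. */
    public final boolean execute() {
        String prefix = "[" + getName() + "][traceId=" + UUID.randomUUID() + "] ";
        long startedAt = System.nanoTime();
        try {
            doExecute();
            return true;
        } catch (BusinessException e) {
            // Known business failure: print code and message only.
            System.err.println(prefix + "business error " + e.code + ": " + e.getMessage());
            return false;
        } catch (Exception e) {
            // Unrecognized failure: print the stack trace.
            System.err.println(prefix + "unexpected error");
            e.printStackTrace();
            return false;
        } finally {
            long elapsedMs = (System.nanoTime() - startedAt) / 1_000_000;
            System.out.println(prefix + "finished in " + elapsedMs + " ms");
        }
    }
}
```

A concrete job would then only need to supply `getName` and the task body, leaving logging and error handling to the base class.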
[0053] The second step is to personalize the scheduling execution. In this step, the user determines the data cleanup configuration information based on Spring Config. The configuration may include:
1) limit: the maximum amount of data cleaned from a single database table in a single operation;
2) splitLimit: the maximum amount of data to be cleaned in a single subtask;
3) clearMaxValue: the maximum generation time of the data to be cleaned, that is, the end point of the cleanup time interval;
4) range: the time span of the cleanup time interval;
5) corePoolSize: the core number of threads in the thread pool;
6) maxPoolSize: the maximum number of threads in the thread pool.
[0054] The third step is to develop the data cleansing logic. In this step, the server implements the data cleansing logic based on the ClearDataStrategyService interface, mainly by implementing the following three methods.
[0055] The first method, `queryClearDataConfig`, returns the configuration information for the current data cleanup task using Spring filters. The second method, `queryClearDataInfo`, queries the data cleanup scope, which can include the maximum and minimum values of specific fields. The third method, `createTask`, executes data cleanup based on a thread pool. Specifically, when this method is executed, it first determines the data to be cleaned based on the configuration information and the cleanup scope. Then, it creates a multi-threaded task, `AbstractUnionTask`, to submit to the thread pool for data cleanup and to execute custom cleanup methods.
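The three methods can be pictured as the interface below. The interface name and the method names come from the text; the parameter and return types, the `ClearDataConfig` and `ClearDataInfo` holders, and the demo implementation are assumptions. In particular, this sketch returns plain `Callable` subtasks rather than the `AbstractUnionTask` mentioned above, whose shape the patent does not describe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical holder for the configuration values a strategy needs. */
final class ClearDataConfig {
    final long splitLimit;

    ClearDataConfig(long splitLimit) {
        this.splitLimit = splitLimit;
    }
}

/** Hypothetical holder for the cleanup scope: min and max of the specific field. */
final class ClearDataInfo {
    final long minId;
    final long maxId;

    ClearDataInfo(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }
}

/** Method names follow the text; the signatures are assumed. */
interface ClearDataStrategyService {
    ClearDataConfig queryClearDataConfig();

    ClearDataInfo queryClearDataInfo();

    List<Callable<Long>> createTask(ClearDataConfig config, ClearDataInfo info);
}

/** Demo strategy with fixed values; a real one would query configuration and the database. */
final class DemoOrderStrategy implements ClearDataStrategyService {
    public ClearDataConfig queryClearDataConfig() {
        return new ClearDataConfig(1000);
    }

    public ClearDataInfo queryClearDataInfo() {
        return new ClearDataInfo(1, 2500);
    }

    public List<Callable<Long>> createTask(ClearDataConfig config, ClearDataInfo info) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (long lo = info.minId; lo <= info.maxId; lo += config.splitLimit) {
            final long from = lo;
            final long to = Math.min(info.maxId, lo + config.splitLimit - 1);
            // Placeholder body: reports the size of the id range instead of deleting rows.
            tasks.add(() -> to - from + 1);
        }
        return tasks;
    }
}
```

The returned subtasks would then be handed to the thread pool, one per shard.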
[0056] The fourth step is to perform data cleanup. In this step, the target strategy and its code template are first determined from the data cleanup strategies provided by the data cleanup component. If the target strategy cannot be obtained, an exception is thrown. Next, the configuration information of the current data cleanup task is obtained, and a multi-threaded executor, UnionExecutorService, is constructed. The thread is named after the target strategy, and the thread pool configuration is derived from the above configuration information, as well as common configurations such as the maximum waiting time and maximum waiting queue for thread pool tasks. Custom rejection mechanisms are supported. Afterward, the data cleanup scope of the current data cleanup task is queried, and the data to be cleaned is determined based on the configuration information. Finally, the data to be cleaned is split into sub-tasks, and the sub-tasks are submitted to the thread pool for data cleanup.
[0057] During data cleanup using multithreading, if a subtask executes successfully, the server automatically prints the number of data entries cleaned. If a subtask fails, the server automatically prints an execution error log, cancels tasks waiting in the current thread pool, displays a failure message, and automatically shuts down the thread pool. The main thread can block to wait for all subtasks to execute successfully or fail, and then returns the execution result.
[0058] In summary, the data cleanup component in this embodiment encapsulates a unified scheduling and execution entry point and a multi-threaded task executor, enabling users to quickly integrate and monitor task execution status in real time through rich log viewing features. Based on an object-oriented interface design, the data cleanup component only requires importing the JAR file, importing configurations, and implementing the cleanup logic to be used, resulting in loose coupling and high extensibility with business system code. Furthermore, the data cleanup component supports custom cleanup methods adapted to actual business needs and supports complex scenarios such as main thread blocking, exception thread handling, and multi-threaded execution result return.
[0059] It should be noted that the technical solutions of this invention, including the collection, updating, analysis, processing, use, transmission, and storage of user personal information, all comply with relevant laws and regulations, are used for legitimate purposes, and do not violate public order and good morals. Necessary measures are taken to prevent unauthorized access to user personal information data and to safeguard user personal information security, network security, and national security.
[0060] For ease of description, the foregoing method embodiments are described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, and some steps may actually be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential for implementing the present invention.
[0061] To facilitate better implementation of the above-described solutions of the embodiments of the present invention, related apparatus for implementing the above-described solutions is also provided below.
[0062] As shown in Figure 4, the data cleaning device 400 provided in this embodiment of the invention may include: an entry unit 401, a sharding unit 402, and a cleaning unit 403.
[0063] The entry unit 401 is used to determine the target strategy of the current data cleaning task from the various data cleaning strategies included in the pre-introduced data cleaning component, and obtain a custom cleaning method for the current data cleaning task formed by the code template of the target strategy provided by the data cleaning component; the sharding unit 402 is used to obtain the pre-set configuration information and data cleaning scope of the current data cleaning task, determine the data to be cleaned in the current data cleaning task according to the configuration information and data cleaning scope, and split the data to be cleaned into sharded data belonging to multiple sub-tasks according to the configuration information; the cleaning unit 403 is used to execute the custom cleaning method to clean the sharded data of each sub-task according to the configuration information.
[0064] In this embodiment of the invention, the data cleanup method is applied to the Spring framework; and the sharding unit 402 can be further used to: specify prefix data in the configuration property annotation of the Spring framework, and add prefix data to the configuration information of the preset configuration file, so as to load the configuration information into the Spring framework.
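In Spring Boot, this kind of prefix binding is normally declared with `@ConfigurationProperties(prefix = "...")` on a configuration class. To keep the example free of external dependencies, the sketch below only imitates the idea with `java.util.Properties`: keys in the configuration file carry a common prefix, and the loader reads only the keys under that prefix. The prefix `data.clear`, the key names, and the class `PrefixedConfigSketch` are hypothetical.

```java
import java.util.Properties;

// Dependency-free imitation of prefix-based configuration binding.
final class PrefixedConfigSketch {
    final int corePoolSize;
    final int maxPoolSize;
    final long maxPerSubtask;

    PrefixedConfigSketch(Properties props, String prefix) {
        this.corePoolSize = Integer.parseInt(required(props, prefix + ".core-pool-size"));
        this.maxPoolSize = Integer.parseInt(required(props, prefix + ".max-pool-size"));
        this.maxPerSubtask = Long.parseLong(required(props, prefix + ".max-per-subtask"));
    }

    // Fails with a clear message when a prefixed key is missing.
    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing configuration key: " + key);
        }
        return value.trim();
    }
}
```

Keys that do not start with the chosen prefix are ignored, which is what allows several cleanup tasks to keep separate settings in one configuration file.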
[0065] Preferably, the configuration information includes: the data generation time interval for performing cleanup; the data cleanup scope includes: the value range of a specific field preset in the database table; and the sharding unit 402 can be further used to: determine the initial data of the current data cleanup task from the database table according to the data generation time interval; and determine the data to be cleaned from the initial data according to the value range.
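The two-stage selection in this paragraph can be illustrated on an in-memory list. This is a sketch only: in a real system both conditions would more likely appear in the `WHERE` clause of a query. The `Row` type, the choice of `id` as the specific field, and the half-open time interval `[from, to)` are assumptions.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class ScopeFilterSketch {

    // Hypothetical table row: an id plus the time the row was generated.
    static final class Row {
        final long id;
        final Instant createdAt;
        Row(long id, Instant createdAt) { this.id = id; this.createdAt = createdAt; }
    }

    // Stage 1: initial data, selected by the data generation time interval [from, to).
    static List<Row> initialData(List<Row> table, Instant from, Instant to) {
        List<Row> result = new ArrayList<>();
        for (Row row : table) {
            if (!row.createdAt.isBefore(from) && row.createdAt.isBefore(to)) {
                result.add(row);
            }
        }
        return result;
    }

    // Stage 2: data to be cleaned, selected from the initial data by the field's value range.
    static List<Row> toBeCleaned(List<Row> initial, long minId, long maxId) {
        List<Row> result = new ArrayList<>();
        for (Row row : initial) {
            if (row.id >= minId && row.id <= maxId) {
                result.add(row);
            }
        }
        return result;
    }
}
```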
[0066] As a preferred embodiment, the configuration information includes: the maximum amount of data to be cleaned in a single subtask; and the sharding unit 402 can be further used to: form multiple subtasks based on the data to be cleaned, wherein the amount of data in a single subtask does not exceed the maximum amount of data to be cleaned.
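One straightforward way to honor the per-subtask maximum is to cut the value range of the field into consecutive sub-ranges. The sketch below assumes the field is a non-negative numeric id and that the range is inclusive; `ShardingSketch` and its parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardingSketch {

    // Splits the inclusive id range [minId, maxId] into consecutive inclusive
    // ranges that each cover at most maxPerSubtask ids. Ids are assumed non-negative.
    static List<long[]> split(long minId, long maxId, long maxPerSubtask) {
        if (maxPerSubtask <= 0) {
            throw new IllegalArgumentException("maxPerSubtask must be positive");
        }
        List<long[]> shards = new ArrayList<>();
        long from = minId;
        while (from <= maxId) {
            // Take the rest of the range if fewer than maxPerSubtask ids remain past `from`.
            long to = (maxId - from < maxPerSubtask) ? maxId : from + maxPerSubtask - 1;
            shards.add(new long[] {from, to});
            if (to == maxId) {
                break;
            }
            from = to + 1;
        }
        return shards;
    }
}
```

For example, `split(1, 25, 10)` yields the ranges 1 to 10, 11 to 20, and 21 to 25. Note that a range bounds the number of ids it covers, not the number of rows actually present, so a shard may contain fewer rows than the maximum when ids have gaps.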
[0067] In one embodiment, the configuration information includes: the maximum amount of data to be cleaned from a single database table in a single cleanup operation; and the cleanup unit 403 may be further used to: clean up the fragmented data of each subtask by executing a custom cleanup method, provided that the amount of data in a single cleanup operation for a single database table does not exceed the maximum amount of data.
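The per-operation limit can be pictured as a loop of bounded deletes inside each subtask. In the sketch below, a `TreeSet` of ids stands in for the database table, and `deleteBatch` stands in for a single bounded delete statement (for instance, a MySQL-style `DELETE ... LIMIT n`, which is an assumption, since the text does not name a database). All class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.NavigableSet;

final class BatchLimitSketch {

    // Stand-in for one bounded delete: removes at most `limit` ids in [from, to].
    static int deleteBatch(NavigableSet<Long> table, long from, long to, int limit) {
        int removed = 0;
        Iterator<Long> it = table.subSet(from, true, to, true).iterator();
        while (removed < limit && it.hasNext()) {
            it.next();
            it.remove();
            removed++;
        }
        return removed;
    }

    // Cleans one shard with repeated bounded deletes; returns the number of operations issued.
    static int cleanShard(NavigableSet<Long> table, long from, long to, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        int operations = 0;
        int removed;
        do {
            removed = deleteBatch(table, from, to, limit);
            operations++;
        } while (removed == limit); // a short batch means the shard is empty
        return operations;
    }
}
```

Keeping each operation small is what limits how long any single statement holds locks and how much change a replica has to replay at once, which is the effect the text attributes to this setting.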
[0068] Furthermore, in this embodiment of the invention, the configuration information includes: the core number of threads in the thread pool and the maximum number of threads in the thread pool; and the cleanup unit 403 can be further used to: after obtaining the preset configuration information of the current data cleanup task, construct a multi-threaded executor based on the core number of threads in the thread pool and the maximum number of threads in the thread pool in the configuration information, and use the multi-threaded executor to perform cleanup on the sharded data.
[0069] According to the technical solution of this invention, a personalized data cleaning scheme is formed based on a data cleaning component adapted to multiple application scenarios and configuration information for specific tasks. Users can form a personalized data cleaning scheme according to configuration information such as the maximum data volume for a single cleaning, the data generation time interval for cleaning, and the maximum data volume for a single subtask. They can also make appropriate adjustments to the code template provided by the general data cleaning component to form custom data cleaning logic. This achieves a reusable data cleaning solution applicable to multiple application scenarios, reducing the workload of data cleaning program development and improving data cleaning efficiency. Furthermore, in this embodiment of the invention, the data cleaning scheme based on the data cleaning component can perform data sharding on the data to be cleaned according to the pre-configured maximum data volume for a single subtask. When cleaning sharded data, the data volume of a single cleaning operation on a single database table is limited to no more than the pre-configured maximum data volume for a single cleaning, thereby avoiding table locking and master-slave latency.
[0070] Figure 5 shows an exemplary system architecture 500 to which the data cleaning method or data cleaning apparatus of the present invention can be applied.
[0071] As shown in Figure 5, system architecture 500 may include terminal devices 501, 502, and 503, network 504, and server 505 (this architecture is merely an example; the components included in a specific architecture may be adjusted according to the specific application). Network 504 serves as the medium for providing a communication link between terminal devices 501, 502, and 503 and server 505. Network 504 may include various connection types, such as wired or wireless communication links or fiber optic cables.
[0072] Users can use terminal devices 501, 502, and 503 to interact with server 505 via network 504 to receive or send messages, etc. Various client applications, such as data cleaning applications (for example only), can be installed on terminal devices 501, 502, and 503.
[0073] Terminal devices 501, 502, and 503 can be various electronic devices with displays that support web browsing, including but not limited to smartphones, tablets, laptops, and desktop computers.
[0074] Server 505 can be a server that provides various services, such as a backend server (for example only) that supports data cleanup applications operated by users using terminal devices 501, 502, and 503. The backend server can process received data cleanup requests and feed back the processing results (for example, data cleanup results) to terminal devices 501, 502, and 503.
[0075] It should be noted that the data cleaning method provided in this embodiment of the invention is generally executed by server 505, and correspondingly, the data cleaning device is generally located in server 505.
[0076] It should be understood that the number of terminal devices, networks, and servers shown in Figure 5 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0077] The present invention also provides an electronic device. The electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the data cleaning method provided by the present invention.
[0078] Referring now to Figure 6, a schematic diagram of the structure of a computer system 600 suitable for implementing an electronic device according to embodiments of the present invention is shown. The electronic device shown in Figure 6 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
[0079] As shown in Figure 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 602 or programs loaded from storage section 608 into random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the computer system 600. The CPU 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
[0080] The following components are connected to I/O interface 605: an input section 606 including a keyboard, mouse, etc.; an output section 607 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to I/O interface 605 as needed. A removable medium 611, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 610 as needed so that computer programs read from it can be installed into storage section 608 as needed.
[0081] In particular, according to the embodiments disclosed in this invention, the processes described in the above main step diagrams can be implemented as computer software programs. For example, embodiments of this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the main step diagrams. In the above embodiments, the computer program can be downloaded and installed from a network via communication section 609, and/or installed from removable medium 611. When the computer program is executed by central processing unit 601, it performs the functions defined in the system of this invention.
[0082] It should be noted that the computer-readable medium shown in this invention can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0083] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0084] The units described in the embodiments of the present invention can be implemented in software or hardware. The described units can also be located in a processor; for example, a processor can be described as including an entry unit, a sharding unit, and a cleanup unit. The names of these units do not necessarily limit the specific unit; for example, the entry unit can also be described as "a unit that provides a custom cleanup method to the cleanup unit."
[0085] In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments; or it may exist independently and not assembled into the device. The computer-readable medium carries one or more programs, and when the device executes the one or more programs, the steps performed by the device include: determining the target strategy of the current data cleaning task from a variety of data cleaning strategies included in a pre-introduced data cleaning component; obtaining a custom cleaning method for the current data cleaning task formed based on a code template of the target strategy provided by the data cleaning component; acquiring preset configuration information and data cleaning scope of the current data cleaning task; determining the data to be cleaned for the current data cleaning task according to the configuration information and the data cleaning scope; splitting the data to be cleaned into fragmented data belonging to multiple subtasks according to the configuration information; and executing the custom cleaning method to clean the fragmented data of each subtask according to the configuration information.
[0086] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the data cleaning method provided by the present invention.
[0087] In the technical solution of this invention, a personalized data cleaning scheme is formed based on a data cleaning component adapted to multiple application scenarios and configuration information for specific tasks. Users can form a personalized data cleaning scheme according to configuration information such as the maximum data volume for a single cleaning, the data generation time interval for cleaning, and the maximum data volume for a single subtask. They can also make appropriate adjustments to the code template provided by the general data cleaning component to form custom data cleaning logic. This achieves a reusable data cleaning solution applicable to multiple application scenarios, reducing the workload of data cleaning program development and improving data cleaning efficiency. Furthermore, in this invention, the data cleaning scheme based on the data cleaning component can perform data sharding on the data to be cleaned according to the pre-configured maximum data volume for a single subtask. When cleaning sharded data, the data volume of a single cleaning operation on a single database table is limited to no more than the pre-configured maximum data volume for a single cleaning, thereby avoiding table locking and master-slave latency.
[0088] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A data cleaning method, characterized by comprising: determining the target strategy for the current data cleaning task from the various data cleaning strategies included in the pre-introduced data cleaning component, and obtaining a custom cleaning method for the current data cleaning task formed based on the code template of the target strategy provided by the data cleaning component; obtaining the preset configuration information and data cleaning scope of the current data cleaning task, determining the data to be cleaned for the current data cleaning task based on the configuration information and the data cleaning scope, and splitting the data to be cleaned into fragmented data belonging to multiple subtasks according to the configuration information; and executing, based on the configuration information, the custom cleaning method to clean up the fragmented data of each subtask.
2. The data cleaning method according to claim 1, characterized in that the data cleanup method is applied in the Spring framework; and the step of obtaining the preset configuration information of the current data cleanup task includes: specifying prefix data in the configuration property annotation of the Spring framework, and adding the prefix data to the configuration information of the preset configuration file, so as to load the configuration information into the Spring framework.
3. The data cleaning method according to claim 1, characterized in that the configuration information includes: the data generation time interval for performing cleanup; the data cleanup scope includes: the value range of a specific field preset in the database table; and the step of determining the data to be cleaned for the current data cleanup task based on the configuration information and the data cleanup scope includes: determining the initial data for the current data cleanup task from the database table based on the data generation time interval; and determining the data to be cleaned from the initial data based on the value range.
4. The data cleaning method according to claim 1, characterized in that the configuration information includes: the maximum amount of data to be cleaned in a single subtask; and the step of splitting the data to be cleaned into fragmented data belonging to multiple subtasks according to the configuration information includes: forming multiple subtasks based on the data to be cleaned, wherein the amount of data in a single subtask does not exceed the maximum amount of data to be cleaned.
5. The data cleaning method according to claim 1, characterized in that the configuration information includes: the maximum amount of data to be cleaned in a single database table in a single operation; and the step of executing, based on the configuration information, the custom cleanup method to clean up the fragmented data of each subtask includes: executing the custom cleanup method to clean up the fragmented data of each subtask under the limitation that the amount of data in a single cleanup operation targeting a single database table does not exceed the maximum data amount.
6. The data cleaning method according to claim 1, characterized in that the configuration information includes: the core number of threads in the thread pool and the maximum number of threads in the thread pool; and the method further includes: after obtaining the preset configuration information of the current data cleaning task, constructing a multi-threaded executor based on the core thread count and maximum thread count of the thread pool in the configuration information, and using the multi-threaded executor to perform cleaning on the sharded data.
7. A data cleaning device, characterized by comprising: an entry unit, used to determine the target strategy of the current data cleaning task from the various data cleaning strategies included in the pre-introduced data cleaning component, and to obtain a custom cleaning method for the current data cleaning task formed based on the code template of the target strategy provided by the data cleaning component; a sharding unit, used to obtain the preset configuration information and data cleaning scope of the current data cleaning task, determine the data to be cleaned for the current data cleaning task according to the configuration information and the data cleaning scope, and split the data to be cleaned into fragmented data belonging to multiple subtasks according to the configuration information; and a cleaning unit, used to execute the custom cleaning method to clean up the fragmented data of each subtask based on the configuration information.
8. An electronic device, characterized by comprising: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1-6.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1-6.