Data processing method, apparatus and data management system

By automatically constructing sample datasets through a data management system, the problem of high dataset construction complexity in artificial intelligence model training is solved, and efficient and accurate dataset generation is achieved.

CN115270714B · Active · Publication Date: 2026-06-23 · AIBEE (BEIJING) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
AIBEE (BEIJING) TECH CO LTD
Filing Date
2022-08-05
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

In the process of training, testing or validating artificial intelligence models, building datasets requires a large amount of manual data screening, labeling and comparison, which is highly complex.

Method used

A data management system is provided, including a label management module, an annotation data management module, and a file management module. By storing label templates and label annotation datasets, it automatically constructs sample datasets, reducing the complexity of sample set construction.

Benefits of technology

The data management system automatically builds sample datasets, reducing manual operations, improving the efficiency and accuracy of dataset construction, and reducing complexity.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a data processing method and device and a data management system. The data management system stores at least one label template, a file set, and a label annotation data set associated with the label template. The method comprises the following steps: determining a target label template selected by a user; obtaining a sample generation requirement input by the user; determining a target number of target label annotation data from the label annotation data set associated with the target label template based on the sample generation requirement; constructing a target sample annotation set containing the target number of target label annotation data; and storing the target sample annotation set as a sample annotation set associated with the target label template, so that each target label annotation data in the target sample annotation set is associated with a target file and the target sample annotation set forms a sample set for model training, testing or verification. The application can improve the convenience of generating a sample set.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and more specifically, to a data processing method, apparatus and data management system.

Background Technology

[0002] In the development of artificial intelligence and related applications, large datasets are often required for model training, testing, or validation.

[0003] The datasets required for model training, testing, or validation consist of labeled files, such as labeled images, documents, or text files. The number of files, file types, and the required label structure in the dataset may vary depending on the model training scenario. Therefore, each time there is a need for model training, testing, or validation, the dataset must be manually constructed. Constructing a dataset requires extensive manual data filtering, labeling, and comparison, resulting in high complexity.

Summary of the Invention

[0004] This application provides a data processing method, apparatus, and data management system to reduce the complexity of generating the sample datasets required for model training, testing, and validation.

[0005] To achieve the above objectives, this application provides a data management system, comprising:

[0006] A tag management module, used to obtain and store at least one tag template, wherein the tag template defines the composition structure of the tag;

[0007] The annotation data management module is used to obtain and store at least one label annotation dataset, wherein the label annotation dataset is associated with a label template, and the label annotation dataset includes at least one piece of label annotation data constructed using the label template associated with the label annotation dataset;

[0008] The file management module is used to obtain and store a file set, wherein each file in the file set is associated with a tag annotation data from at least one tag annotation dataset.

[0009] In one possible implementation, the tag management module includes:

[0010] The template acquisition submodule is used to obtain the tag template created by the user, wherein the tag template defines the tag structure that makes up the tag;

[0011] The template storage submodule is used to store the obtained label templates into the template storage space set in the data management system.

[0012] In yet another possible implementation, the template acquisition submodule includes:

[0013] The UI display submodule is used to detect tag addition requests and display the tag addition interface.

[0014] The template creation subunit is used to obtain at least one tag item configured by the user in the tag adding interface to form a tag and the data format corresponding to each tag item, so as to obtain the created tag template, wherein the tag template includes the at least one tag item and the data format corresponding to each tag item.

[0015] In yet another possible implementation, the annotation data management module includes:

[0016] The template determination submodule is used to determine the candidate tag templates selected by the user to add annotation data. The candidate tag templates belong to the tag templates stored in the data management system.

[0017] The data acquisition submodule is used to acquire at least one tag annotation data to be added to the candidate tag template;

[0018] The data storage submodule is used to store each tag annotation data to be added to the candidate tag template in the tag annotation dataset associated with the candidate tag template if the tag annotation data conforms to the composition structure of the tag defined by the candidate tag template.

[0019] In another possible implementation, the tag template maintained by the tag management module defines at least one tag item that makes up the tag and the data format corresponding to each tag item;

[0020] Specifically, the data storage submodule is used to store each tag annotation data to be added to the candidate tag template in the tag annotation dataset associated with the candidate tag template if all the tag items in the tag annotation data belong to the tag items defined in the candidate tag template, and the data format corresponding to the value of the tag item in the tag annotation data matches the data format of the corresponding tag item in the candidate tag template.
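The storage condition described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: it assumes a flat tag template represented as a mapping from tag item name to a Python type standing in for the "data format", and the names PERSON_TEMPLATE, conforms and store_if_conforming are hypothetical.

```python
from typing import Any

# Hypothetical flat template: tag item name -> required Python type ("data format").
PERSON_TEMPLATE = {"gender": str, "age": int}


def conforms(annotation: dict, template: dict) -> bool:
    """True if every tag item is defined in the template and its value has the required format."""
    for item, value in annotation.items():
        if item not in template:
            return False  # tag item is not defined in the candidate template
        expected = template[item]
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False  # value does not match the item's data format
    return True


def store_if_conforming(annotation: dict, template: dict, dataset: list) -> bool:
    """Append the annotation to the template's dataset only when it conforms."""
    if conforms(annotation, template):
        dataset.append(annotation)
        return True
    return False


dataset: list = []
store_if_conforming({"gender": "male", "age": 35}, PERSON_TEMPLATE, dataset)    # stored
store_if_conforming({"gender": "male", "age": "35"}, PERSON_TEMPLATE, dataset)  # rejected: wrong format
```

As in the text, the check only requires that the tag items present in the annotation data belong to the template; whether every template item must also be present is left open here.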

[0021] Furthermore, this application also provides a data processing method, including:

[0022] Determine the target tag template selected by the user, wherein the target tag template belongs to at least one tag template stored in the data management system, and the tag template defines the composition structure of the tag;

[0023] Obtain the sample generation requirements input by the user, wherein the sample generation requirements include at least the target number of samples to be generated;

[0024] Based on the sample generation requirements, the target number of target label annotation data are determined from the label annotation dataset associated with the target label template in the data management system. The label annotation dataset associated with the target label template includes at least one label annotation data constructed using the target label template.

[0025] Construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as a sample annotation set associated with the target label template, so that the target files associated with each target label annotation data in the target sample annotation set in the data management system and the target sample annotation set together form a sample set for model training, testing or verification.

[0026] In another possible implementation, after storing the sample annotation set as the sample annotation set associated with the target label template, the method further includes:

[0027] The system receives a request from the terminal to obtain the target sample annotation set, retrieves each target file associated with the target label annotation data in the target sample annotation set from the file set, and obtains a file sample set containing the target files.

[0028] The file sample set and the target sample annotation set are sent as a sample set to the terminal.

[0029] In another possible implementation, the method further includes:

[0030] Obtain an annotation data change request, wherein the annotation data change request is used to request changes to the label annotation data in the label annotation dataset associated with the target label template;

[0031] Based on the annotation data change request, the label annotation data to be changed in the label annotation dataset associated with the target label template is modified.

[0032] In another possible implementation, after modifying the label annotation data to be changed in the label annotation dataset associated with the target label template, the method further includes:

[0033] Obtain a sample set update request, the sample set update request being used to request an update of the target label annotation data in the target sample annotation set associated with the target label template;

[0034] For each target label annotation data in the target sample annotation set, if there is a change in the target label annotation data in the label annotation dataset associated with the target label template, the target label annotation data in the target sample annotation set is modified based on the change method of the target label annotation data in the label annotation dataset associated with the target label template.
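One way to picture this update step is the sketch below. It assumes that both the label annotation dataset and the stored sample annotation set are mappings keyed by a hypothetical annotation id, and that the "change method" is either a modification or a deletion; the text does not fix these details, and refresh_sample_set is an illustrative name.

```python
import copy


def refresh_sample_set(sample_set: dict, dataset: dict) -> list:
    """Apply to the sample annotation set the changes already made in the dataset.

    Returns the ids of the sample entries that were changed.
    """
    changed = []
    for ann_id in list(sample_set):
        if ann_id not in dataset:
            # Assumed handling: the entry was deleted from the dataset, so delete it here too.
            del sample_set[ann_id]
            changed.append(ann_id)
        elif sample_set[ann_id] != dataset[ann_id]:
            # The entry was modified in the dataset: copy the modified value.
            sample_set[ann_id] = copy.deepcopy(dataset[ann_id])
            changed.append(ann_id)
    return changed


dataset = {"a1": {"gender": "male", "age": 36}, "a2": {"gender": "female", "age": 28}}
sample_set = {
    "a1": {"gender": "male", "age": 35},    # outdated copy
    "a2": {"gender": "female", "age": 28},  # unchanged
    "a3": {"gender": "male", "age": 50},    # no longer in the dataset
}
changed_ids = refresh_sample_set(sample_set, dataset)
```

Entries without a corresponding change in the dataset are left untouched, which matches the per-entry condition in the paragraph above.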

[0035] In another possible implementation, at the same time as or after modifying the label annotation data to be changed in the label annotation dataset associated with the target label template, the method further includes:

[0036] Record, in the historical data record, the label annotation data to be changed as it was before the change processing in the label annotation dataset associated with the target label template;

[0037] After modifying the target label annotation data in the target sample annotation set, the method further includes:

[0038] A snapshot of the target sample annotation set is stored in the historical data record.
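The historical data record can be illustrated with a simple append-only list. This is a sketch under stated assumptions: the record layout, the "kind" values and the function names change_annotation and snapshot_sample_set are hypothetical and are not specified by this application.

```python
import copy
from datetime import datetime, timezone


def change_annotation(dataset: dict, ann_id: str, new_value: dict, history: list) -> None:
    """Record the pre-change annotation in the history, then apply the change."""
    history.append({
        "kind": "annotation_before_change",
        "id": ann_id,
        "value": copy.deepcopy(dataset[ann_id]),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    dataset[ann_id] = new_value


def snapshot_sample_set(name: str, sample_set: dict, history: list) -> None:
    """Store an independent copy (snapshot) of a sample annotation set in the history."""
    history.append({
        "kind": "sample_set_snapshot",
        "name": name,
        "value": copy.deepcopy(sample_set),
        "at": datetime.now(timezone.utc).isoformat(),
    })


history: list = []
dataset = {"a1": {"gender": "male", "age": 35}}
change_annotation(dataset, "a1", {"gender": "male", "age": 36}, history)

sample_set = {"a1": {"gender": "male", "age": 36}}
snapshot_sample_set("demo-set", sample_set, history)
```

Deep copies are used so that later edits to the dataset or the sample annotation set do not alter what was recorded.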

[0039] In another aspect, this application also provides a data processing apparatus, the apparatus comprising:

[0040] The template determination unit is used to determine the target tag template selected by the user. The target tag template belongs to at least one tag template stored in the data management system, and the tag template defines the composition structure of the tag.

[0041] The requirement acquisition unit is used to acquire the sample generation requirement input by the user, wherein the sample generation requirement includes at least the target number of samples to be generated;

[0042] The labeling determination unit is used to determine the target number of target labeling data from the labeling dataset associated with the target label template in the data management system based on the sample generation requirements. The labeling dataset associated with the target label template includes at least one labeling data constructed using the target label template.

[0043] The annotation set generation unit is used to construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as a sample annotation set associated with the target label template, so that the target files associated with each target label annotation data in the target sample annotation set in the data management system and the target sample annotation set together form a sample set for model training, testing or verification.

[0044] As can be seen from the above scheme, this application stores tag templates pre-built by users through a data management system. Each tag template can be associated with tag annotation data using that template. Furthermore, the data management system can store a file set, where each file is associated with a single tag annotation data entry. Based on this, users only need to select the desired tag template according to their tag type requirements and input sample generation requirements. This application can then determine the tag annotation data constituting the sample annotation set from the tag annotation data associated with the user-selected tag template, allowing the sample annotation set and the files associated with it in the file set to be combined into a sample set, thereby reducing the complexity of constructing the sample set.

Attached Figure Description

[0045] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0046] Figure 1 is a schematic flowchart of a data processing method provided in an embodiment of this application;

[0047] Figure 2 is a schematic diagram of a process for storing data in a data management system in the data processing method provided in an embodiment of this application;

[0048] Figure 3 is a schematic diagram of the tag system interface in the data management system provided in an embodiment of this application;

[0049] Figure 4 is a schematic diagram of the labeled data interface of the data management system provided in an embodiment of this application;

[0050] Figure 5 is another flowchart illustrating the data processing method provided in an embodiment of this application;

[0051] Figure 6 is a schematic flowchart of a data processing method provided in an embodiment of this application;

[0052] Figure 7 is a schematic diagram of the composition architecture of a data management system provided in an embodiment of this application.

[0053] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings are used to distinguish similar parts and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented in a sequence other than that illustrated herein.

Detailed Implementation

[0054] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0055] In the solution of this application, a data management system is used to maintain label templates, label annotation data and files. The data management system can also manage and maintain sample sets used for model training, testing or validation.

[0056] To facilitate understanding of the solution proposed in this application, the data management system of this application will be introduced first.

[0057] In this application, the data management system can obtain and store at least one label template, a file set, and at least one label annotation dataset.

[0058] The tag template defines the structure of a tag. It is not the tag data used to annotate a file, but rather a description of the compositional characteristics of a tag. For example, the tag template can define the individual tag items a tag must contain, the relationships between tag items, and the supported numeric types or data formats for each tag item.

[0059] For example, a tag template can be a template used to describe the tag "person". The tag template can be {person: {gender; age;}}. This tag template indicates that the tag is a tag for people, and the main tag item is "person". The main tag item can include two sub-tag items: "gender" and "age".

[0060] A label annotation dataset is associated with a label template. In other words, a label annotation dataset is a collection of label annotation data added under a specific label template. Correspondingly, each piece of label annotation data in the label annotation dataset is label annotation data using the label template associated with that label annotation dataset. A label annotation dataset may include one or more pieces of label annotation data constructed based on a certain label template.

[0061] Tag annotation data can also be simply referred to as tags. Tag annotation data using a tag template refers to tags constructed according to the tag composition structure defined by that tag template. A tag is a data format used to describe the characteristics of a business entity.

[0062] For example, if a label template defines the label items that make up a label, then the label annotation data contains each label item in the label template and the corresponding value of that label item. For example, with the label template above as {person: {gender; age;}}, the label annotation data using this label template could be {person: {gender: male; age: 35;}}.
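For illustration only, the "person" template and a piece of label annotation data built from it can be written as nested mappings. The representation below is an assumption (this application does not prescribe a storage format): a leaf label item is marked with None in the template, and same_structure is a hypothetical helper that checks that the annotation data contains exactly the label items laid out by the template.

```python
# Template: a label item maps to its sub-items, or to None when it is a leaf.
person_template = {"person": {"gender": None, "age": None}}

# Label annotation data: the same structure, with a value for each leaf item.
person_annotation = {"person": {"gender": "male", "age": 35}}


def same_structure(template, annotation) -> bool:
    """True if the annotation has exactly the label items defined by the template."""
    if template is None:
        # Leaf item: any plain (non-mapping) value is accepted in this sketch.
        return not isinstance(annotation, dict)
    if not isinstance(annotation, dict) or template.keys() != annotation.keys():
        return False
    return all(same_structure(template[k], annotation[k]) for k in template)


is_valid = same_structure(person_template, person_annotation)
```

Checking the data format of each value is a separate concern and is not covered by this structural comparison.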

[0063] It can be understood that the label template can define the data format of each label item that makes up the label. Therefore, the data format of the label item values in the label annotation data needs to match the data format of the corresponding label item in the label template. Of course, other characteristics of the label can also be defined in the label template, and the corresponding label annotation data also needs to have the corresponding characteristics, which will not be elaborated further.

[0064] Each file in the file set is associated with a single labeling record in at least one labeling dataset stored in the data management system.

[0065] The files in the file set can be documents, images, text, or videos, etc., without any restrictions.

[0066] The tagging data associated with a file is the tagging data used to annotate that file. For example, if the file is an image of a user, then that file can be associated with tagging data for the person shown in it, and this tagging data can include specific information such as the person's gender and age.

[0067] In this application, the data management system can be based on a microservice architecture and deployed in a server cluster or cloud server, etc., without any restrictions.

[0068] Based on the above, the data processing method of this application is explained in conjunction with a flowchart.

[0069] Figure 1 shows a flowchart of a data processing method according to this application. The method of this embodiment is applied to a data management system; it can also be applied to computer devices outside the data management system, such as servers outside the data management system or system platforms composed of computer devices, etc. The method of this embodiment may include:

[0070] S101, Determine the first tag template selected by the user.

[0071] The first tag template belongs to at least one tag template stored in the data management system.

[0072] The first tag template is the tag template selected by the user to build the sample set. For example, after detecting a sample set construction request, the first tag template selected by the user is determined.

[0073] It is understood that, in order to distinguish it from the tag templates that need to be modified later, the tag template selected by the user from the data management system is referred to as the first tag template in this application. In this application, the first tag template selected by the user can also be referred to as the target tag template selected by the user.

[0074] In one possible implementation, after a sample construction request is detected, a tag operation interface can be displayed. This tag operation interface can display information about tag templates stored in the data management system, and correspondingly, the tag template selected by the user in the tag operation interface can be obtained.

[0075] For example, the tag operation interface may include a template preview area, which can display at least some of the tag templates stored in the data management system. Users can adjust the tag templates displayed in the template preview area by dragging or flipping pages, and finally select the tag template required to generate the sample set. Alternatively, the template preview area may also include a template search bar, in which users can enter the serial number or keywords of the required tag template, and then select the tag template from the search results.

[0076] S102, Obtain the sample generation requirements input by the user.

[0077] The sample generation requirement describes the characteristics of the sample set that the user needs to generate. In this application, the sample generation requirement includes at least the target number of samples to be generated.

[0078] In one possible implementation, the sample generation requirement may also include the selection method of the selected data, such as random selection or sequential selection.

[0079] Of course, the sample generation requirements can also include the conditions that the label annotation data must meet. For example, the sample generation requirements may include requirements regarding the generation time of the label annotation data, and there are no restrictions on this.

[0080] It is understandable that the sample generation requirement input by the user can be obtained either before or after the first tag template is selected, without any restriction.

[0081] For example, in one possible scenario, the sample generation requirement can be carried within the sample set construction request. For instance, taking the data management method applied to a data management system as an example, the main interface displayed by the data management system could include a sample set construction option. After detecting that the user clicks on this option, the sample set construction interface is displayed. Of course, if this method is applied to computer devices outside of a data management system, the computer device can also display the sample set construction interface.

[0082] Users can input their sample generation requirements for the desired sample set on the sample set construction interface. These requirements may include the name of the sample set, the target number of samples to be generated, and other descriptive information. Once the user clicks the "Requirement Generation" option on the sample set construction interface, a sample set construction request is confirmed. This request can include the user-inputted sample generation requirements. Based on this, the data management system or other computer devices can obtain the sample generation requirements and display a tagging interface for the user to select the tag templates needed for the constructed sample set.

[0083] S103, based on the sample generation requirements, determine the target number of target label annotation data from the label annotation dataset associated with the first label template.

[0084] For ease of distinction, the label annotation data selected from the label annotation dataset associated with the first label template is called the target label annotation data.

[0085] For example, if the sample generation requirements do not specify the selection method for the label annotation data, the target number of label annotation data can be randomly selected from the label annotation dataset associated with the first label template.

[0086] For example, if the sample generation requirement specifies the sequential selection of label annotation data, then the target label annotation data with the earlier sequence can be selected according to the order of each label annotation data in the label annotation dataset associated with the first label template.

[0087] Of course, when the sample generation requirements include other requirements, the target label data can be selected based on the specific requirements without any restrictions.
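The selection in S103 can be sketched as a single function covering the cases described above. The function name select_annotations, its parameters, the "created" field and the choice to raise an error when too few entries are available are all assumptions made for illustration; the text leaves these details open.

```python
import random


def select_annotations(dataset, target_number, method="random", predicate=None, seed=None):
    """Pick target_number label annotation entries from the dataset.

    method: "random" (the default when the requirement names no method) or "sequential".
    predicate: optional extra condition from the sample generation requirement.
    """
    candidates = [a for a in dataset if predicate is None or predicate(a)]
    if target_number > len(candidates):
        # Assumed behavior: fail rather than return fewer entries than requested.
        raise ValueError("not enough label annotation data to meet the target number")
    if method == "sequential":
        return candidates[:target_number]  # entries that come first in dataset order
    if method == "random":
        return random.Random(seed).sample(candidates, target_number)
    raise ValueError(f"unknown selection method: {method}")


# Hypothetical dataset: each entry carries an id and a generation year.
data = [{"id": i, "created": 2020 + i} for i in range(5)]

first_two = select_annotations(data, 2, method="sequential")
random_three = select_annotations(data, 3, seed=0)
recent = select_annotations(data, 1, method="sequential",
                            predicate=lambda a: a["created"] >= 2023)
```

The predicate stands in for conditions such as the generation-time requirement mentioned earlier; any other condition could be passed the same way.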

[0088] It is understood that the label annotation dataset associated with the first label template includes at least one label annotation data constructed using the target label template.

[0089] S104, construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as the sample annotation set associated with the first label template, so that the target files associated with the target label annotation data in the target sample annotation set in the data management system and the target sample annotation set form a sample set for model training, testing or verification.

[0090] It is understandable that the target label annotation data selected from the label annotation dataset associated with the first label template is actually the label annotation data required for model training, testing, or validation.

[0091] Because there are relationships between labeled data and files in the data management system, once the labeled data is determined, the associated files are also determined. Therefore, when constructing the target sample label set, the files used for model training, validation, or testing can actually be determined.

[0092] In this application, after generating a sample annotation set containing the label annotation data associated with the first label template, the sample annotation set is stored in association with the first label template, such as in a data management system or a computer device, so that the data management system or the computer device can manage and maintain the sample annotation set, thereby allowing users to perform operations such as querying, modifying and downloading the sample annotation set from time to time through the data management system or the computer device.

[0093] Understandably, after generating the sample annotation set, model training, testing, and validation can be performed based on this set and its associated files. Furthermore, users with access permissions can query the generated sample annotation sets under different label templates at any time. Simultaneously, some authorized users can also request to download the sample annotation set under a specific label template through their terminals.

[0094] Accordingly, after receiving a request from the terminal to obtain the target sample annotation set, the system can retrieve the target files associated with the target label annotation data in the target sample annotation set from the file set, thus obtaining a file sample set containing each target file. The data management system or computer equipment may also send this file sample set and the target sample annotation set as a sample set to the terminal.
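The retrieval step can be pictured as a lookup from each target label annotation data entry to its associated file. In the sketch below, the association is modeled as a hypothetical mapping from annotation id to file key, and the file set as a mapping from file key to file content; build_sample_set and the example keys are illustrative assumptions, not part of this application.

```python
def build_sample_set(sample_annotations: dict, file_index: dict, file_store: dict) -> dict:
    """Collect the target files for a sample annotation set and bundle both together.

    sample_annotations: annotation id -> label annotation data
    file_index:         annotation id -> key of the associated file
    file_store:         file key -> file content (bytes)
    """
    files = {}
    for ann_id in sample_annotations:
        key = file_index[ann_id]      # file associated with this annotation
        files[key] = file_store[key]  # fetch it from the file set
    return {"files": files, "annotations": sample_annotations}


file_store = {"img_001.jpg": b"\x01", "img_002.jpg": b"\x02", "img_003.jpg": b"\x03"}
file_index = {"a1": "img_001.jpg", "a2": "img_002.jpg", "a3": "img_003.jpg"}
sample_annotations = {
    "a1": {"person": {"gender": "male", "age": 35}},
    "a3": {"person": {"gender": "female", "age": 28}},
}

sample_set = build_sample_set(sample_annotations, file_index, file_store)
```

The returned bundle corresponds to the file sample set plus the target sample annotation set that would be sent to the terminal.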

[0095] As shown in the above scheme, this application maintains label templates through a data management system. These label templates can be associated with label annotation data that uses them. Furthermore, the data management system can store a file set, where each file is associated with a single piece of label annotation data. Based on this, users only need to select the desired label template from the data management system according to their label type requirements and input their sample generation needs. This application can then determine the label annotation data constituting the sample annotation set from the label annotation data associated with the user-selected label template, thus combining the sample annotation set with the files associated with it in the file set to form a sample set, thereby reducing the complexity of constructing the sample set.

[0096] In addition, after generating a sample annotation set based on the label template, this application stores the sample annotation set as the sample annotation set associated with the label template. Therefore, users can query and retrieve the sample annotation set at any time through the terminal, thereby realizing the management and maintenance of the sample annotation set based on the data management system and avoiding the complexity caused by manual maintenance of the sample annotation set.

[0097] It is understandable that the label templates stored in the data management system, as well as the label annotation data associated with those templates and related files, can all be uploaded and stored by users as needed. Based on this, in this application, users can flexibly add label templates to the data management system and use them to generate label annotation data, etc., according to their actual needs.

[0098] For ease of understanding, the process of storing data in the data management system in this application is described below.

[0099] Figure 2 shows a flowchart of storing data in a data management system. The method in this embodiment is applied to a data management system and may include:

[0100] S201, Obtain the user-created tag template.

[0101] The tag template defines the tag structure that makes up the tag.

[0102] For example, users can log in to the data management system via a terminal and upload the created tag templates to the data management system.

[0103] In one possible implementation, the data management system can display a tag addition interface after detecting a tag addition request. Accordingly, the data management system can obtain at least one tag item configured by the user in the tag addition interface to form a tag, as well as the data format corresponding to each tag item, to obtain the created tag template.

[0104] For example, users can log into the data management system via a terminal. The data management system can then display its main interface to the terminal. Within this main interface, users can select a tag system interface related to a tag template. The tag system interface displays the tags currently existing in the data management system, as well as a tag addition button to trigger tag creation.

[0105] Figure 3 shows a schematic diagram of the tag system interface in the data management system of this application.

[0106] As can be seen from Figure 3, the tag system interface displays multiple labels 301; in Figure 3, for example, the interface shows four labels: Label 0, Label 1, Label 2, and Label 3. The tag system interface also displays a label addition button 302. If a user clicks this button, the data management system will detect the label addition request and display the label addition interface to the terminal. Users can then configure the various components and related settings of the labels on the label addition interface to ultimately generate and store a label template.

[0107] In one possible implementation, the tag template may include: at least one tag item and the data format corresponding to each tag item.

[0108] A tag item refers to the individual components that make up a tag. The data format of a tag item refers to the data format that the value in that tag item must meet. For example, some tag items have an integer data format, while others may have a string data format.

[0109] The label template can be set according to the actual needs of the model to be trained. Correspondingly, the data format of the label items in the label template can also be set according to the data format requirements of that model; this application places no restriction on this.

[0110] Understandably, during the model training, testing, and validation processes, the labeling data required for the files can generally be a tree structure. That is, at least one label item in the labeling data can include label items at multiple levels.

[0111] Specifically, label annotation data can include: a main label item and sub-label items at all levels under the main label item. Correspondingly, creating a label template is equivalent to creating a tree-structured label template. The main label item is unique and represents the category of the label annotation data. The main label item can have one or more first-level sub-label items, and each level of sub-label item can have one or more next-level sub-label items. The names of sub-label items at the same level cannot be duplicated. Furthermore, generally, the names of main label items in different label templates should also not be duplicated.

[0112] Take the tag template {person:{gender:,age:}} as an example. The main tag item in this tag template is "person", and under the main tag item are two first-level sub-tag items, "gender" and "age". In this tag template, the first-level sub-tag items have no further sub-tag items. In practical applications, however, the sub-tag items in a tag template can be configured as needed, and the hierarchical relationship between them can be more complex.
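The tree structure described above can be sketched as follows. This is only an illustrative sketch: the class name LabelItem, its fields, and the add_child method are hypothetical and do not come from this application. It shows a main label item with sub-label items and the rule that sibling sub-label items may not share a name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LabelItem:
    """One node of a tree-structured label template (hypothetical class)."""
    name: str
    data_format: type = str          # format that the item's value must satisfy
    children: list[LabelItem] = field(default_factory=list)

    def add_child(self, child: LabelItem) -> LabelItem:
        # Sub-label items at the same level must have distinct names.
        if any(c.name == child.name for c in self.children):
            raise ValueError(f"duplicate sub-label item: {child.name}")
        self.children.append(child)
        return child


# The "person" template from the example: one main item, two first-level sub-items.
person = LabelItem("person")
person.add_child(LabelItem("gender", str))
person.add_child(LabelItem("age", int))
```

Deeper hierarchies can be built by calling add_child on a sub-label item instead of on the main label item.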

[0113] In one possible implementation, to make it easier to associate the tag annotation data that uses the tag template with a file, at least one tag item in the tag template includes an information digest item, which stores the information digest of the file associated with the tag. For example, the information digest of the file can be the MD5 value of the file.

[0114] Correspondingly, the value of the information digest item in the tag annotation data using this tag template is the information digest generated by the file associated with the tag annotation data. Since the information digest of a file is unique, the file associated with the tag annotation data can be determined by combining the value of the information digest item in the tag annotation data.
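A minimal sketch of this association, assuming the information digest is the MD5 value of the file content. The function name file_digest, the in-memory file table, and the file names are hypothetical and used only for illustration; hashlib.md5 is from the Python standard library.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Return the MD5 hex digest used here as the file's information digest."""
    return hashlib.md5(content).hexdigest()


# Hypothetical in-memory file set, keyed by information digest.
files = {
    file_digest(b"image-bytes-1"): "img_001.jpg",
    file_digest(b"image-bytes-2"): "img_002.jpg",
}

# Tag annotation data whose MD5 item holds the digest of its associated file.
annotation = {"person": {"age": 20, "gender": "male",
                         "MD5": file_digest(b"image-bytes-1")}}

# Resolve the annotation data to its file through the information digest item.
associated_file = files.get(annotation["person"]["MD5"])
```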

[0115] In another possible implementation, some tag items in a tag must have values, while the values of other tag items can be set to default values or determined by the user when creating the tag. The tag template can therefore also include required tag items. A required tag item is one whose value cannot be empty. For example, in the tag "person", "gender" may be a required tag item, while "education level" and "hobbies" are optional.

[0116] It should be noted that in practical applications, users can create one tag template at a time, or they can create multiple tag templates and then submit them for storage together. There are no restrictions on this.

[0117] S202, store the obtained label template in the template storage space set in the data management system.

[0118] For example, a region can be created in the database of the data management system to store label templates, and then the label templates uploaded to the data management system can be stored in this region.

[0119] In one possible implementation, before storing the label templates in the template storage space of the data management system, this application will also serialize the label templates to be stored. In this application, serialization can be understood as: storing the label templates after adding a template identifier to each label template.

[0120] The template identifier is added to the tag template for serialization and storage so that the tag annotation data using that template does not need to be modified separately when the tag template is modified later.

[0121] Correspondingly, after the data management system stores the tag template, if the data management system receives a template modification request for the tag template, it can modify one or more of the tag items and the corresponding data formats in the tag template based on the template modification request.

[0122] For example, suppose a tag template contains the tag item "age", and a user later wants to rename this tag item. The user can find the tag template in the tag template interface of the data management system and change the name of the tag item. After the user submits the modified tag template, the data management system replaces the stored tag template with the modified one.

[0123] It should be noted that this embodiment is for illustrative purposes only; this application also applies where tag templates are stored in the data management system in other ways.

[0124] S203, determine the second label template selected by the user for adding annotation data.

[0125] This second label template belongs to the label templates stored in the data management system. Specifically, to distinguish it from the label template selected for generating the sample annotation set, the label template for which annotation data needs to be added is referred to as the second label template. In this application, the label template for which the user selects to add annotation data can also be referred to as a candidate label template; that is, the second label template can also be referred to as a candidate label template.

[0126] It is understandable that, because the data management system stores label templates, users can request to add label annotation data to a stored label template at any time as needed. Therefore, if the second label template is the same as the label template to be added in step S201, the order of step S203 relative to steps S201 and S202 is not limited to that shown in Figure 2.

[0127] S204, obtain at least one tag annotation data to be added to the second tag template.

[0128] The label annotation data to be added to the second label template is the label annotation data generated using that template. Accordingly, this label annotation data can include each label item defined in the second label template and its value. Furthermore, the data format of each label item's value in the label annotation data is consistent with the data format of the corresponding label item in the second label template.

[0129] Understandably, users can add one or more tag annotations to a tag template each time. Typically, users will upload multiple tag annotations to a tag template at the same time.

[0130] For example, in the main interface output by the data management system, if the system detects that a user has selected a data annotation interface, the annotation interface will be displayed. This interface can show a list of tags, which may include the names of various tag templates stored in the data management system. If the system detects that a user has clicked on a tag template, it determines that the tag template is the one the user has selected to add annotation data. Correspondingly, if the system detects that a user has clicked on the "Add Data" option in the data annotation interface, it can retrieve at least one tag annotation data that the user has selected to add.

[0131] As shown in Figure 4, the label annotation data interface of the data management system includes a template preview area 401. This template preview area can display information about each stored label template. For example, if a label template is represented by its main label item, the template preview area can display the main label item 402 of each label template.

[0132] After a user selects a main label item, the data management system displays, in the label annotation data interface, information about all label annotation data associated with the label template of that main label item, as in the annotation data display area 403 shown in Figure 4. Above the annotation data display area 403 there is an operation item 404, "Add Data". If the user clicks this operation item 404, the data management system can display an annotation data addition window. The user can then select, through this window, the file that stores the label annotation data to be added, and upload at least one piece of label annotation data to the data management system.

[0133] S205, for each piece of label annotation data in the at least one piece of label annotation data, if the label annotation data conforms to the composition structure of the label defined by the second label template, the label annotation data is stored in the label annotation dataset associated with the second label template.

[0134] The label annotation dataset associated with the label template can be stored in the annotation data storage area of the data management system. For example, the annotation data storage area can be a storage area in the database used to store annotation data.

[0135] In this application, when uploading tag annotation data associated with a tag template, the data management system will detect whether the tag annotation data matches the composition structure of the tags defined by the tag template. Only if the tag annotation data matches the composition structure of the tags defined by the tag template will the tag annotation data be stored in the tag annotation dataset associated with the tag template.

[0136] In one possible implementation, if the label template defines at least one label item that makes up the label and the data format corresponding to each label item, the data management system may need to check whether each label item in the label annotation data, and the data format of its value, are consistent with the definition in the label template.

[0137] Accordingly, if all the label items in the label annotation data belong to the label items defined in the second label template, and the data format corresponding to the value of the label item in the label annotation data matches the data format of the corresponding label item in the second label template, the label annotation data is stored in the label annotation dataset associated with the second label template.

[0138] For example, suppose the second label template is: {person:{age:,gender:,MD5:}}, where age is an integer, and gender and MD5 are strings. If the label annotation data is: {person:{age:20,gender:male,MD5:#####}}, then this label annotation data conforms to the label definition of the second label template, and the label annotation data can be stored in the label annotation dataset associated with the second label template. If the label annotation data is: {person:{height:180,gender:male}}, then this label annotation data does not conform to the label definition of the second label template, and the label annotation data will not be stored.
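The check in this example can be sketched as follows. The flat dictionary representation of the template and the function name conforms are assumptions made for illustration; this application does not prescribe a concrete data structure.

```python
# Hypothetical representation of the template {person:{age:,gender:,MD5:}}.
TEMPLATE = {"main": "person",
            "items": {"age": int, "gender": str, "MD5": str}}


def conforms(annotation: dict, template: dict) -> bool:
    """Check label items and value formats against the template definition."""
    if list(annotation) != [template["main"]]:
        return False                      # wrong or missing main label item
    for name, value in annotation[template["main"]].items():
        expected = template["items"].get(name)
        if expected is None:
            return False                  # label item not defined by the template
        if not isinstance(value, expected):
            return False                  # value has the wrong data format
    return True


ok = conforms({"person": {"age": 20, "gender": "male", "MD5": "#####"}}, TEMPLATE)
bad = conforms({"person": {"height": 180, "gender": "male"}}, TEMPLATE)
```

Here ok is True and bad is False, because "height" is not a label item defined by the template.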

[0139] In one alternative approach, if the label template contains some required label items and some optional label items, the label annotation data only needs to include the required label items and their values, regardless of whether it includes the optional label items and their values.

[0140] Based on this, after the data management system in this application determines that the main label item of the label annotation data is consistent with the main label item of the second label template, it can further classify the label annotation data into the following situations:

[0141] If the label annotation data contains sub-label items that are not in the second label template, or does not contain at least one required sub-label item in the second label template, the label annotation data will be identified as unqualified data and will not be stored in the label annotation dataset associated with the second label template.

[0142] The required sub-label items in the label template can be set as needed. In particular, considering that the label annotation data can be associated with the file through the MD5 value, the required sub-label items in the label template can include the MD5 sub-label item.

[0143] If the label items contained in the label annotation data are consistent with the label items defined in the second label template, but the data format of a label item's value in the label annotation data is inconsistent with the data format defined for that label item in the second label template, the label annotation data will be identified as unqualified data and will not be stored in the label annotation dataset associated with the second label template.

[0144] If the data format of each label item and its value contained in the label annotation data conforms to the definition in the second label template, but the label annotation data does not contain at least one non-mandatory sub-label item and its value in the second label template, the non-mandatory label item can be added to the label annotation data and the value of the non-mandatory label item can be set to the default value to obtain the reconstructed label annotation data; then, the reconstructed label annotation data is stored in the label annotation dataset associated with the second label template.

[0145] Of course, if the label annotation data contains every label item defined in the second label template, and the data format corresponding to the value of each label item in the label annotation data is consistent with the data format corresponding to the corresponding label item in the second label template, then the label annotation data can be directly stored in the label annotation dataset associated with the second label template.

[0146] For example:

[0147] Assume the second tag template is {person:{age:,gender:,education:}}, where gender is a required sub-tag item while education and age are optional sub-tag items. Assume further that the obtained tag annotation data 1 is {person:{age:20,gender:female}} and tag annotation data 2 is {person:{age:30,education:bachelor}}.

[0148] As can be seen, the main label item in label annotation data 1 is consistent with the main label item defined in the second label template, all the sub-label items in label annotation data 1 belong to the sub-label items defined in the second label template, and the data formats meet the requirements. However, label annotation data 1 lacks the non-required sub-label item "education" defined in the second label template. Therefore, the value of "education" in label annotation data 1 can be set to the default value, for example "high school", which changes label annotation data 1 to {person:{age:20,gender:female,education:high school}}. The changed label annotation data 1 can then be stored in the label annotation dataset associated with the second label template.

[0149] Label annotation data 2 is missing the required sub-label item "gender" of the second label template and its value, so it will not be stored in the label annotation dataset associated with the second label template.
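The cases above can be sketched as one function that rejects unqualified data and completes missing optional items. The template representation, the function name check_and_fill, and the default value 0 for "age" are assumptions for illustration; only the "high school" default comes from the example.

```python
# Hypothetical template: data format, required flag, and default per sub-label item.
TEMPLATE = {
    "age":       {"format": int, "required": False, "default": 0},
    "gender":    {"format": str, "required": True,  "default": None},
    "education": {"format": str, "required": False, "default": "high school"},
}


def check_and_fill(items: dict, template: dict):
    """Return the (possibly completed) items, or None for unqualified data."""
    if any(name not in template for name in items):
        return None                               # unknown sub-label item
    for name, spec in template.items():
        if name not in items:
            if spec["required"]:
                return None                       # required item missing
            continue
        if not isinstance(items[name], spec["format"]):
            return None                           # wrong data format
    # Fill missing optional items with their default values.
    return {name: items.get(name, spec["default"])
            for name, spec in template.items()}


data1 = check_and_fill({"age": 20, "gender": "female"}, TEMPLATE)
data2 = check_and_fill({"age": 30, "education": "bachelor"}, TEMPLATE)
```

With these inputs, data1 is completed with the default education value, and data2 is None because the required item "gender" is missing.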

[0150] It is understandable that if the label annotation data conforms to the label composition structure defined in the second label template, this application can also detect whether the label annotation data belongs to the label annotation data already stored in the label annotation dataset of the second label template. If so, the label annotation data is used to overwrite the corresponding label annotation data in the label annotation dataset associated with the second label template. At the same time, the overwritten label annotation data can also be stored in the historical data record.

[0151] Here, label annotation data is considered to belong to the stored label annotation data if its data identifier is the same as that of a stored piece of label annotation data while its content differs, or if its content is the same as that of a stored piece of label annotation data, and so on.
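A minimal sketch of overwriting with a historical data record, assuming label annotation data is matched by its data identifier. The dictionary store, the history list, and the function name store are hypothetical.

```python
dataset = {}    # data identifier -> label annotation data (hypothetical store)
history = []    # overwritten versions, oldest first


def store(data_id: str, annotation: dict) -> None:
    """Store annotation data, archiving any version that it overwrites."""
    if data_id in dataset:
        history.append((data_id, dataset[data_id]))
    dataset[data_id] = annotation


store("a1", {"person": {"age": 20, "gender": "female"}})
store("a1", {"person": {"age": 21, "gender": "female"}})
```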

[0152] Understandably, storing tag annotation data as tag annotation data associated with the second tag template can be achieved by establishing a link between the tag annotation data and the template identifier of the second tag template. Based on this, if the second tag template is modified, only the content of the second tag template needs to be changed: because the tag annotation data is associated with the template identifier of the second tag template, any modification to the second tag template automatically modifies the corresponding tag items in the associated tag annotation data. For example, assuming the second tag template is {person:{age:, gender:}}, and the tag annotation data using this template is {person:{age:20, gender:female}}, then if a tag item in the second tag template is renamed, the same tag item in the tag annotation data is renamed automatically and its value is kept.
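One possible way to obtain this behavior is sketched below. It is an assumption, not the mechanism stated in this application: annotation data stores the template identifier and keys its values by stable item keys, and item names are resolved from the template when the data is read. The identifiers T1, i1, i2 and the new name "years" are hypothetical.

```python
# Hypothetical stores: templates keyed by template identifier; each template
# maps a stable item key to the item's current name.
templates = {"T1": {"main": "person", "items": {"i1": "age", "i2": "gender"}}}

# Annotation data references the template identifier and stores values by item key.
annotation = {"template_id": "T1", "values": {"i1": 20, "i2": "female"}}


def render(annotation: dict) -> dict:
    """Materialise annotation data using the template's current item names."""
    tpl = templates[annotation["template_id"]]
    return {tpl["main"]: {tpl["items"][k]: v
                          for k, v in annotation["values"].items()}}


before = render(annotation)
templates["T1"]["items"]["i1"] = "years"   # hypothetical new name for "age"
after = render(annotation)
```

Only the template is edited; the stored annotation data is untouched, yet it is read back with the renamed tag item.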

[0153] As can be seen from the above, the data management system of this application can store user-uploaded tag templates. Simultaneously, users can store tag annotation data associated with tag templates into the data management system. The system automatically checks whether the tag annotation data conforms to the tag template's definition of a tag. Only when the tag annotation data conforms to the tag template's definition will the tag annotation data be stored in the tag annotation dataset associated with the tag template. This effectively reduces the possibility of tag annotation data errors and avoids the complexity caused by users manually verifying tag annotation data.

[0154] It is understood that, in this application, users can also upload files associated with tag annotation data to the data management system as needed. Specifically, the data management system can obtain at least one file to be stored and store the file in a file set. The tag annotation data associated with the file is tag annotation data containing an information digest generated based on that file.

[0155] For example, assuming that all tag annotation data contains the value of the MD5 tag item, if the MD5 value in the tag annotation data is consistent with the MD5 value generated by a certain file, then the tag annotation data is the tag annotation data associated with that file.

[0156] Understandably, after storing label templates, label annotation data, and files in the data management system, users can request to query these materials as needed. Correspondingly, upon receiving a user's query request, the data management system can output the requested label template, the associated label annotation data, or the files associated with the label annotation data.

[0157] Understandably, to enhance the security of data stored in the data management system, this application can also set different access permissions for different users. For example, the administrator of the data management system can be granted permission to create tag templates, to access and modify any data in the data management system, and to browse individual files, tag annotation data, and sample annotation sets. Other users can be granted permission to query, retrieve, and modify data, or only to query and retrieve data, depending on their actual needs.

[0158] Accordingly, in this application, after receiving a data query request, the data management system first checks whether the user who triggered the data query request has data query permission. The data query request is responded to only if the user has that permission.

[0159] Similarly, when the data management system receives requests to create tag templates or upload tag annotation data, it must confirm that the user has the appropriate permissions before responding to the requests.
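The permission check before responding to a request can be sketched as follows. The permission names, the user table, and the handle function are hypothetical; enum.Flag is from the Python standard library.

```python
from enum import Flag, auto


class Permission(Flag):
    QUERY = auto()
    MODIFY = auto()
    CREATE_TEMPLATE = auto()
    UPLOAD_ANNOTATION = auto()


# Hypothetical user table; the administrator holds every permission.
USERS = {
    "admin": (Permission.QUERY | Permission.MODIFY
              | Permission.CREATE_TEMPLATE | Permission.UPLOAD_ANNOTATION),
    "reader": Permission.QUERY,
}


def handle(user: str, needed: Permission, action):
    """Run the action only if the user holds the needed permission."""
    if needed not in USERS.get(user, Permission(0)):
        raise PermissionError(f"{user} lacks {needed}")
    return action()


result = handle("reader", Permission.QUERY, lambda: "query result")
```

A request from "reader" that needs Permission.MODIFY, or any request from an unknown user, raises PermissionError instead of being answered.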

[0160] It is understood that, in the above embodiments of this application, after the data management system stores the label template, the label annotation data associated with the label template, and the files associated with the label annotation data, users with the necessary permissions can modify the corresponding data as needed.

[0161] In one possible implementation, after a sample annotation set is generated, the label annotation data in it may already be used for model training, testing or validation. To avoid changing the sample annotation set when the label annotation data associated with the label template is modified, this application can, after receiving a request to change the label annotation data associated with the label template, change only that label annotation data while keeping the corresponding label annotation data in the sample annotation set unchanged.

[0162] The following explanation uses a flowchart as an example. Figure 5 shows another flowchart of a data processing method provided in an embodiment of this application. The method of this embodiment may include:

[0163] S501, determine the first tag template selected by the user.

[0164] S502, obtain the user's input sample generation requirements.

[0165] The sample generation requirement describes the characteristics of the sample set that the user needs to generate. In this application, the sample generation requirement includes at least the target number of samples to be generated.

[0166] S503, based on the sample generation requirements, determine the target number of target label annotation data from the label annotation dataset associated with the first label template.

[0167] S504, construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as the sample annotation set associated with the first label template, so that the target files associated with the target label annotation data in the target sample annotation set in the data management system and the target sample annotation set form a sample set for model training, testing or verification.

[0168] The steps S501 to S504 above can be found in the relevant descriptions of the previous embodiments, and will not be repeated here.
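Steps S501 to S504 can be sketched as follows. How the target label annotation data is chosen is not restated in this passage, so the seeded random sampling used here is an assumption, as are the function name and the data identifiers.

```python
import copy
import random


def build_sample_annotation_set(dataset: dict, target_number: int, seed: int = 0) -> dict:
    """Select target_number pieces of annotation data and return independent copies."""
    if target_number > len(dataset):
        raise ValueError("not enough label annotation data")
    rng = random.Random(seed)
    chosen = rng.sample(sorted(dataset), target_number)
    # Deep copies keep the sample set unchanged if the dataset is edited later.
    return {data_id: copy.deepcopy(dataset[data_id]) for data_id in chosen}


# Hypothetical label annotation dataset associated with the first label template "T1".
dataset = {f"a{i}": {"person": {"age": 20 + i, "MD5": f"digest{i}"}} for i in range(5)}

# S504: store the result as the sample annotation set associated with the template.
sample_sets = {"T1": build_sample_annotation_set(dataset, target_number=3)}
```

Each selected piece of annotation data carries its MD5 item, so the associated target files can be looked up in the file set to complete the sample set.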

[0169] S505, receive an annotation data change request.

[0170] The annotation data change request is used to request changes to the label annotation data within the label annotation dataset of the label template.

[0171] For example, an annotation data change request can carry the template identifier of the label template associated with the label annotation data to be changed, as well as the data identifier of that label annotation data. For instance, an annotation data change request could request a change to label annotation data associated with the first label template.

[0172] In one possible implementation, the data management system can receive query requests for tag templates. For example, when it detects that a user selects a tag template, or enters a tag template to query, on the tag system interface, it confirms that a query request for that tag template has been received. In response to the query request, the system can display the tag annotation data associated with the tag template, or output it to a computer device outside the data management system. Based on this, the user can select the tag annotation data to be changed and trigger the generation of an annotation data change request.

[0173] S506, Based on the annotation data change request, perform change processing on the label annotation data to be changed in the label annotation dataset associated with the label template.

[0174] The annotation data change request can request the modification or deletion of label annotation data associated with a specific label template, without any restrictions. Correspondingly, based on the actual operation requested in the annotation data change request, the label annotation data associated with the corresponding label template can be modified or deleted.

[0175] In one alternative approach, in order to ensure that relevant information before the change can still be retrieved after the label annotation data is changed, this application may also record the label annotation data to be changed in the historical data record before the change processing of the label annotation dataset associated with the label template.

[0176] S507, receive a sample set update request.

[0177] Specifically, the sample set update request is used to request an update of the target label annotation data in the target sample annotation set associated with the first label template.

[0178] The purpose of the sample set update request is to ensure that the target label annotation data in the target sample annotation set is consistent with the corresponding label annotation data in the label annotation dataset associated with the first label template.

[0179] S508, for each target label annotation data in the target sample annotation set, if there is a change in the target label annotation data in the label annotation dataset associated with the first label template, the target label annotation data in the target sample annotation set is modified based on the change method of the target label annotation data in the label annotation dataset associated with the first label template.

[0180] Specifically, based on the change method of the target label annotation data in the label annotation dataset associated with the first label template, the target label annotation data in the target sample annotation set is modified to ensure that the target label annotation data in the target sample annotation set is consistent with the target label annotation data in the label annotation dataset associated with the first label template.

[0181] For example, suppose that label annotation data 1 in the label annotation dataset associated with the first label template has been modified, and that label annotation data 1 is also included in the target sample annotation set. Then label annotation data 1 in the target sample annotation set needs to be updated to the label annotation data 1 stored in the label annotation dataset.

[0182] As another example, if label annotation data 2 in the label annotation dataset associated with the first label template has been deleted, then label annotation data 2 in the target sample annotation set also needs to be deleted.

[0183] Furthermore, to facilitate data management system administrators or other users in understanding the data status of the target sample annotation set at different times, this application can store a snapshot of the target sample annotation set in the historical data record after updating the target sample annotation set. Based on this, if the target sample annotation set is subsequently modified, the previous label annotation data included in the target sample annotation set can be retrieved from this historical data record.
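Steps S505 to S508 can be sketched as follows, assuming label annotation data is matched by data identifier. The data structures and the function name update_sample_set are hypothetical. Changes first affect only the dataset, and the sample annotation set is brought in line only when an update is requested, after which a snapshot is recorded.

```python
import copy

dataset = {"a1": {"age": 20}, "a2": {"age": 30}}   # annotation dataset of the template
sample_set = copy.deepcopy(dataset)                # target sample annotation set
history = []                                       # snapshots of the sample set

# S505/S506: change requests touch only the label annotation dataset.
dataset["a1"]["age"] = 21      # modification
del dataset["a2"]              # deletion


def update_sample_set(sample_set: dict, dataset: dict) -> None:
    """S508: replay dataset changes onto the sample set, then snapshot it."""
    for data_id in list(sample_set):
        if data_id not in dataset:
            del sample_set[data_id]                                # deleted upstream
        elif sample_set[data_id] != dataset[data_id]:
            sample_set[data_id] = copy.deepcopy(dataset[data_id])  # modified upstream
    history.append(copy.deepcopy(sample_set))


state_before_update = copy.deepcopy(sample_set)    # still the original content
update_sample_set(sample_set, dataset)             # S507: update request received
```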

[0184] As can be seen from the above, this application will not synchronously update the corresponding label annotation data in the sample annotation set associated with the label template after the label annotation data associated with the label template is changed, thereby maintaining the integrity and consistency of the sample annotation set that has been used to train the model (including model training, testing and validation), and reducing the impact on the training model.

[0185] Furthermore, to avoid manually adjusting the sample annotation set after changes to the label annotation data associated with the label template, this application can also update the sample annotation set based on the change method of the label annotation data associated with the label template after detecting a sample set update request for the sample annotation set, thereby improving the convenience of updating the label annotation data within the sample annotation set.

[0186] Furthermore, this application also provides a data management system. Figure 6 shows a schematic diagram of the architecture of a data management system provided in an embodiment of this application. The system of this embodiment may include:

[0187] The tag management module 601 is used to obtain and store at least one tag template, wherein the tag template defines the composition structure of the tag;

[0188] The annotation data management module 602 is used to obtain and store at least one label annotation dataset, wherein the label annotation dataset is associated with a label template, and the label annotation dataset includes at least one piece of label annotation data constructed using the label template associated with the label annotation dataset;

[0189] File management module 603 is used to obtain and store a file set, wherein each file in the file set is associated with a tag annotation data from at least one tag annotation dataset.
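The three modules can be sketched as plain data holders. The class names, field names, and the files_for helper are hypothetical and only illustrate how the stored templates, annotation datasets, and files relate to one another through the template identifier and the information digest.

```python
from dataclasses import dataclass, field


@dataclass
class LabelManagementModule:              # module 601
    templates: dict = field(default_factory=dict)   # template id -> label template


@dataclass
class AnnotationDataManagementModule:     # module 602
    datasets: dict = field(default_factory=dict)    # template id -> list of annotation data


@dataclass
class FileManagementModule:               # module 603
    files: dict = field(default_factory=dict)       # information digest -> file path


@dataclass
class DataManagementSystem:
    labels: LabelManagementModule = field(default_factory=LabelManagementModule)
    annotations: AnnotationDataManagementModule = field(
        default_factory=AnnotationDataManagementModule)
    file_store: FileManagementModule = field(default_factory=FileManagementModule)

    def files_for(self, template_id: str) -> list:
        """List stored files referenced by a template's annotation data."""
        return [self.file_store.files[a["MD5"]]
                for a in self.annotations.datasets.get(template_id, [])
                if a.get("MD5") in self.file_store.files]


system = DataManagementSystem()
system.labels.templates["T1"] = {"person": {"age": int, "MD5": str}}
system.annotations.datasets["T1"] = [{"age": 20, "MD5": "d1"}]
system.file_store.files["d1"] = "img_001.jpg"
```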

[0190] In one possible implementation, the tag management module includes:

[0191] The template acquisition submodule is used to obtain the tag template created by the user, wherein the tag template defines the composition structure of the tag;

[0192] The template storage submodule is used to store the obtained label templates into the template storage space set in the data management system.

[0193] In yet another possible implementation, the template acquisition submodule includes:

[0194] The UI display submodule is used to display the tag addition interface after a tag addition request is detected.

[0195] The template creation subunit is used to obtain at least one tag item configured by the user in the tag adding interface to form a tag and the data format corresponding to each tag item, so as to obtain the created tag template, wherein the tag template includes the at least one tag item and the data format corresponding to each tag item.

[0196] In yet another possible implementation, the annotation data management module includes:

[0197] The template determination submodule is used to determine the candidate tag templates selected by the user to add annotation data. The candidate tag templates belong to the tag templates stored in the data management system.

[0198] The data acquisition submodule is used to acquire at least one tag annotation data to be added to the candidate tag template;

[0199] The data storage submodule is used to store each tag annotation data to be added to the candidate tag template in the tag annotation dataset associated with the candidate tag template if the tag annotation data conforms to the composition structure of the tag defined by the candidate tag template.

[0200] In another possible implementation, the tag template maintained by the tag management module defines at least one tag item that makes up the tag and the data format corresponding to each tag item;

[0201] Specifically, the data storage submodule is used to store each tag annotation data to be added to the candidate tag template in the tag annotation dataset associated with the candidate tag template if all the tag items in the tag annotation data belong to the tag items defined in the candidate tag template, and the data format corresponding to the value of the tag item in the tag annotation data matches the data format of the corresponding tag item in the candidate tag template.

[0202] On the other hand, corresponding to the data processing method provided in the embodiments of this application, this application also provides a data processing apparatus. Figure 7 shows a schematic diagram of a possible structure of a data processing apparatus provided in an embodiment of this application. The apparatus of this embodiment may include:

[0203] Template determination unit 701 is used to determine the target tag template selected by the user. The target tag template belongs to at least one tag template stored in the data management system. The tag template defines the composition structure of the tag.

[0204] The requirement acquisition unit 702 is used to acquire the sample generation requirement input by the user, wherein the sample generation requirement includes at least the target number of samples to be generated;

[0205] The annotation determination unit 703 is used to determine the target number of target label annotation data from the label annotation dataset associated with the target label template in the data management system based on the sample generation requirements. The label annotation dataset associated with the target label template includes at least one label annotation data constructed using the target label template.

[0206] The annotation set generation unit 704 is used to construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as a sample annotation set associated with the target label template, so that the target files associated in the data management system with each target label annotation data in the target sample annotation set, together with the target sample annotation set, form a sample set for model training, testing or verification.
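
The cooperation of units 701 to 704 can be sketched as follows. This is an illustration under stated assumptions: the sample generation requirement is reduced to the target number alone, selection is done by random sampling (the specification does not fix a selection strategy), and the names `Annotation`, `file_id`, `build_sample_annotation_set` and `assemble_sample_set` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    annotation_id: str
    file_id: str    # hypothetical key linking the annotation to its file
    labels: tuple   # (tag item, value) pairs built from the target template


def build_sample_annotation_set(
    dataset: List[Annotation], target_number: int, seed: Optional[int] = None
) -> List[Annotation]:
    """Pick target_number annotations from the dataset associated with the
    target label template. Random sampling is an assumption of this sketch."""
    if target_number > len(dataset):
        raise ValueError("dataset holds fewer annotations than requested")
    return random.Random(seed).sample(dataset, target_number)


def assemble_sample_set(
    sample_annotations: List[Annotation], file_set: Dict[str, bytes]
) -> List[Tuple[bytes, Annotation]]:
    """Pair each selected annotation with its associated target file, giving
    a sample set usable for training, testing or verification."""
    return [(file_set[a.file_id], a) for a in sample_annotations]
```

Only the annotation set is stored; the files are looked up through the association at the time the sample set is assembled, which mirrors the separation between the annotation data management module and the file management module.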

[0207] In one possible implementation, the device further includes:

[0208] The first request obtaining unit is used to obtain an annotation data change request, wherein the annotation data change request is used to request a change to label annotation data in the label annotation dataset associated with the target label template;

[0209] The first modification unit is used to modify the label annotation data to be changed in the label annotation dataset associated with the target label template based on the annotation data change request.

[0210] In yet another possible implementation, the device further includes:

[0211] The second request obtaining unit is used to obtain a sample set update request after the first modification unit performs change processing on the label annotation data to be changed in the label annotation dataset associated with the target label template. The sample set update request is used to request the update of the target label annotation data in the target sample annotation set associated with the target label template.

[0212] The second modification unit is used to, for each target label annotation data in the target sample annotation set, if that target label annotation data has been changed in the label annotation dataset associated with the target label template, modify the target label annotation data in the target sample annotation set according to the manner in which it was changed in the label annotation dataset.
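
The update performed by the second modification unit can be sketched as follows. This is a minimal illustration under assumptions: annotations are keyed by a hypothetical annotation identifier, the sample annotation set holds copies of the dataset entries, and an entry deleted from the dataset is dropped from the sample set (the specification does not state how deletions are handled).

```python
from typing import Dict


def update_sample_annotation_set(
    sample_set: Dict[str, dict], dataset: Dict[str, dict]
) -> Dict[str, dict]:
    """Return an updated sample annotation set in which every entry that
    changed in the label annotation dataset carries the changed data."""
    updated: Dict[str, dict] = {}
    for annotation_id, sample_copy in sample_set.items():
        current = dataset.get(annotation_id)
        if current is None:
            # Assumption of this sketch: a deleted annotation is dropped.
            continue
        if current != sample_copy:
            updated[annotation_id] = dict(current)  # propagate the change
        else:
            updated[annotation_id] = sample_copy    # unchanged, keep as is
    return updated
```

Because the update runs only on request, a sample set that is already in use for training stays stable until the user explicitly asks for it to be refreshed.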

[0213] In yet another possible implementation, the device further includes:

[0214] The first recording unit is used to record, in the historical data record, the label annotation data to be changed as it was before the change processing, while or after the first modification unit performs change processing on the label annotation data to be changed in the label annotation dataset associated with the target label template.

[0215] The second recording unit is used to store a data snapshot of the target sample annotation set in the historical data record after the second modification unit has modified the target label annotation data in the target sample annotation set.
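
The two recording units can be sketched as follows. This is an illustration only: the class `HistoryRecord`, its method names, and the in-memory lists standing in for the historical data record are all hypothetical, and deep copies are used so that later edits do not alter what was recorded.

```python
import copy
from typing import Any, Dict, List


class HistoryRecord:
    """In-memory stand-in for the historical data record (an assumption)."""

    def __init__(self) -> None:
        self.previous_annotations: List[Dict[str, Any]] = []
        self.snapshots: List[Dict[str, dict]] = []

    def record_before_change(self, annotation_id: str, annotation: dict) -> None:
        # First recording unit: keep the annotation as it was before the change.
        self.previous_annotations.append(
            {"annotation_id": annotation_id, "data": copy.deepcopy(annotation)}
        )

    def snapshot_sample_set(self, sample_set: Dict[str, dict]) -> None:
        # Second recording unit: store a snapshot of the updated sample set.
        self.snapshots.append(copy.deepcopy(sample_set))


def change_annotation(
    dataset: Dict[str, dict], annotation_id: str, new_data: dict, history: HistoryRecord
) -> None:
    """Record the old value, then apply the change to the dataset."""
    history.record_before_change(annotation_id, dataset[annotation_id])
    dataset[annotation_id] = new_data
```

Keeping both the pre-change data and the sample set snapshots makes it possible to trace which version of an annotation a given sample set was built from.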

[0216] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. Furthermore, the features described in the various embodiments of this specification can be substituted or combined with each other, enabling those skilled in the art to implement or use this application. For apparatus embodiments, since they are basically similar to method embodiments, the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.

[0217] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.

[0218] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data management system, characterized by comprising: a tag management module, used to obtain and store at least one tag template, wherein the tag template defines the composition structure of a tag; the tag template maintained by the tag management module defines at least one tag item that makes up the tag; the tag template has a tree structure and includes a unique main tag item and multi-level sub-tag items under the main tag item, and the names of sub-tag items at the same level are not repeated; an annotation data management module, used to obtain and store at least one label annotation dataset, wherein the label annotation dataset is associated with a label template, and the label annotation dataset includes at least one piece of label annotation data constructed using the label template associated with the label annotation dataset; and a file management module, used to obtain and store a file set, wherein each file in the file set is associated with tag annotation data from at least one tag annotation dataset; wherein the annotation data management module includes: a template determination submodule, used to determine the candidate tag template selected by the user for adding annotation data, the candidate tag template belonging to the tag templates stored in the data management system; a data acquisition submodule, used to acquire at least one piece of tag annotation data to be added to the candidate tag template; and a data storage submodule, used to, for each piece of tag annotation data to be added to the candidate tag template, store the tag annotation data in the tag annotation dataset associated with the candidate tag template if the tag annotation data conforms to the composition structure of the tag defined by the candidate tag template.

2. The data management system according to claim 1, characterized in that, The tag management module includes: The template acquisition submodule is used to obtain the tag template created by the user, wherein the tag template defines the tag structure that makes up the tag; The template storage submodule is used to store the obtained label templates into the template storage space set in the data management system.

3. The data management system according to claim 2, characterized in that, The template acquisition submodule includes: The UI display submodule is used to detect tag addition requests and display the tag addition interface. The template creation subunit is used to obtain at least one tag item configured by the user in the tag adding interface to form a tag and the data format corresponding to each tag item, so as to obtain the created tag template. The tag template also includes the data format corresponding to each tag item.

4. The data management system according to claim 1, characterized in that, The tag template maintained by the tag management module also defines the data format corresponding to each tag item; Specifically, the data storage submodule is used to store each tag annotation data to be added to the candidate tag template in the tag annotation dataset associated with the candidate tag template if all the tag items in the tag annotation data belong to the tag items defined in the candidate tag template, and the data format corresponding to the value of the tag item in the tag annotation data matches the data format of the corresponding tag item in the candidate tag template.

5. A data processing method, characterized by comprising: determining the target tag template selected by the user, wherein the target tag template belongs to at least one tag template stored in the tag management module of the data management system; the tag template defines the composition structure of a tag; the tag template maintained by the tag management module defines at least one tag item that makes up the tag; the tag template has a tree structure and includes a unique main tag item and multi-level sub-tag items under the main tag item, and the names of sub-tag items at the same level are not repeated; obtaining the sample generation requirement input by the user, wherein the sample generation requirement includes at least the target number of samples to be generated; based on the sample generation requirement, determining the target number of target label annotation data from the label annotation dataset associated with the target label template in the annotation data management module of the data management system, wherein the label annotation dataset associated with the target label template includes at least one label annotation data constructed using the target label template; and constructing a target sample annotation set containing the target number of target label annotation data, and storing the target sample annotation set as a sample annotation set associated with the target label template, so that the target files associated in the data management system with each target label annotation data in the target sample annotation set, together with the target sample annotation set, form a sample set for model training, testing or verification; wherein the annotation data management module includes: a template determination submodule, used to determine the candidate tag template selected by the user for adding annotation data, the candidate tag template belonging to the tag templates stored in the data management system; a data acquisition submodule, used to acquire at least one piece of tag annotation data to be added to the candidate tag template; and a data storage submodule, used to, for each piece of tag annotation data to be added to the candidate tag template, store the tag annotation data in the tag annotation dataset associated with the candidate tag template if the tag annotation data conforms to the composition structure of the tag defined by the candidate tag template.

6. The method according to claim 5, characterized in that, After storing the target sample annotation set as the sample annotation set associated with the target label template, the method further includes: The system receives a request from the terminal to obtain the target sample annotation set, retrieves each target file associated with the target label annotation data in the target sample annotation set from the file set, and obtains a file sample set containing the target files. The file sample set and the target sample annotation set are sent as a sample set to the terminal.

7. The method according to claim 5, characterized in that the method further includes: obtaining an annotation data change request, wherein the annotation data change request is used to request a change to label annotation data in the label annotation dataset associated with the target label template; and, based on the annotation data change request, modifying the label annotation data to be changed in the label annotation dataset associated with the target label template.

8. The method according to claim 7, characterized in that, after performing the change processing on the label annotation data to be changed in the label annotation dataset associated with the target label template, the method further includes: obtaining a sample set update request, the sample set update request being used to request an update of the target label annotation data in the target sample annotation set associated with the target label template; and, for each target label annotation data in the target sample annotation set, if there is a change in the target label annotation data in the label annotation dataset associated with the target label template, modifying the target label annotation data in the target sample annotation set based on the manner in which the target label annotation data was changed in the label annotation dataset associated with the target label template.

9. The method according to claim 8, characterized in that, simultaneously with or after the change processing of the label annotation data to be changed in the label annotation dataset associated with the target label template, the method further includes: recording, in the historical data record, the label annotation data to be changed as it was before the change processing; and, after modifying the target label annotation data in the target sample annotation set, the method further includes: storing a data snapshot of the target sample annotation set in the historical data record.

10. A data processing apparatus, characterized in that, The device includes: The template determination unit is used to determine the target tag template selected by the user. The target tag template belongs to at least one tag template stored in the tag management module of the data management system. The tag template defines the composition structure of the tag. The tag management module maintains the tag template, which defines at least one tag item that makes up the tag. The tag template has a tree structure. The tag template includes a unique main tag item and multi-level sub-tag items under the main tag item. The names of the sub-tag items at the same level are not repeated. The requirement acquisition unit is used to acquire the sample generation requirement input by the user, wherein the sample generation requirement includes at least the target number of samples to be generated; The annotation determination unit is used to determine the target number of target label annotation data from the label annotation dataset associated with the target label template in the annotation data management module of the data management system based on the sample generation requirements. The label annotation dataset associated with the target label template includes at least one label annotation data constructed using the target label template. 
The annotation set generation unit is used to construct a target sample annotation set containing the target number of target label annotation data, and store the target sample annotation set as a sample annotation set associated with the target label template, so that the target files associated with each target label annotation data in the target sample annotation set in the data management system and the target sample annotation set form a sample set for model training, testing, or validation; wherein, the annotation data management module includes: a template determination submodule, used to determine the candidate label template selected by the user to add annotation data, the candidate label template belonging to the label template stored in the data management system; a data acquisition submodule, used to obtain at least one label annotation data to be added to the candidate label template; and a data storage submodule, used to, for each label annotation data to be added to the candidate label template, if the label annotation data conforms to the composition structure of the label defined by the candidate label template, store the label annotation data in the label annotation dataset associated with the candidate label template.