Dataset production method, apparatus, device, and storage medium
By partitioning data according to subject identification, an N-fold data set is generated. This addresses the low efficiency of dataset production and the overlap of data across datasets, and improves both the rigor of the datasets and the accuracy of model training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING NATONG MEDICAL ROBOT TECH CO LTD
- Filing Date
- 2022-12-22
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies often result in inefficient dataset creation and the presence of overlapping data, which negatively impacts model training accuracy.
By acquiring sample information for multiple target samples and partitioning the samples N times according to subject identification, an N-fold data set is generated in which target samples with the same subject identification are located in the same dataset.
This improves the efficiency and rigor of dataset creation, reduces manual labor, avoids overlapping data across datasets, and enhances the overall accuracy of model training.
Figure CN116010814B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for creating datasets.
Background Technology
[0002] In machine learning, model training often requires multiple datasets. For example, training, validation, and test sets are often needed for training an algorithm model. The training set is used to train the model by providing inputs and corresponding outputs, allowing the model to learn the relationship between inputs and outputs. The validation set is used to estimate the training level of the model, such as the classification accuracy of the classifier and the prediction error. The test set is used to test the performance of the trained model.
[0003] Currently, datasets are typically created as follows: users preprocess the collected, messy data to form multiple samples, and then divide these samples into different datasets, thus obtaining multiple datasets. However, this manual method of creating datasets is not only inefficient and wasteful of manpower, but it also easily leads to overlapping data in different datasets, resulting in inaccurate sample data and affecting the overall training accuracy of the model.
Summary of the Invention
[0004] To solve the above-mentioned technical problems, or at least partially solve them, this disclosure provides a method, apparatus, device, and storage medium for creating datasets.
[0005] A first aspect of this disclosure provides a method for creating a dataset, the method comprising:
[0006] Obtaining sample information of multiple target samples, wherein the sample information includes a subject identification;
[0007] Partitioning the multiple target samples N times based on the subject identification to obtain an N-fold data set, where N is a positive integer, each fold of the data set includes multiple datasets, and target samples corresponding to the same subject identification are located in the same dataset.
[0008] A second aspect of this disclosure provides a dataset creation apparatus, the apparatus comprising:
[0009] The acquisition module is used to acquire sample information from multiple target samples, including subject identification information.
[0010] The data partitioning module is used to partition multiple target samples N times based on the subject's identity to obtain an N-fold data set, where N is a positive integer. Each fold data set includes multiple datasets, and target samples corresponding to the same subject's identity are located in the same dataset.
[0011] A third aspect of this disclosure provides an electronic device, the electronic device comprising: a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the method of the first aspect described above.
[0012] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the method of the first aspect described above.
[0013] The technical solution provided in this disclosure has the following advantages compared with the prior art:
[0014] This embodiment of the disclosure can acquire sample information of multiple target samples, wherein the sample information includes subject identification; the multiple target samples are partitioned N times according to the subject identification to obtain an N-fold data set, where N is a positive integer, and each fold of the data set includes multiple datasets, with target samples corresponding to the same subject identification located in the same dataset. With this technical solution, multiple target samples can be partitioned N times automatically to obtain an N-fold data set, which makes dataset creation fast and simple and saves manual labor. It also prevents multiple target samples of the same subject from being divided into different datasets, thereby mitigating the problem of overlapping data in different datasets and enhancing the rigor of the data set. This helps improve the overall training accuracy when the model is subsequently trained.
Attached Figure Description
[0015] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0016] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a flowchart of a dataset creation method provided in an embodiment of this disclosure;
[0018] Figure 2 is a flowchart of another dataset creation method provided in an embodiment of this disclosure;
[0019] Figure 3 is a flowchart of a dataset creation process provided in an embodiment of this disclosure;
[0020] Figure 4 is a schematic structural diagram of a dataset creation apparatus provided in an embodiment of this disclosure;
[0021] Figure 5 is a schematic structural diagram of an electronic device according to an embodiment of this disclosure.
Detailed Implementation
[0022] To better understand the above-mentioned objectives, features, and advantages of this disclosure, the solutions disclosed herein will be further described below. It should be noted that, unless otherwise specified, the embodiments and features described herein can be combined with each other.
[0023] Numerous specific details are set forth in the following description in order to provide a full understanding of this disclosure, but this disclosure may also be implemented in other ways different from those described herein; obviously, the embodiments in the specification are only some, and not all, of the embodiments of this disclosure.
[0024] Figure 1 is a flowchart of a dataset creation method provided in an embodiment of this disclosure. This method can be executed by an electronic device. The electronic device can be exemplarily understood as a device with page display capabilities, such as a mobile phone, tablet computer, laptop computer, desktop computer, or smart TV. As shown in Figure 1, the method provided in this embodiment includes the following steps:
[0025] S110. Obtain sample information for multiple target samples, including subject identification information.
[0026] Specifically, the subject identification (i.e., subject ID) corresponds one-to-one with the subject and represents the subject's identity. It may take the form of a number, a string, etc., but is not limited to these.
[0027] Of course, sample information may also include sample identifier (i.e., sample ID), sample creation time, etc., but is not limited to these.
[0028] Specifically, the file format of the sample information may include comma-separated values (CSV) files, txt files, excel files, etc., but is not limited to these.
[0029] In some embodiments, S110 may include: receiving sample information of multiple target samples sent by other electronic devices, or reading sample information of multiple target samples from a storage device, etc.
[0030] In other embodiments, the sample information includes a label, wherein S110 may include: S111, obtaining sample information of multiple original samples; S112, using the original sample labeled with a first preset label as the target sample to obtain sample information of multiple target samples.
[0031] Optionally, S111 may include: traversing the folder containing the original samples to obtain sample information for multiple original samples.
[0032] Specifically, each original sample can be stored in a file. The specific file format of the original sample can include txt files, excel files, etc., but is not limited to these.
[0033] Specifically, when the filename of the original sample includes sample information, it is possible to traverse all folders where the original sample is located to obtain the filename information of multiple original samples and extract the sample information from the filename information, but it is not limited to this.
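The folder traversal described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the file naming scheme `<case_id>_<sample_id>_<label>.txt` and the function name `collect_sample_info` are assumptions made only for this example, since the disclosure does not fix a naming scheme.

```python
import os

def collect_sample_info(root_dir):
    """Walk root_dir and parse sample information out of each file name.

    Assumes a hypothetical naming scheme '<case_id>_<sample_id>_<label>.txt'.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            stem, _ext = os.path.splitext(name)
            parts = stem.split("_")
            if len(parts) != 3:
                continue  # skip files that do not follow the assumed scheme
            case_id, sample_id, label = parts
            records.append({
                "case_id": case_id,
                "sample_id": sample_id,
                "label": label,
                "path": os.path.join(dirpath, name),
            })
    return records
```

Only file names are read here; the sample files themselves are never opened, so the cost of this step does not depend on the size of the samples.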
[0034] It is understood that because each original sample is stored in its own file, each original sample can contain multiple feature data. Accordingly, the generated dataset is a multivariate time series dataset. Thus, the embodiments of this disclosure can generate not only univariate datasets but also multivariate time series datasets.
[0035] Specifically, the specific type of the label can be a category, a number, etc., but it is not limited to these.
[0036] Specifically, the first preset label is a label that the original sample may have under normal circumstances. For example, for samples obtained to test whether lung cancer is present in the lungs, the labels that a sample may have under normal circumstances are "lung cancer" and "healthy". Therefore, the first preset labels are "lung cancer" and "healthy".
[0037] The original samples may include samples with abnormal labels. For example, among samples obtained to detect whether lung cancer is present in the lungs, a sample whose label is empty or is anything other than "lung cancer" or "healthy" has an abnormal label. Such samples should be discarded so that they do not affect the training accuracy of the model in subsequent training.
[0038] Understandably, by using the original samples labeled with the first preset label as the target samples, the original samples with abnormal labels can be filtered out. In this way, the target samples used to create the dataset have normal labels, which helps to improve the rigor of the dataset.
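As a minimal sketch of the filtering in S112, assuming each sample's information is held as a dictionary with a `label` key (a representation chosen here for illustration), samples whose label is not one of the first preset labels are simply dropped:

```python
# Example labels taken from the lung cancer illustration in the text.
FIRST_PRESET_LABELS = {"lung cancer", "healthy"}

def select_target_samples(original_samples, allowed_labels=FIRST_PRESET_LABELS):
    """Keep only the samples whose label is one of the allowed labels."""
    return [s for s in original_samples if s.get("label") in allowed_labels]

originals = [
    {"sample_id": "s1", "case_id": "c1", "label": "lung cancer"},
    {"sample_id": "s2", "case_id": "c1", "label": ""},          # empty label
    {"sample_id": "s3", "case_id": "c2", "label": "healthy"},
    {"sample_id": "s4", "case_id": "c3", "label": "unknown"},   # unexpected label
]
targets = select_target_samples(originals)
```

Here `targets` keeps samples `s1` and `s3`; the samples with an empty or unexpected label are discarded.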
[0039] S120. Partition the multiple target samples N times based on the subject identification to obtain an N-fold data set, where N is a positive integer. Each fold of the data set includes multiple datasets, and target samples corresponding to the same subject identification are located in the same dataset.
[0040] In this embodiment of the disclosure, each data partition performed on the multiple target samples yields one fold of the data set.
[0041] Specifically, the specific value of N can be set by those skilled in the art according to the actual situation, and is not limited here. For example, N = 10, but it is not limited to this.
[0042] Specifically, multiple datasets may include training sets, validation sets, test sets, etc., but are not limited to these.
[0043] In some embodiments, S120 may include: S1211, determining multiple subject identifiers corresponding to multiple target samples; S1212, performing the following steps N times: randomly selecting multiple subject identifiers according to a preset ratio and assigning them to multiple datasets; S1213, for each dataset in each fold of the dataset, storing the target sample corresponding to the subject identifier assigned to the dataset in the folder corresponding to the dataset.
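Steps S1211 to S1213 can be sketched as below, under stated assumptions: the function names, the rounding rule, and the choice to give the last dataset the remainder are illustrative and not specified by the disclosure. The split is made over subject identifiers, never over individual samples, which is what keeps all samples of one subject in one dataset.

```python
import random

def split_subjects(subject_ids, ratios, n_folds, seed=0):
    """Return n_folds independent random splits of the subject IDs.

    ratios maps dataset name -> fraction. The last dataset takes the
    remainder so that every subject is assigned exactly once per fold.
    """
    rng = random.Random(seed)
    names = list(ratios)
    folds = []
    for _ in range(n_folds):
        shuffled = list(subject_ids)
        rng.shuffle(shuffled)
        fold, start = {}, 0
        for name in names[:-1]:
            count = round(ratios[name] * len(shuffled))
            fold[name] = shuffled[start:start + count]
            start += count
        fold[names[-1]] = shuffled[start:]
        folds.append(fold)
    return folds

def samples_per_dataset(fold, samples):
    """Map each dataset name to the samples of the subjects assigned to it."""
    owner = {cid: name for name, ids in fold.items() for cid in ids}
    placed = {name: [] for name in fold}
    for sample in samples:
        placed[owner[sample["case_id"]]].append(sample)
    return placed
```

For instance, `split_subjects(ids, {"train": 0.5, "val": 0.3, "test": 0.2}, n_folds=10)` would produce ten folds, each a separate random assignment of the same subjects.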
[0044] In other embodiments, the sample information includes labels, and S120 may include: S122, dividing multiple target samples into N data segments based on the subject's identity and labels to obtain an N-fold data set.
[0045] Optionally, S122 may include: S1221, determining the target label corresponding to the subject's identity based on the label of the target sample corresponding to the subject's identity; S1222, dividing multiple target samples into N data segments based on the subject's identity and its corresponding target label to obtain an N-fold data set.
[0046] Specifically, since the same subject may have multiple samples, that is, the same subject identification may correspond to the labels of multiple target samples, and the labels of these multiple target samples may be the same or different, a unique target label can be determined for the subject identification.
[0047] Specifically, the target samples corresponding to the subject identification mentioned here refer to all target samples corresponding to the subject identification.
[0048] It is understandable that taking both the subject identification and the labels into account when partitioning the multiple target samples helps each dataset include target samples of multiple labels, thus enriching the sample content of the dataset.
[0049] This embodiment of the disclosure can acquire sample information of multiple target samples, wherein the sample information includes subject identification; the multiple target samples are partitioned N times according to the subject identification to obtain an N-fold data set, where N is a positive integer, and each fold of the data set includes multiple datasets, with target samples corresponding to the same subject identification located in the same dataset. With this technical solution, multiple target samples can be partitioned N times automatically to obtain an N-fold data set, which makes dataset creation fast and simple and saves manual labor. It also prevents multiple target samples of the same subject from being divided into different datasets, thereby mitigating the problem of overlapping data in different datasets and enhancing the rigor of the data set. This helps improve the overall training accuracy when the model is subsequently trained.
[0050] Figure 2 is a flowchart of another dataset creation method provided in an embodiment of this disclosure. This embodiment refines the above embodiments and can be combined with the optional solutions of one or more of them.
[0051] As shown in Figure 2, the dataset creation method may include the following steps.
[0052] S210. Obtain sample information for multiple target samples, including subject identification and labels.
[0053] Specifically, S210 and S110 are similar, and will not be described in detail here.
[0054] S220. Determine the target label corresponding to the subject's identity based on the label of the target sample corresponding to the subject's identity.
[0055] In some embodiments, S220 may include: S221, counting multiple subject identifiers corresponding to multiple target samples; S222, for each subject identifier, determining the target label corresponding to the subject identifier based on the label of the target sample corresponding to the subject identifier.
[0056] In one example, S222 may include: when the labels of the target samples corresponding to the subject identification are the same, the label of the target sample corresponding to the subject identification is used as the target label corresponding to the subject identification.
[0057] Specifically, when the number of target samples corresponding to a subject's identity is 1, or when the number of all target samples corresponding to a subject's identity is greater than 1 and the labels of the target samples corresponding to the subject's identity are the same, the label of the target sample corresponding to the subject's identity can be determined as the corresponding target label.
[0058] In another example, S222 may include: when the labels of the target samples corresponding to the subject identification are different, if the labels of the target samples corresponding to the subject identification include a second preset label, the second preset label shall be used as the target label corresponding to the subject identification. Optionally, if the labels of the target samples corresponding to the subject identification do not include the second preset label, a label may be randomly selected as the target label, or the label with the most occurrences may be selected as the target label, but this is not limited to these options.
[0059] Specifically, the different labels of the target samples corresponding to the subject identification means that the number of all target samples corresponding to the subject identification is greater than 1, and there are at least two different labels among the labels of the target samples corresponding to the subject identification.
[0060] Specifically, the second preset label is one of the labels that may appear in the target sample. Those skilled in the art can select the second preset label according to the actual situation, and there is no limitation here.
[0061] For example, in the detection of lung cancer, the target sample may be labeled as "lung cancer" or "healthy". When the second preset label is "lung cancer", if "lung cancer" appears in the labels of multiple target samples corresponding to the same subject identity, then the target label of that subject identity is determined to be "lung cancer".
[0062] In another example, S222 may include: when the labels of the target samples corresponding to the subject identity are different, selecting the label with the most occurrences from the labels of the target samples corresponding to the subject identity and using it as the target label corresponding to the subject identity.
[0063] For example, in the detection of lung cancer, if the labels of three target samples corresponding to the same subject identity are "lung cancer", "healthy" and "healthy", then the target label of the subject identity is determined to be "healthy".
[0064] It is understandable that by using either the second preset label or the most frequent label as the target label corresponding to the subject's identity, a unique and more realistic target label can be determined for the subject's identity, so that more accurate grouping results are obtained when the subject identities are later grouped according to the target label.
[0065] Optionally, the subject's identity and corresponding target label can be represented using a dictionary. Specifically, the dictionary key can be the subject's identity, and the value can be the target label.
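The label selection rules of S222 and the dictionary representation can be sketched as follows. The function names and the dictionary-based sample representation are assumptions for illustration; the order in which the two rules are applied (second preset label first, then majority vote) follows one of the options described above.

```python
from collections import Counter, defaultdict

def target_label(labels, second_preset_label=None):
    """Pick one label for a subject from the labels of all its samples."""
    unique = set(labels)
    if len(unique) == 1:
        return labels[0]  # a single sample, or all samples agree
    if second_preset_label is not None and second_preset_label in unique:
        return second_preset_label
    # Majority vote; a tie falls to whichever label Counter returns first.
    return Counter(labels).most_common(1)[0][0]

def build_target_labels(samples, second_preset_label=None):
    """Return a dictionary: subject ID (key) -> target label (value)."""
    by_subject = defaultdict(list)
    for sample in samples:
        by_subject[sample["case_id"]].append(sample["label"])
    return {cid: target_label(lbls, second_preset_label)
            for cid, lbls in by_subject.items()}
```

With the labels "lung cancer", "healthy", "healthy" for one subject, majority voting gives "healthy", whereas setting the second preset label to "lung cancer" gives "lung cancer", matching the two examples in the text.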
[0066] S230. Group the subject identity identifiers with the same target label into the same list to obtain multiple subject identity identifier lists.
[0067] Specifically, the list of subject identifiers can store either the subject identifier itself or the index value corresponding to the subject identifier, without limitation.
[0068] In some embodiments, when the type of the label is a category, the subject identifications corresponding to the target labels of the same category can be grouped into the same list to obtain multiple subject identification lists corresponding to multiple categories.
[0069] For example, for target samples obtained from the detection of whether lung cancer exists, the target labels are "lung cancer" and "healthy". The subject identity information corresponding to "lung cancer" can be divided into the same list, and the subject identity information corresponding to "healthy" can be divided into the same list, resulting in two subject identity information lists.
[0070] In other embodiments, when the type of the label is numerical, the subject identifications corresponding to the same numerical target label (or the target label within the same numerical range) can be grouped into the same list to obtain multiple subject identification lists corresponding to multiple numerical values (or numerical ranges).
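A minimal sketch of S230 follows, covering both categorical and numerical labels. The `bin_edges` parameter and the use of `bisect` to map a numerical label to a range are assumptions introduced here; the disclosure does not say how numerical ranges are defined.

```python
import bisect
from collections import defaultdict

def group_subjects_by_label(subject_to_label, bin_edges=None):
    """Group subject IDs that share a target label into the same list.

    For numerical labels, pass sorted bin_edges (hypothetical parameter):
    subjects whose labels fall in the same range share a list, keyed by
    the range index.
    """
    groups = defaultdict(list)
    for subject_id, label in subject_to_label.items():
        key = label
        if bin_edges is not None:
            key = bisect.bisect_right(bin_edges, label)
        groups[key].append(subject_id)
    return dict(groups)
```

For categorical labels the keys are the labels themselves, so the lung cancer example yields two lists, one per label.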
[0071] S240. Perform the following steps N times: For the subject identification list, randomly select subject identifications from the subject identification list according to a preset ratio and distribute them to multiple datasets.
[0072] Specifically, the preset ratio mentioned here refers to the ratio corresponding to each of the multiple datasets in the same fold of the data set.
[0073] Specifically, for each subject identification list, the subject identifications in the list are divided into multiple parts according to the proportions corresponding to the multiple datasets in the same fold, with each part corresponding to one dataset. Thus, for each dataset, the subject identifications assigned to it are the union of the parts allocated to it from all of the subject identification lists.
[0074] For example, suppose each fold of the data set includes a training set, a validation set, and a test set, accounting for 50%, 30%, and 20% respectively. For the list of subject identifiers corresponding to "lung cancer," 50% of the subject identifiers are randomly selected and assigned to the training set. From the remaining subject identifiers in this list, 30% (measured against the initial "lung cancer" list, not the remainder) are randomly selected and assigned to the validation set, and the subject identifiers still remaining are assigned to the test set. The list of subject identifiers corresponding to "healthy" is handled in the same way: 50% are randomly selected and assigned to the training set, 30% (measured against the initial "healthy" list) are randomly selected from the remainder and assigned to the validation set, and the rest are assigned to the test set. Of course, the order in which subject identifiers are assigned to the training set, validation set, and test set is not limited. Thus, the training set receives 50% of the subject identifiers randomly selected from the "lung cancer" list and 50% from the "healthy" list; the validation set receives 30% from each list; and the test set receives 20% from each list.
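The 50% / 30% / 20% example can be sketched as follows. This is an illustration under assumptions: the names `stratified_fold` and `make_folds`, the rounding rule, and the decision to give the last dataset the remainder are not taken from the disclosure. Each percentage is computed against the full length of a label's list, as the example specifies.

```python
import random

RATIOS = {"train": 0.5, "val": 0.3, "test": 0.2}

def stratified_fold(label_lists, ratios, rng):
    """Assign subject IDs to datasets, applying the ratios to each label list."""
    names = list(ratios)
    fold = {name: [] for name in names}
    for ids in label_lists.values():
        pool = list(ids)
        rng.shuffle(pool)
        start = 0
        for name in names[:-1]:
            # Percentage of the initial list, not of what remains.
            count = round(ratios[name] * len(ids))
            fold[name].extend(pool[start:start + count])
            start += count
        fold[names[-1]].extend(pool[start:])  # the rest goes to the last dataset
    return fold

def make_folds(label_lists, ratios=RATIOS, n_folds=10, seed=0):
    """Repeat the stratified assignment n_folds times."""
    rng = random.Random(seed)
    return [stratified_fold(label_lists, ratios, rng) for _ in range(n_folds)]
```

Because the ratios are applied per label list, every dataset in a fold receives subjects of every label in roughly the preset proportions, provided each list is large enough for the rounding to leave at least one subject per dataset.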
[0075] S250. For each dataset in each fold dataset, store the target sample corresponding to the subject identity identifier assigned to the dataset in the folder corresponding to the dataset.
[0076] In some embodiments, S250 may include: for each dataset in each fold dataset, indexing the sample information of the target sample corresponding to the subject identity of the dataset based on the sample information of multiple target samples, and storing the target sample of the dataset to the folder corresponding to the dataset based on the indexed sample information.
[0077] For example, the sample information of multiple target samples is stored as a CSV file that includes the subject identification, sample identifier, and label. Based on this sample information, the sample information of the target samples corresponding to a subject identification can be indexed; that is, the sample information of the target samples assigned to the dataset can be determined. The target samples assigned to the dataset can then be stored in the folder corresponding to the dataset based on the indexed sample information.
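The indexing and storage step can be sketched as below. The column names Case_ID, Sample_ID, and Label follow the example given later in this description; the assumption that each sample is stored as `<Sample_ID>.txt` under a single source directory, and the function name `materialize_fold`, are hypothetical.

```python
import csv
import os
import shutil

def materialize_fold(fold, csv_path, source_dir, out_dir):
    """Copy each sample file into the folder of its subject's dataset.

    fold maps dataset name -> list of subject IDs. Assumes each sample is
    stored as '<Sample_ID>.txt' under source_dir (hypothetical layout).
    """
    owner = {cid: name for name, ids in fold.items() for cid in ids}
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))  # metadata only, not the sample data
    placed = {name: [] for name in fold}
    for row in rows:
        name = owner.get(row["Case_ID"])
        if name is None:
            continue  # subject not assigned in this fold
        dest = os.path.join(out_dir, name)
        os.makedirs(dest, exist_ok=True)
        shutil.copy(os.path.join(source_dir, row["Sample_ID"] + ".txt"), dest)
        placed[name].append(row["Sample_ID"])
    return placed
```

Only the CSV metadata is held in memory; the sample files are copied one at a time after the assignment has already been decided.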
[0078] In other embodiments, S250 may include: for the labels of multiple target samples, grouping the same labels into the same list to obtain multiple label lists; for each dataset in each fold of the dataset, indexing the sample identifier of the target sample corresponding to the subject identity identifier assigned to the dataset from the multiple label lists, and storing the target sample assigned to the dataset into the folder corresponding to the dataset according to the indexed sample identifier.
[0079] Specifically, the label list can store labels and their corresponding sample identifiers and subject identity identifiers, or it can store the index values (i.e., indexes) corresponding to the labels and their corresponding sample identifiers and subject identity identifiers; there is no limitation on this.
[0080] In some embodiments, when the type of a label is a category, labels of the same category can be grouped into the same list to obtain multiple label lists corresponding to multiple categories.
[0081] In other embodiments, when the type of the label is numerical, the labels corresponding to the same numerical value (or the labels within the same numerical range) can be grouped into the same list to obtain multiple label lists corresponding to multiple numerical values (or numerical ranges).
[0082] This embodiment of the disclosure builds a list of subject identifiers for each target label, randomly selects subject identifiers from each list according to a preset ratio, and assigns them to multiple datasets. For each dataset, the target samples corresponding to the assigned subject identifiers are stored in the corresponding folder. This allows the datasets to include target samples with multiple labels, and ensures that target samples corresponding to the same subject identifier are located in the same dataset. This improves the diversity of sample types in the dataset and enhances the rigor of the dataset. Furthermore, related technologies that generate datasets directly with Python packages such as sklearn load the entire sample data into a cache and operate on the samples directly, creating multiple datasets at once. This places high demands on hardware memory and I/O, making such approaches unsuitable for large-scale data. In contrast, this embodiment first determines from the sample information which target samples are assigned to each dataset, and only then operates on the target samples themselves and stores them in the folders corresponding to their respective datasets. This reduces the demands on hardware memory and I/O, so the method remains applicable to large-scale data.
[0083] In another implementation, optionally, after dividing multiple target samples into N data segments based on the subject's identity to obtain an N-fold data set, the method further includes: performing pairwise cross-validation on multiple datasets in each fold data set; and issuing an error alarm when a target sample corresponding to the same subject's identity exists in two datasets.
[0084] Specifically, cross-validation includes verifying whether multiple target samples corresponding to the same subject identity are located in two separate datasets. It can also verify whether the same target sample is present in both datasets simultaneously, and issue an error alert if the same target sample exists in both datasets.
[0085] Understandably, by performing pairwise cross-validation on the multiple datasets within the same fold of the data set, error alerts can be issued promptly when overlapping data appears between datasets, allowing for manual verification and correction. This further mitigates the problem of overlapping data between datasets and enhances the rigor of the datasets.
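The pairwise check can be sketched as follows, assuming each dataset is represented as a list of (subject ID, sample ID) pairs. The function names and the use of an exception as the "error alarm" are choices made for this illustration.

```python
from itertools import combinations

def find_overlaps(fold_samples):
    """Compare every pair of datasets in one fold.

    fold_samples maps dataset name -> list of (case_id, sample_id) pairs.
    Returns one entry per pair of datasets that share a subject or a sample.
    """
    problems = []
    for (a, rows_a), (b, rows_b) in combinations(fold_samples.items(), 2):
        shared_subjects = {c for c, _ in rows_a} & {c for c, _ in rows_b}
        shared_samples = {s for _, s in rows_a} & {s for _, s in rows_b}
        if shared_subjects or shared_samples:
            problems.append((a, b, sorted(shared_subjects), sorted(shared_samples)))
    return problems

def check_fold(fold_samples):
    """Raise an error (the 'alarm' in this sketch) if any overlap is found."""
    problems = find_overlaps(fold_samples)
    if problems:
        raise ValueError(f"overlapping data between datasets: {problems}")
```

Running `check_fold` on every fold after partitioning gives an automatic guard against the same subject or sample appearing in two datasets.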
[0086] The dataset creation method provided in this disclosure will now be described in detail based on a specific example.
[0087] Figure 3 is a flowchart of a dataset creation process provided in an embodiment of this disclosure.
[0088] As shown in Figure 3, the process of creating this dataset is as follows:
[0089] First, data (i.e., raw sample) statistics are performed, traversing various data folders to generate CSV statistical documents. Specifically, sample information for multiple raw samples is obtained from various data folders, and a CSV statistical document is generated based on this information, including Case_ID (i.e., subject identification), Sample_ID (i.e., sample identifier), and Label (i.e., tag). Next, category selection and subject count are performed based on the CSV statistical document. Specifically, all sample information corresponding to the labels that need to be filtered out is deleted, and the data is rearranged to obtain statistical information such as sample identifiers, index values, and labels, identifying all subject identifications. Then, label checking is performed. Specifically, all samples for each subject are found based on their subject identification, and the unique label corresponding to each subject identification (i.e., target label) is selected based on the labels of these samples. There are two selection criteria. One is to use the first preset label as the unique label when it exists in the sample's labels (e.g., if the label corresponds to a 2-category classification, 0 represents negative and 1 represents positive; for multiple samples from the same subject, if a positive label appears, it means the subject is a positive patient, and positive is used as the unique label; otherwise, negative is used as the unique label). The other is to use the label that appears most frequently in the sample as the unique label corresponding to the subject's identity (labels are selected by voting, and the label category with the most occurrences is used as the unique label, because there may be labeling errors in multiple samplings of the same subject). 
Finally, the selected labels are matched one-to-one with the subject's identity and re-labeled, represented by a dictionary, which represents the subject's identity and the corresponding label. Then, subject and label statistics, as well as sample and label statistics, are performed. Specifically, subject-level label statistics are performed, with the index of each subject's identity indices compiled into a list, and sample-level label statistics are performed, with the index of each sample indices compiled into a list. Then, construct the training set, test set, and validation set indexes for the subjects according to their categories. Specifically, create an N-fold (e.g., 10) cross-validation set (i.e., dataset): (1) Count the number of subjects in each category at the subject level and rearrange the subject indexes. (2) For each cross-validation set, randomly select a preset proportion of subject identifiers from the index list of each subject category (i.e., the list of index statistics of subject identifiers for each category) and assign them to the training set, test set, and validation set, and integrate the subject identifiers of each category to each dataset. (3) Select the sample identifiers corresponding to the subject identifiers through the subject identifiers of each dataset to ensure that all samples under the same subject identifier are selected into one dataset. (4) Define a description file to describe the number of samples corresponding to each category in the training set, test set, and validation set, as well as the total number of samples in the dataset.(5) For each dataset in the N-fold cross-validation set, save the sample identifier, subject identity identifier, and label of the samples assigned to the dataset to obtain an index document. Then, allocate N-fold datasets to local data. Specifically, during the dataset creation process, read the index document of the N-fold cross-validation set generated above. 
In the subsequent allocation and creation of the N-fold data set, the local data is indexed by the Case_ID, Sample_ID, and Label in the index document, and the samples of each dataset in each fold are placed in the corresponding category folder, completing the creation of the N-fold cross-validation data set. Finally, pairwise validation is used to verify that no samples are shared between the datasets.
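The label-checking step described above can be illustrated with a minimal Python sketch. The function name resolve_subject_labels, the rule names "priority" and "vote", and the tuple layout (Case_ID, Sample_ID, Label) are assumptions made for illustration and do not come from this disclosure. Falling back to voting when the preset label is absent is also an assumption; in the two-class example above it yields the same result as using the negative label.

```python
from collections import Counter, defaultdict


def resolve_subject_labels(samples, rule="vote", priority_label=1):
    """Map each Case_ID to a single target label.

    samples: iterable of (case_id, sample_id, label) tuples.
    rule="priority": use priority_label if any sample of the subject carries it.
    rule="vote": use the most frequent label among the subject's samples.
    """
    # Collect all labels observed for each subject.
    labels_by_case = defaultdict(list)
    for case_id, _sample_id, label in samples:
        labels_by_case[case_id].append(label)

    target = {}
    for case_id, labels in labels_by_case.items():
        counts = Counter(labels)
        if rule == "priority" and priority_label in counts:
            target[case_id] = priority_label
        else:
            # Most frequent label; ties go to the label seen first.
            target[case_id] = counts.most_common(1)[0][0]
    return target
```

With two negative samples and one positive sample for the same subject, the "priority" rule marks the subject positive, while the "vote" rule marks the subject negative.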
[0090] This embodiment of the disclosure can automatically allocate multivariate time series samples according to a preset ratio to obtain the datasets, making dataset creation quick and simple and saving manual labor. Furthermore, when a single subject has multiple samples, a target label can be determined for the subject, and all of that subject's samples are placed in the same dataset, so that the training set, validation set, and test set share no samples. This avoids the situation in which multiple samples from the same subject are assigned to different datasets and cause overlapping data, thereby making the sample data for machine learning more accurate and improving the accuracy of model training.
[0091] Figure 4 is a schematic diagram of a dataset creation apparatus provided in an embodiment of this disclosure. The dataset creation apparatus can be understood as the aforementioned electronic device or as a functional module within it. As shown in Figure 4, the dataset creation apparatus 400 includes:
[0092] The acquisition module 410 is used to acquire sample information of multiple target samples, wherein the sample information includes subject identification;
[0093] The data partitioning module 420 is used to partition multiple target samples N times according to the subject's identity to obtain an N-fold data set, where N is a positive integer, and each fold data set includes multiple datasets, with target samples corresponding to the same subject's identity located in the same dataset.
[0094] In another embodiment of this disclosure, the sample information further includes a label, wherein the acquisition module 410 may include:
[0095] The acquisition submodule is used to acquire sample information from multiple raw samples;
[0096] The first determining submodule is used to take the raw samples labeled with the first preset label as the target samples and to obtain the sample information of the multiple target samples.
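As a minimal sketch of the two submodules above, the following Python function reads sample information in CSV form and keeps only the samples whose label is among the labels to be retained. The function name load_target_samples and the kept_labels parameter are hypothetical; accepting several retained labels rather than a single preset label is an assumption made for illustration. The column names follow the Case_ID, Sample_ID, and Label fields of the statistics document described earlier.

```python
import csv
import io


def load_target_samples(csv_text, kept_labels):
    """Return the rows of a Case_ID/Sample_ID/Label CSV whose label is retained."""
    # CSV fields are read as strings, so compare labels as strings.
    kept = {str(label) for label in kept_labels}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Label"] in kept]
```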
[0097] In another embodiment of this disclosure, the sample information further includes labels, wherein the data partitioning module 420 may include:
[0098] The second determination submodule is used to determine the target label corresponding to the subject's identity based on the label of the target sample corresponding to the subject's identity.
[0099] The data partitioning submodule is used to partition multiple target samples N times based on the subject's identity identifier and its corresponding target label to obtain an N-fold data set.
[0100] In another embodiment of this disclosure, the second determining submodule may include:
[0101] The first determining unit is used to, when the labels of the target samples corresponding to the subject identification differ and include a second preset label, take the second preset label as the target label corresponding to the subject identification.
[0102] In another embodiment of this disclosure, the second determining submodule may include:
[0103] The second determining unit is used to, when the labels of the target samples corresponding to the subject identification differ, select the label that occurs most often among them as the target label corresponding to the subject identification.
[0104] In another embodiment of this disclosure, the data partitioning submodule may include: a first data partitioning submodule, used to partition subject identity identifiers with the same corresponding target label into the same list to obtain multiple subject identity identifier lists;
[0105] The second data partitioning submodule is used to perform the following steps N times: for the subject identification list, select the subject identifications in the subject identification list according to a preset ratio and distribute them to multiple datasets;
[0106] The third data partitioning submodule is used to store the target samples corresponding to the subject identification identifiers assigned to each dataset in each fold of the data set into the folder corresponding to the dataset.
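The three partitioning submodules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of this disclosure: the function name make_folds, the dataset names "train", "val", and "test", the 70/15/15 ratio, and the fixed random seed are all hypothetical, and copying files into per-dataset folders is replaced by returning lists of (Case_ID, Sample_ID) pairs.

```python
import random
from collections import defaultdict


def make_folds(target_labels, samples, n_folds=10, ratios=(0.7, 0.15, 0.15), seed=0):
    """Build n_folds subject-level splits, stratified by target label.

    target_labels: dict mapping case_id -> target label.
    samples: iterable of (case_id, sample_id, label) tuples.
    Returns a list of dicts: dataset name -> list of (case_id, sample_id).
    """
    names = ("train", "val", "test")
    rng = random.Random(seed)

    # One subject list per target label.
    by_label = defaultdict(list)
    for case_id, label in sorted(target_labels.items()):
        by_label[label].append(case_id)

    # All samples of a subject, so they can be moved together.
    samples_by_case = defaultdict(list)
    for case_id, sample_id, _label in samples:
        samples_by_case[case_id].append(sample_id)

    folds = []
    for _ in range(n_folds):
        subjects = {name: [] for name in names}
        for case_ids in by_label.values():
            shuffled = case_ids[:]
            rng.shuffle(shuffled)
            n_train = round(len(shuffled) * ratios[0])
            n_val = round(len(shuffled) * ratios[1])
            # Disjoint slices: each subject lands in exactly one dataset.
            subjects["train"] += shuffled[:n_train]
            subjects["val"] += shuffled[n_train:n_train + n_val]
            subjects["test"] += shuffled[n_train + n_val:]
        folds.append({
            name: [(cid, sid) for cid in subjects[name] for sid in samples_by_case[cid]]
            for name in names
        })
    return folds
```

Because the split is made over subject identifications rather than over samples, every sample of a subject follows that subject into a single dataset within a given fold.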
[0107] In another embodiment of this disclosure, the apparatus may further include:
[0108] The cross-validation module is used to perform pairwise cross-validation on the multiple datasets in each fold of the data set after the multiple target samples have been partitioned N times based on the subject identification to obtain the N-fold data set.
[0109] The alarm module is used to issue an error alarm when target samples corresponding to the same subject identification exist in two datasets.
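A minimal Python sketch of the pairwise check and alarm described above is given below. The function name check_no_shared_subjects and the use of a ValueError as the error alarm are assumptions made for illustration.

```python
from itertools import combinations


def check_no_shared_subjects(fold):
    """Raise ValueError if any two datasets of a fold share a subject.

    fold: dict mapping dataset name -> iterable of Case_IDs.
    """
    # Compare every unordered pair of datasets in the fold.
    for (name_a, cases_a), (name_b, cases_b) in combinations(fold.items(), 2):
        shared = set(cases_a) & set(cases_b)
        if shared:
            raise ValueError(
                f"{name_a} and {name_b} share subjects: {sorted(shared)}"
            )
```

For a fold with a training set, a validation set, and a test set, this compares three pairs; applying it to each of the N folds covers the whole N-fold data set.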
[0110] The apparatus provided in this embodiment can execute the methods of any of the above embodiments, and its execution method and beneficial effects are similar, so they will not be described again here.
[0111] This disclosure also provides an electronic device, which includes: a memory storing a computer program; and a processor for executing the computer program, wherein when the computer program is executed by the processor, it can implement the methods of any of the above embodiments.
[0112] By way of example, Figure 5 is a schematic diagram of the structure of an electronic device according to an embodiment of this disclosure. Referring to Figure 5, it shows a structure suitable for implementing the electronic device 500 in the embodiments of this disclosure. The electronic device 500 in the embodiments of this disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0113] As shown in Figure 5, the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
[0114] Typically, the following devices can be connected to the I/O interface 505: input devices 506 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 508 including, for example, magnetic tapes, hard disks, etc.; and communication devices 509. The communication device 509 allows the electronic device 500 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 5 shows an electronic device 500 with various devices, it should be understood that not all of the devices shown need to be implemented or present. More or fewer devices may alternatively be implemented or present.
[0115] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 509, or installed from a storage device 508, or installed from a ROM 502. When the computer program is executed by the processing device 501, it performs the functions defined in the methods of embodiments of this disclosure.
[0116] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0117] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol such as HTTP (Hypertext Transfer Protocol) and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), the Internet (e.g., the Internet of Things), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0118] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.
[0119] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to: acquire sample information of multiple target samples, wherein the sample information includes subject identification; and partition the multiple target samples N times according to the subject identification to obtain an N-fold data set, wherein N is a positive integer, each fold of the data set includes multiple datasets, and target samples corresponding to the same subject identification are located in the same dataset.
[0120] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0121] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0122] The units described in the embodiments of this disclosure can be implemented in software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
[0123] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0124] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0125] This disclosure also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it can implement the methods of any of the above embodiments. The execution method and beneficial effects are similar, and will not be described again here.
[0126] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0127] The above description is merely a specific embodiment of this disclosure, enabling those skilled in the art to understand or implement it. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this disclosure. Therefore, this disclosure is not to be limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for creating a dataset, characterized by comprising: acquiring sample information of multiple target samples, wherein the sample information includes subject identification; and partitioning the multiple target samples N times based on the subject identification to obtain an N-fold data set, where N is a positive integer, each fold of the data set includes multiple datasets, and target samples corresponding to the same subject identification are located in the same dataset; wherein the sample information further includes labels, and partitioning the multiple target samples N times based on the subject identification to obtain the N-fold data set comprises: determining the target label corresponding to the subject identification based on the labels of the target samples corresponding to the subject identification; grouping subject identifications having the same target label into the same list to obtain multiple subject identification lists; performing the following step N times: for the subject identification list, randomly selecting subject identifications from the subject identification list according to a preset ratio and assigning them to the multiple datasets; and, for each dataset in each fold of the data set, storing the target samples corresponding to the subject identifications assigned to the dataset in the folder corresponding to the dataset.
2. The method according to claim 1, characterized in that acquiring the sample information of the multiple target samples comprises: acquiring sample information of multiple raw samples; and taking the raw samples labeled with a first preset label as the target samples to obtain the sample information of the multiple target samples.
3. The method according to claim 1, characterized in that determining the target label corresponding to the subject identification based on the labels of the target samples corresponding to the subject identification comprises: when the labels of the target samples corresponding to the subject identification differ and include a second preset label, taking the second preset label as the target label corresponding to the subject identification.
4. The method according to claim 1, characterized in that determining the target label corresponding to the subject identification based on the labels of the target samples corresponding to the subject identification comprises: when the labels of the target samples corresponding to the subject identification differ, selecting the label with the most occurrences among them as the target label corresponding to the subject identification.
5. The method according to claim 1, characterized in that, after partitioning the multiple target samples N times based on the subject identification to obtain the N-fold data set, the method further comprises: for each fold of the data set, performing pairwise cross-validation on the multiple datasets; and issuing an error alert when target samples corresponding to the same subject identification exist in two datasets.
6. A dataset creation apparatus, characterized by comprising: an acquisition module, used to acquire sample information of multiple target samples, wherein the sample information includes subject identification; and a data partitioning module, used to partition the multiple target samples N times according to the subject identification to obtain an N-fold data set, where N is a positive integer, each fold of the data set includes multiple datasets, and target samples corresponding to the same subject identification are located in the same dataset; wherein the sample information further includes labels, and partitioning the multiple target samples N times based on the subject identification to obtain the N-fold data set comprises: determining the target label corresponding to the subject identification based on the labels of the target samples corresponding to the subject identification; grouping subject identifications having the same target label into the same list to obtain multiple subject identification lists; performing the following step N times: for the subject identification list, randomly selecting subject identifications from the subject identification list according to a preset ratio and assigning them to the multiple datasets; and, for each dataset in each fold of the data set, storing the target samples corresponding to the subject identifications assigned to the dataset in the folder corresponding to the dataset.
7. An electronic device, characterized by comprising: a processor and a memory, wherein the memory stores a computer program that, when executed by the processor, performs the method of any one of claims 1-5.
8. A computer-readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, implements the method of any one of claims 1-5.