Data processing method, apparatus, computing device, and computer-readable storage medium

By constructing an irrelevant dataset and iteratively updating the sample weight distribution, the problem of low efficiency in dataset bias detection in machine learning is solved, achieving efficient and accurate dataset bias elimination and improved model robustness.

CN115471714B · Active · Publication Date: 2026-06-12 · HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD
Filing Date
2021-05-25
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively detect and eliminate dataset bias in machine learning, leading to performance degradation or ethical issues in practical applications. Furthermore, existing detection methods are inefficient and inaccurate.

Method used

An irrelevant dataset is constructed and divided into a training set and a test set; a classification model is trained and evaluated based on the sample weight distribution; the sample weight distribution is iteratively updated until a threshold is met; and an unbiased dataset is then constructed to train a robust model.

Benefits of technology

It achieves efficient and automated dataset bias detection and elimination, improves the robustness and accuracy of the model, and reduces human resource consumption.

✦ Generated by Eureka AI based on patent content.

Abstract

Embodiments of the present disclosure provide a data processing method, apparatus, computing device and computer readable storage medium. In the method, an irrelevant dataset with labels is constructed based on a dataset to be processed; the irrelevant dataset is divided into a first dataset with a first sample weight distribution and a second dataset with a second sample weight distribution, the first and second sample weight distributions being determined based on sample weights of data items in the dataset to be processed; a classification model is trained based on the first dataset and the first sample weight distribution; and the classification model is evaluated based on the second dataset and the second sample weight distribution to obtain an evaluation result indicating bias significance of the dataset to be processed with the sample weight distribution. In this way, embodiments of the present disclosure can make more accurate judgments on bias significance of a dataset.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence, and more specifically, to a data processing method, apparatus, computing device, and computer-readable storage medium.

Background Technology

[0002] Dataset bias is a widespread problem in machine learning, especially deep learning, with a significant negative impact that is difficult to detect and easily overlooked. Particularly in scenarios where model safety is critical, training on biased datasets can lead to serious incidents in real-world applications.

[0003] Currently, bias in datasets is checked by guesswork or on the basis of experience. This approach requires substantial human effort, is inefficient and inaccurate, and cannot meet practical needs.

Summary of the Invention

[0004] Example embodiments of this disclosure provide a data processing method that includes a scheme for evaluating dataset bias, enabling more accurate checks on dataset bias.

[0005] In a first aspect, a data processing method is provided. The method includes: constructing an irrelevant dataset based on a dataset to be processed, the irrelevant dataset including labeled irrelevant data items, the labels of which are determined based on the labels of the data items to be processed in the dataset to be processed; dividing the irrelevant dataset into a first dataset and a second dataset, the first dataset having a first sample weight distribution and the second dataset having a second sample weight distribution, the first and second sample weight distributions being determined based on the sample weights of the data items to be processed in the dataset to be processed; training a classification model based on the first dataset and the first sample weight distribution; and evaluating the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of bias in the dataset to be processed with the sample weight distribution.
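The four steps of the first aspect can be sketched as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the nearest-centroid classifier, the numeric feature vectors, the function names (`split_dataset`, `train_weighted_centroids`, `predict`, `weighted_accuracy`), and the use of weighted accuracy as the evaluation result are all hypothetical, since the disclosure does not fix the model or the metric.

```python
import random

def split_dataset(items, weights, test_fraction=0.5, seed=0):
    """Divide irrelevant items (features, label) and their sample weights into two parts."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    first, second = idx[:cut], idx[cut:]
    return ([items[i] for i in first], [weights[i] for i in first],
            [items[i] for i in second], [weights[i] for i in second])

def train_weighted_centroids(items, weights):
    """Weighted mean feature vector per label (assumed stand-in for any classifier)."""
    sums, totals = {}, {}
    for (x, y), w in zip(items, weights):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += w * v
        totals[y] = totals.get(y, 0.0) + w
    return {y: [v / totals[y] for v in acc]
            for y, acc in sums.items() if totals[y] > 0}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def weighted_accuracy(centroids, items, weights):
    """Evaluation result: share of total sample weight that is classified correctly."""
    total = sum(weights)
    if total == 0:
        return 0.0
    correct = sum(w for (x, y), w in zip(items, weights)
                  if predict(centroids, x) == y)
    return correct / total
```

Under this sketch, a high weighted accuracy means the label can be predicted from label-irrelevant content alone, which is read here as a sign of significant bias.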

[0006] Thus, through the embodiments of this disclosure, the significance of bias in a dataset can be assessed more accurately. This assessment method facilitates user adjustments and other processing of the dataset.

[0007] In some embodiments of the first aspect, the method further includes: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the dataset to be processed; and repeating the training and evaluation based on the updated sample weight distribution until the evaluation result is not greater than the preset threshold.

[0008] Thus, embodiments of this disclosure can update the sample weight distribution of the dataset to be processed based on a trained classification model, thereby obtaining a recommended sample weight distribution. This process does not require user intervention, is highly efficient, and has a high degree of automation.
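The repeat-until-threshold behaviour described above can be sketched as a small driver loop. This is a hedged illustration: `recommend_weights`, the `evaluate` and `update` callables, and the `max_rounds` safety limit are hypothetical names and additions that do not appear in the disclosure.

```python
def recommend_weights(weights, evaluate, update, threshold, max_rounds=100):
    """Repeat train/evaluate and weight updates until the result is <= threshold.

    evaluate(weights) -> evaluation result (bias significance) for these weights.
    update(weights, score) -> new sample weight distribution.
    Returns (weights, history of evaluation results, converged flag).
    """
    history = []
    for _ in range(max_rounds):
        score = evaluate(weights)          # train and evaluate on the irrelevant dataset
        history.append(score)
        if score <= threshold:             # "not greater than the preset threshold"
            return weights, history, True  # these are the recommended weights
        weights = update(weights, score)
    return weights, history, False         # assumed guard: stop after max_rounds
```

The `max_rounds` guard is an assumption added so that the sketch always terminates; the source only states that the loop ends when the evaluation result is not greater than the preset threshold.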

[0009] In some embodiments of the first aspect, updating the sample weight distribution includes: updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.

[0010] In some embodiments of the first aspect, the method further includes: using the sample weight distribution when the evaluation result is not greater than a preset threshold as the recommended sample weight distribution.

[0011] Thus, embodiments of this disclosure can update the sample weight distribution based on iteratively training a classification model, and can observe the changes in dataset bias as the sample weight distribution is updated. This allows for iterative detection of the dataset to be processed, resulting in an effective and highly accurate recommended sample weight distribution.

[0012] In some embodiments of the first aspect, the method further includes: adding data to or deleting data from the dataset to be processed based on the recommended sample weight distribution to construct an unbiased dataset.

[0013] Thus, in the embodiments of this disclosure, data can be added to or deleted from the dataset to be processed based on the recommended sample weight distribution, thereby constructing an unbiased dataset. Furthermore, this unbiased dataset can be used to train more robust and unbiased models for specific tasks, thereby meeting practical needs.
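One way to picture adding or deleting data according to a recommended sample weight distribution is weight-driven resampling. The sketch below is an assumption rather than the disclosed procedure: it treats each weight as a desired replication factor, so a weight of 0 deletes an item and a weight of 2 duplicates it, whereas in practice "adding" may instead mean collecting new data. The function name `resample_by_weight` is hypothetical.

```python
import random

def resample_by_weight(items, weights, seed=0):
    """Build a new dataset where item i appears about weights[i] times.

    The integer part of a weight gives guaranteed copies; the fractional
    part is the probability of one extra copy (stochastic rounding).
    """
    rng = random.Random(seed)
    out = []
    for item, w in zip(items, weights):
        copies = int(w)
        if rng.random() < w - copies:   # never true when w is a whole number
            copies += 1
        out.extend([item] * copies)
    return out
```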

[0014] In some embodiments of the first aspect, updating the sample weight distribution includes at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining user modifications to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.
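Two of the listed update strategies, a predetermined rule and a random update, can be sketched as follows. Both are illustrative assumptions: the specific rule (down-weighting samples that the classifier predicted correctly from label-irrelevant content), the Gaussian perturbation, and the names `update_by_rule` and `update_randomly` are not taken from the disclosure. User modification and genetic-algorithm optimization are not shown.

```python
import random

def update_by_rule(weights, predicted_correctly, factor=0.5):
    """Assumed rule: shrink the weight of samples whose label was predicted
    correctly from irrelevant content, since they may carry the bias."""
    return [w * factor if ok else w
            for w, ok in zip(weights, predicted_correctly)]

def update_randomly(weights, scale=0.1, seed=None):
    """Assumed random update: add Gaussian noise and clip weights at zero."""
    rng = random.Random(seed)
    return [max(0.0, w + rng.gauss(0.0, scale)) for w in weights]
```

Either function could serve as the update step of an iterative loop that re-trains and re-evaluates the classification model after each change.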

[0015] In some embodiments of the first aspect, constructing an irrelevant dataset based on a dataset to be processed includes: removing a portion of a target data item to be processed from the dataset to be processed that is associated with the label of the target data item to be processed, so as to obtain a remainder in the target data item to be processed; and using the remainder to construct an irrelevant data item in the irrelevant dataset, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.

[0016] In some embodiments of the first aspect, the dataset to be processed is an image dataset, and the construction of an irrelevant dataset based on the dataset to be processed includes: performing image segmentation on a target data item to be processed in the dataset to be processed to obtain a background image corresponding to the target data item to be processed; and using the background image to construct an irrelevant data item in the irrelevant dataset.

[0017] Thus, in the embodiments of this disclosure, the background image is used as a representative of bias, thereby enabling bias checking of the dataset.

[0018] In some embodiments of the first aspect, the data item to be processed in the dataset to be processed is a video sequence, and the construction of an irrelevant dataset based on the dataset to be processed includes: determining a binary image of the video sequence based on gradient information between a frame of the video sequence and the frame immediately preceding it; generating a background image of the video sequence based on the binary image; and using the background image of the video sequence to construct an irrelevant data item in the irrelevant dataset.

[0019] Thus, considering the similarity between the frames in the video sequence and the fact that the background remains basically unchanged in the video sequence, the background image corresponding to the video sequence can be obtained.
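The video case can be sketched with simple frame differencing. The code below is a hedged illustration: it assumes grayscale frames stored as nested lists, interprets "gradient information" as the absolute difference between consecutive frames, and averages each pixel over the frames in which it is static. The threshold value and the names `binary_motion_mask` and `estimate_background` are hypothetical.

```python
def binary_motion_mask(prev, cur, thresh=10):
    """Binary image: 1 where a pixel changed by more than thresh, else 0."""
    return [[1 if abs(c - p) > thresh else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, cur)]

def estimate_background(frames, thresh=10):
    """Average each pixel over the frames in which the binary image marks it static."""
    h, w = len(frames[0]), len(frames[0][0])
    sums = [[0.0] * w for _ in range(h)]
    counts = [[0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        mask = binary_motion_mask(prev, cur, thresh)
        for i in range(h):
            for j in range(w):
                if mask[i][j] == 0:
                    sums[i][j] += cur[i][j]
                    counts[i][j] += 1
    # Assumed fallback: a pixel that is never static keeps its first-frame value.
    return [[sums[i][j] / counts[i][j] if counts[i][j] else float(frames[0][i][j])
             for j in range(w)] for i in range(h)]
```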

[0020] In some embodiments of the first aspect, the method further includes: obtaining a class activation map (CAM) by inputting a target irrelevant data item into the trained classification model; obtaining a superposition result by superimposing the CAM on the target irrelevant data item; and displaying the superposition result.
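The CAM and superposition steps can be sketched as below. This is an illustration under assumptions: the CAM is computed in its standard form, as a class-weighted sum of the final convolutional feature maps, and is assumed to already have the same size as the image (in practice it is usually upsampled first). The nested-list image format, the alpha blending, and the names `class_activation_map` and `overlay` are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM(i, j) = sum_k w_k * f_k(i, j), min-max normalised to [0, 1]."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(wk * fm[i][j] for wk, fm in zip(class_weights, feature_maps))
            for j in range(w)] for i in range(h)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0              # avoid division by zero for a flat map
    return [[(v - lo) / span for v in row] for row in cam]

def overlay(image, cam, alpha=0.5):
    """Superimpose the CAM on a grayscale image by alpha blending."""
    return [[(1 - alpha) * p + alpha * 255.0 * c for p, c in zip(prow, crow)]
            for prow, crow in zip(image, cam)]
```

Bright regions in the blended result mark the label-irrelevant areas that the classification model relied on, which is where the bias is located.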

[0021] Therefore, the embodiments of this disclosure provide a scheme for quantitatively assessing dataset bias, thereby clearly characterizing the salience of dataset bias and visually presenting the specific locations where bias occurs. This allows users to understand the dataset bias situation more intuitively and comprehensively. This scheme requires minimal user intervention, can be automated, and improves processing efficiency while ensuring the accuracy of quantitative bias assessment.

[0022] In a second aspect, a data processing apparatus is provided. The apparatus includes: a construction unit configured to construct an irrelevant dataset based on a dataset to be processed, the irrelevant dataset including labeled irrelevant data items, the labels of which are determined based on the labels of data items to be processed in the dataset to be processed; a partitioning unit configured to partition the irrelevant dataset into a first dataset and a second dataset, the first dataset having a first sample weight distribution and the second dataset having a second sample weight distribution, the first and second sample weight distributions being determined based on the sample weights of data items to be processed in the dataset to be processed; a training unit configured to train a classification model based on the first dataset and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result indicating the significance of bias in the dataset to be processed having the sample weight distribution.

[0023] In some embodiments of the second aspect, an update unit is also included, configured to update the sample weight distribution of the dataset to be processed if the evaluation result is greater than a preset threshold.

[0024] In some embodiments of the second aspect, the update unit is configured to update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.

[0025] In some embodiments of the second aspect, the update unit is configured to use the sample weight distribution when the evaluation result is not greater than a preset threshold as the recommended sample weight distribution.

[0026] In some embodiments of the second aspect, an adjustment unit is also included, configured to add or delete data in the dataset to be processed based on the recommended sample weight distribution to construct an unbiased dataset.

[0027] In some embodiments of the second aspect, the update unit is configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining user modifications to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.

[0028] In some embodiments of the second aspect, the construction unit is configured to: remove the portion associated with the label of the target data item from the target data item in the dataset to be processed, so as to obtain the remaining portion in the target data item; and use the remaining portion to construct an irrelevant data item in the irrelevant dataset, wherein the label of the irrelevant data item corresponds to the label of the target data item.

[0029] In some embodiments of the second aspect, the dataset to be processed is an image dataset, and the construction unit is configured to: perform image segmentation on a target data item to be processed in the dataset to obtain a background image corresponding to the target data item; and use the background image to construct an irrelevant data item in the irrelevant dataset.

[0030] In some embodiments of the second aspect, the data item to be processed in the dataset to be processed is a video sequence, and the construction unit is configured to: determine a binary image of the video sequence based on gradient information between a frame of the video sequence and the frame immediately preceding it; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant dataset using the background image of the video sequence.

[0031] In some embodiments of the second aspect, the apparatus further includes: an update unit configured to: obtain a CAM by inputting a target irrelevant data item into the trained classification model; and obtain a superposition result by superimposing the CAM on the target irrelevant data item; and a display unit configured to display the superposition result.

[0032] In a third aspect, a computing device is provided, including a processor and a memory, wherein the memory stores instructions that, when executed by the processor, cause the computing device to: construct an irrelevant dataset based on a dataset to be processed, the irrelevant dataset including irrelevant data items with labels, the labels of the irrelevant data items being determined based on the labels of the data items to be processed in the dataset to be processed; divide the irrelevant dataset into a first dataset and a second dataset, the first dataset having a first sample weight distribution and the second dataset having a second sample weight distribution, the first sample weight distribution and the second sample weight distribution being determined based on the sample weights of the data items to be processed in the dataset to be processed; train a classification model based on the first dataset and the first sample weight distribution; and evaluate the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the significance of bias in the dataset to be processed having the sample weight distribution.

[0033] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device performs the following: if the evaluation result is greater than a preset threshold, the sample weight distribution of the dataset to be processed is updated.

[0034] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device is caused to update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.

[0035] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device implements the following: using the sample weight distribution when the evaluation result is not greater than a preset threshold as the recommended sample weight distribution.

[0036] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device performs the following: adding or deleting data in the dataset to be processed based on the recommended sample weight distribution to construct an unbiased dataset.

[0037] In some embodiments of the third aspect, when the instruction is executed by the processor, the device updates the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining user modifications to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.

[0038] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device performs the following: removing the portion associated with the label of the target data item from the target data item in the dataset to be processed, to obtain the remainder in the target data item; and using the remainder to construct an irrelevant data item in the irrelevant dataset, the label of the irrelevant data item corresponding to the label of the target data item.

[0039] In some embodiments of the third aspect, the dataset to be processed is an image dataset, and the computing device, when the instruction is executed by the processor, performs: image segmentation on a target data item to be processed in the dataset to obtain a background image corresponding to the target data item; and uses the background image to construct an irrelevant data item in the irrelevant dataset.

[0040] In some embodiments of the third aspect, the data item to be processed in the dataset to be processed is a video sequence, and the computing device, when the instruction is executed by the processor, performs the following: determining a binary image of the video sequence based on gradient information between a frame of the video sequence and the frame immediately preceding it; generating a background image of the video sequence based on the binary image; and constructing an irrelevant data item in the irrelevant dataset using the background image of the video sequence.

[0041] In some embodiments of the third aspect, when the instruction is executed by the processor, the computing device performs: obtaining a CAM by inputting a target irrelevant data item into the trained classification model; obtaining a superposition result by superimposing the CAM on the target irrelevant data item; and displaying the superposition result.

[0042] In a fourth aspect, a computer-readable storage medium is provided that stores a computer program thereon, which, when executed by a processor, implements the operations of the method according to the first aspect or any of the embodiments described above.

[0043] In a fifth aspect, a chip or chip system is provided. The chip or chip system includes processing circuitry configured to perform operations according to the method of the first aspect or any of the embodiments described above.

[0044] In a sixth aspect, a computer program or computer program product is provided. The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to perform operations according to the method of the first aspect or any of the embodiments described above.

Attached Figure Description

[0045] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0046] Figure 1 shows a schematic diagram of the structure of a system 100 according to an embodiment of the present disclosure;

[0047] Figure 2 shows a schematic diagram of the structure of a dataset processing module 200 according to an embodiment of the present disclosure;

[0048] Figure 3 shows a schematic diagram of a process 300 in which the model training module 130 obtains the recommended sample weights according to an embodiment of the present disclosure;

[0049] Figure 4 shows a schematic diagram of a scenario 400 in which a system 100 according to an embodiment of the present disclosure is deployed in a cloud environment;

[0050] Figure 5 shows a schematic diagram of a scenario 500 in which a system 100 according to an embodiment of the present disclosure is deployed in different environments;

[0051] Figure 6 shows a schematic diagram of the structure of a computing device 600 according to an embodiment of the present disclosure;

[0052] Figure 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure;

[0053] Figure 8 shows a schematic flowchart of a process 800 for constructing irrelevant data items according to an embodiment of the present disclosure;

[0054] Figure 9 shows a schematic diagram of a process 900 for updating the sample weight distribution of a dataset to be processed according to an embodiment of the present disclosure;

[0055] Figure 10 shows a schematic block diagram of a data processing apparatus 1000 according to an embodiment of the present disclosure.

Detailed Implementation

[0056] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0057] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0058] Artificial intelligence (AI) uses computers to simulate certain human thought processes and intelligent behaviors. The history of AI research follows a natural and clear trajectory, from a focus on "reasoning," to a focus on "knowledge," and then to a focus on "learning." AI has already been widely applied in various industries, including security, healthcare, transportation, education, and finance.

[0059] Machine learning is a branch of artificial intelligence that studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. In other words, machine learning studies how to improve the performance of specific algorithms through experiential learning.

[0060] Deep learning is a type of machine learning technique based on deep neural network algorithms. Its main characteristic is the use of multiple nonlinear transformation structures to process and analyze data. It is primarily applied in perception and decision-making scenarios within the field of artificial intelligence, such as image and speech recognition, natural language translation, and computer games.

[0061] Data and algorithms are two crucial pillars of artificial intelligence, and correspondingly, data bias is a key concern in the field. For a specific machine learning task, factors in the data that are correlated with the task but not causally related to it, such as imbalanced samples or the presence of artificial markers, can be considered data bias.

[0062] Dataset bias refers to the presence of spurious features in a dataset that a machine learning model might learn. Taking image datasets as an example, images may contain information related to the data acquisition device and parameters, which is irrelevant to the acquisition task. However, due to data acquisition flaws, machine learning models might make inferences based on this information, directly guessing the classification result instead of learning image features truly relevant to the target task.

[0063] When machine learning models are trained on image datasets with inherent biases, they may fail to learn the training task objectively and realistically as expected. This can lead to a significant performance degradation in real-world applications. Even if performance does not degrade, the basis on which the model reaches its predictions may be unacceptable, potentially resulting in ethical controversies. For example, the predictions of a lipstick prediction model were almost unaffected after the lips were covered, indicating that the model had not actually learned lip-related features. Similarly, a medical image recognition model inferred the image capture location from markers placed by doctors, which affected its prediction results.

[0064] One current approach is to crop out regions that might affect model learning, or to adjust color, grayscale, and other parameters of the image data, so that these data biases do not affect model training. However, this approach cannot exhaustively identify all biases and is labor-intensive, requiring significant manpower and time.

[0065] In view of this, embodiments of the present disclosure provide a scheme for quantitatively assessing dataset bias, thereby effectively determining the impact of dataset bias and enabling adjustments to the dataset accordingly, ensuring that the adjusted dataset does not negatively affect the model due to data bias.

[0066] Figure 1 shows a schematic diagram of the structure of a system 100 according to an embodiment of the present disclosure. As shown in Figure 1, system 100 includes an input/output (I/O) module 110, a dataset processing module 120, and a model training module 130. Optionally, as shown in Figure 1, system 100 may also include a model storage module 140 and a data storage module 150. The modules shown in Figure 1 can communicate with each other.

[0067] The input/output module 110 can be used to acquire a dataset to be processed. For example, it can receive a dataset to be processed that is input by a user.

[0068] Optionally, the dataset to be processed can be stored in the data storage module 150. As an example, the data storage module 150 can be a data storage resource corresponding to an object storage service (OBS) provided by a cloud service provider.

[0069] The dataset to be processed contains a large number of data items, each with a label. In other words, the dataset to be processed contains multiple data items with labels.

[0070] Labels can be assigned manually or obtained through machine learning or other methods; this disclosure is not limited in this respect. Labels may also be called task labels, annotation information, or other names, which are not listed exhaustively here.

[0071] In some examples, the annotation information can be provided by annotators based on experience, targeting specific parts of the data items to be processed. Alternatively, the annotation information can be provided using image recognition models and annotation models.

[0072] For example, for image data including faces, labels can be added to the face, such as gender, age, whether glasses are worn, whether a hat is worn, face size, etc. For example, for medical images (such as ultrasound images), labels can be added to the detected areas to indicate whether there are lesions.

[0073] It is understandable that the data items to be processed can include parts related to the label and parts unrelated to the label. Taking the face image above as an example, assuming the label is for the face (e.g., marking the face location with a bounding box), then the face region in the image is the part related to the label, while the other regions in the image are parts unrelated to the label. Assuming the label is for the eyes (e.g., labeling pupil color with "black," "brown," etc.), then the eye region in the image is the part related to the label, while the other regions in the image are parts unrelated to the label.

[0074] The data items to be processed in the dataset can be of any type, such as images, videos, audio, text, etc. For ease of description, images will be used as an example below.

[0075] The embodiments of this disclosure do not limit the source of the data items to be processed. Taking images as an example, they may be collected from open source datasets, collected by different image acquisition devices, collected by the same image acquisition device at different times, taken from image frames in a video sequence collected by an image acquisition device, obtained from any combination of the above, or obtained from other sources.

[0076] The input/output module 110 can be implemented as independent input and output modules, or as a coupled module that simultaneously performs input and output functions. As an example, it can be implemented using a graphical user interface (GUI) or a command-line interface (CLI).

[0077] The dataset processing module 120 can obtain the dataset to be processed from the input/output module 110, or optionally, from the data storage module 150. Further, the dataset processing module 120 can construct an irrelevant dataset based on the dataset to be processed. The irrelevant dataset includes labeled irrelevant data items, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed.

[0078] Optionally, the irrelevant dataset can be stored in the data storage module 150.

[0079] As described above, the data item to be processed has a label, and the data item includes a part related to the label and a part unrelated to the label. Therefore, the part related to the label can be removed from the data item to be processed, retaining only the part unrelated to the label. The retained part serves as an irrelevant data item, and the label of the data item to be processed is used as the label of this irrelevant data item. This process may also be referred to as splitting, segmenting, separating, or by other names, and this disclosure is not limited thereto.

[0080] In other words, for a specific data item in the dataset to be processed (called the target data item), the part associated with its label can be removed from the target data item to obtain the remaining part. This remaining part is then used to construct an irrelevant data item in the irrelevant dataset, where the label of the irrelevant data item corresponds to the label of the target data item.

[0081] For example, suppose the data item to be processed is a face image, and the label represents the skin color of the face, such as "white". Then, the face region can be removed from the face image, and the remaining part after removing the face region can be used as the corresponding irrelevant data item, and the irrelevant data item still has the label "white" representing the skin color of the face.

[0082] In some implementations, if the data items to be processed in the dataset are images, then irrelevant data items can be obtained through image segmentation. The part of the image associated with the label is the foreground region, and the other regions in the image are the background regions. Therefore, irrelevant data items can be determined based solely on the background regions through foreground-background separation.

[0083] Specifically, image segmentation is performed on the target data item (target image) in the dataset to be processed to obtain the background image corresponding to the target image, and then irrelevant data items are constructed using the background image.
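As an illustrative sketch only, the construction of an irrelevant data item from a target image can be expressed as follows. The sketch assumes that a foreground mask has already been produced by some segmentation algorithm; the function name `make_irrelevant_item`, the `fill_value` parameter, and the placeholder label are assumptions introduced here and do not appear in the disclosure.

```python
import numpy as np

def make_irrelevant_item(image, foreground_mask, label, fill_value=0):
    """Blank out the label-related (foreground) pixels and keep the original label."""
    image = np.asarray(image)
    mask = np.asarray(foreground_mask, dtype=bool)
    if mask.shape != image.shape[:2]:
        raise ValueError("mask must match the image height and width")
    background = image.copy()          # leave the target image untouched
    background[mask] = fill_value      # remove the part related to the label
    return background, label

# Toy 4x4 RGB image whose central 2x2 block plays the role of the foreground.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
background, label = make_irrelevant_item(image, mask, label="label_a")
```

The returned pair corresponds to one irrelevant data item together with the label inherited from the target data item.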

[0084] This disclosure does not limit the specific algorithm used for image segmentation. For example, one or more of the algorithms listed below can be used, or other algorithms can be used: threshold-based image segmentation algorithms, region-based image segmentation algorithms, edge detection-based image segmentation algorithms, wavelet analysis and wavelet transform-based image segmentation algorithms, genetic algorithm-based image segmentation algorithms, active contour model-based image segmentation algorithms, deep learning-based image segmentation algorithms, etc. Among them, deep learning-based image segmentation algorithms include, but are not limited to: feature encoder-based segmentation algorithms, region proposal-based segmentation algorithms, RNN-based segmentation algorithms, upsampling / deconvolution-based segmentation algorithms, feature resolution-enhanced segmentation algorithms, feature enhancement-based segmentation algorithms, and segmentation algorithms using Conditional Random Field (CRF) / Markov Random Field (MRF), etc.

[0085] In other implementations, if the data items to be processed in the dataset are video sequences, different data items can have the same or different durations. For example, the first data item to be processed in the dataset is a first video sequence with a length of m1 frames, including m1 frames of images. Similarly, the second data item to be processed in the dataset is a second video sequence with a length of m2 frames, including m2 frames of images. m1 and m2 can be equal or unequal.

[0086] Specifically, video segmentation is performed on the target data item (target video sequence) in the dataset to be processed to obtain the background image corresponding to the target video sequence, and then irrelevant data items are constructed using the background image.

[0087] This disclosure does not limit the specific algorithm used for video segmentation. As one example, image segmentation can be performed on each frame of the target video sequence, and the segmented background regions of the frames can be merged to obtain a background image corresponding to the target video sequence. As another example, the background image corresponding to the target video sequence can be obtained based on the gradient between adjacent frames in the target video sequence. Specifically, a binary image corresponding to the video sequence can be obtained based on the gradient information of the video sequence, and the background image of the video sequence is then generated based on this binary image, as described below with reference to Figure 2.

[0088] Figure 2 shows a schematic diagram of a dataset processing module 200 according to an embodiment of the present disclosure. The dataset processing module 200 can be used as one implementation of the dataset processing module 120 in Figure 1. The dataset processing module 200 can be used to determine an irrelevant dataset based on the dataset to be processed, wherein the data items to be processed in the dataset to be processed are video sequences, and the irrelevant data items in the irrelevant dataset can be background images corresponding to the video sequences.

[0089] As shown in Figure 2, the dataset processing module 200 may include a gradient calculation submodule 210, a gradient superposition submodule 220, a thresholding submodule 230, a morphological processing submodule 240, and a separation submodule 250.

[0090] The gradient calculation submodule 210 can be used to calculate the gradient information between a frame of a target video sequence and the previous frame.

[0091] For example, suppose the target video sequence includes m1 frames, namely frame 0, frame 1, ..., frame m1-1. Then we can calculate the gradient information between each pair of adjacent frames. Specifically, we can calculate the gradient information between frame 1 and frame 0, between frame 2 and frame 1, ... between frame m1-1 and frame m1-2.

[0092] The embodiments of this disclosure do not limit the specific method of calculating gradient information. For example, frame differences can be calculated. For instance, the gradient of the feature vectors of two frames along a specific dimension (e.g., the time dimension T) can be calculated, so that static background parts, such as image borders, can be extracted from the video sequence using motion information. Alternatively, the difference between the image and its grayscale counterpart can be calculated, thereby extracting the colored portions of the video frame images. This prevents colored marks, such as colored markers or text added after video capture, from being treated as foreground.

[0093] The gradient superposition submodule 220 can be used to superimpose the gradient information obtained by the gradient calculation submodule 210 to obtain a gradient superposition map.

[0094] The gradient superposition submodule 220 can perform superposition in ways including but not limited to weighted summation (such as average), finding the maximum value, finding the minimum value, or others.

[0095] The thresholding submodule 230 can be used to threshold the gradient superposition map obtained by the gradient superposition submodule 220 to obtain an initial binary image.

[0096] Specifically, for each pixel in the gradient superposition map, pixels with values greater than a threshold are marked as 1, and pixels with values less than or equal to the threshold are marked as 0, thus obtaining an initial binary image in which each pixel value is either 1 or 0.

[0097] The morphological processing submodule 240 can perform morphological processing on the initial binary image obtained by the thresholding submodule 230 to obtain the binary image corresponding to the video sequence.

[0098] For example, if a pixel has a value of 1 in the initial binary image, but all of its neighboring pixels have a value of 0, then the pixel value of that pixel can be reset to 0.

[0099] For example, morphological processing may include, but is not limited to, morphological dilation, morphological erosion, etc. For instance, the morphological processing submodule 240 may perform several morphological dilations on the initial binary image obtained by the thresholding submodule 230, and then perform the same number of morphological erosions to obtain a binary image.

[0100] The separation submodule 250 can obtain the background image corresponding to the video sequence based on the binary image obtained by the morphological processing submodule 240.

[0101] For example, a background image can be obtained by performing a matting operation on a binary image. This can be achieved, for instance, through matrix dot product.

[0102] In this way, the similarity of the background between the images in each frame of the video sequence can be fully considered to obtain the background image corresponding to the video sequence.
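As an illustrative sketch only, the chain formed by submodules 210 to 250 can be expressed with NumPy as follows. The sketch assumes grayscale frames, frame differences as the gradient, averaging as the superposition, a 3x3 structuring element, and the convention that pixels marked 0 in the binary image are background; the function names and the toy video are assumptions introduced here.

```python
import numpy as np

def neighbourhood(binary):
    """Stack the 3x3 neighbourhood of every pixel into a (9, H, W) array."""
    padded = np.pad(binary, 1, mode="edge")
    h, w = binary.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def dilate(binary):
    return neighbourhood(binary).max(axis=0)

def erode(binary):
    return neighbourhood(binary).min(axis=0)

def video_background(frames, threshold, closing_steps=1):
    frames = np.asarray(frames, dtype=np.float64)       # (T, H, W), grayscale
    gradients = np.abs(np.diff(frames, axis=0))         # frame t minus frame t-1
    superposed = gradients.mean(axis=0)                 # superposition by averaging
    binary = (superposed > threshold).astype(np.uint8)  # 1 where the content changes
    for _ in range(closing_steps):                      # dilations first ...
        binary = dilate(binary)
    for _ in range(closing_steps):                      # ... then the same number of erosions
        binary = erode(binary)
    # Assumption: pixels marked 0 are static background; keep them by element-wise product.
    background = frames[0] * (1 - binary)
    return background, binary

# Toy video: a bright 2x2 block moves across a constant background of value 50.
frames = np.full((3, 6, 6), 50.0)
for t in range(3):
    frames[t, 2:4, 2 * t:2 * t + 2] = 255.0
background, binary = video_background(frames, threshold=10.0)
```

In this toy case the two rows swept by the moving block are marked 1 in the binary image and are blanked in the background image, while the static rows are retained.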

[0103] Thus, in the embodiments of this disclosure, the background image is used as a representative of bias, thereby enabling bias checking of the dataset. It is understood that if the dataset is unbiased, then the features of the background image should not have any relation to the labels associated with the foreground region.

[0104] Suppose the dataset to be processed includes N data items to be processed, and the irrelevant dataset includes N1 irrelevant data items. If processing is performed for each data item to be processed to obtain the corresponding irrelevant data item, then N1 = N. If processing is performed for some data items in the dataset to be processed to obtain the corresponding irrelevant data item, then N1 < N. It can be understood that the irrelevant dataset obtained by processing all the data items to be processed has more irrelevant data items, and thus the dataset to be processed can be analyzed and evaluated more completely and comprehensively.

[0105] In one implementation, the constructed irrelevant dataset can be divided into two parts: the first part of irrelevant data items and the second part of irrelevant data items. Among them, the first part of irrelevant data items can be used to train the model, and the second part of irrelevant data items can be used to test the model. The embodiments of the present disclosure do not limit this division method. As an example, the irrelevant dataset can be divided into the first part and the second part according to a ratio of 9:1 or 1:1 or other ratios.

[0106] Exemplarily, the set composed of the first part of irrelevant data items can be called the irrelevant training set, and the set composed of the second part of irrelevant data items can be called the irrelevant test set. Alternatively, the set composed of the first part of irrelevant data items may include an irrelevant training set and an irrelevant validation set. As an example, the irrelevant dataset can be divided into an irrelevant training set, an irrelevant validation set, and an irrelevant test set according to a ratio of 7:2:1.
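As an illustrative sketch only, dividing the irrelevant dataset according to a ratio such as 7:2:1 can be expressed as follows. Shuffling before the split, the fixed seed, and the name `split_dataset` are assumptions introduced here; the disclosure does not limit the division method.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle the items and cut them into consecutive parts sized by `ratios`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    parts, start = [], 0
    for k, ratio in enumerate(ratios):
        if k == len(ratios) - 1:
            end = len(shuffled)          # last part takes the remainder, so nothing is dropped
        else:
            end = start + len(shuffled) * ratio // total
        parts.append(shuffled[start:end])
        start = end
    return parts

# Toy irrelevant dataset: (item name, label) pairs.
items = [(f"background_{i}", i % 2) for i in range(20)]
train_set, validation_set, test_set = split_dataset(items)
```

Passing `ratios=(9, 1)` or `ratios=(1, 1)` yields the two-part divisions mentioned above.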

[0107] For the sake of simplicity of description, hereinafter, the set composed of the first part of irrelevant data items will be called the first dataset (or training set), and the set composed of the second part of irrelevant data items will be called the second dataset (or test set).

[0108] In some embodiments, the dataset processing module 120 can first preprocess the dataset to be processed, and then construct an irrelevant dataset based on the preprocessed dataset to be processed. The preprocessing includes but is not limited to: clustering analysis, data denoising, etc.

[0109] The model training module 130 can include a training sub-module 132 and an evaluation sub-module 134.

[0110] The training sub-module 132 can be used to train the classification model. Specifically, the classification model can be trained based on the first part of irrelevant data items in the irrelevant dataset and the label of each irrelevant data item in this first part.

[0111] One implementation is that the first part of the irrelevant data items used for training can be the entire irrelevant dataset. This allows for the use of more data items in training, making the trained classification model more robust. Another implementation is that the first part of the irrelevant data items used for training can be a portion of the irrelevant dataset, as described above, where the irrelevant dataset is divided into a first part of irrelevant data items and a second part of irrelevant data items.

[0112] For ease of description below, the set of the first part of irrelevant data items used for training will be called the training set, and correspondingly, the first part of irrelevant data items can be called training items.

[0113] It should be noted that the training here can be training an initial classification model or updating a previously trained classification model, wherein the initial classification model can be an untrained classification model. The previously trained classification model can be obtained by training the initial classification model. As an example, the training submodule 132 can obtain the initial classification model or the previously trained classification model from the model storage module 140.

[0114] Training submodule 132 can obtain a first portion of irrelevant data items and the label of each irrelevant data item in the irrelevant dataset used for training from dataset processing module 120 or data storage module 150. Alternatively, training submodule 132 can obtain a first portion of irrelevant data items in the irrelevant dataset used for training from dataset processing module 120 and obtain the label of each irrelevant data item in the first portion of irrelevant data items from input / output module 110.

[0115] Optionally, before training based on the training set (the first part of the irrelevant data items in the irrelevant dataset), the training submodule 132 can preprocess the training set, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, etc. For example, the training data items after feature extraction can be represented as S-dimensional feature vectors, where S is greater than 1.

[0116] It is understood that the present disclosure does not limit the model structure of the classification model. As an example, the classification model can be a convolutional neural network (CNN) model, which may optionally include an input layer, a convolutional layer, a deconvolutional layer, a pooling layer, a fully connected layer, an output layer, etc.

[0117] Classification models include numerous parameters, which represent the weights of computational formulas or factors within the model. These parameters are iteratively updated during training. Classification models also include hyperparameters, which guide the construction or training of the classification model. Hyperparameters include the number of training iterations, learning rate, batch size, number of layers, and number of neurons per layer. Hyperparameters can be obtained by training the model on a training set or are pre-defined parameters that are not updated during model training.

[0118] For example, the training submodule 132 can train the classification model by referring to existing training processes. As an illustrative description, this training process may involve: inputting training data items from the training set into the classification model, using the labels corresponding to the training data items as a reference, obtaining the loss value between the output of the classification model and the corresponding label using a loss function, and adjusting the parameters of the classification model based on this loss value. The classification model is trained iteratively over the training data items in the training set, and its parameters are continuously adjusted until the classification model can, based on an input training data item, output a value close to the label corresponding to that training data item with high accuracy, for example, until the loss function is minimized or reaches a reference threshold.

[0119] The loss function during training measures the degree to which the classification model has been trained (i.e., it calculates the difference between the model's predictions and the true values). During training, to ensure the model's output is as close as possible to the true values (the corresponding labels), the model's predictions are compared to the true values, and the parameters are updated based on the differences. Each training iteration uses the loss function to determine the difference between the model's predictions and the true values, updating the model's parameters until the model can predict values very close to the true values; at this point, the model is considered successfully trained.
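As an illustrative sketch only, the loop of computing a loss value and adjusting the parameters can be expressed as follows. For brevity the sketch substitutes a logistic-regression classifier for the CNN mentioned above and uses cross-entropy as the loss function; the function name, the learning rate, the number of epochs, and the toy one-dimensional features are assumptions introduced here.

```python
import numpy as np

def train_logistic(features, labels, epochs=200, learning_rate=0.5):
    """Fit a logistic-regression classifier by gradient descent on cross-entropy."""
    x = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    weights = np.zeros(x.shape[1])
    bias = 0.0
    losses = []
    eps = 1e-12                                   # avoids log(0)
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(x @ weights + bias)))
        loss = -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
        losses.append(loss)
        error = probs - y                         # gradient of the loss w.r.t. the logits
        weights -= learning_rate * (x.T @ error) / len(y)
        bias -= learning_rate * error.mean()
    return weights, bias, losses

# Toy training set: one feature per item, labels separated around zero.
features = np.array([[-1.5], [-0.5], [0.5], [1.5]])
labels = np.array([0, 0, 1, 1])
weights, bias, losses = train_logistic(features, labels)
predictions = (features @ weights + bias > 0).astype(int)
```

The recorded `losses` show the loss value falling as the parameters are adjusted, which is the behaviour described above for a successfully trained model.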

[0120] The "classification model" in the embodiments of this disclosure may also be referred to as a machine learning model, a convolutional classification model, a background classification model, a data bias model, or other names, or may be simply referred to as a "model," etc., and this disclosure is not limited thereto. Optionally, the trained classification model may be stored in the model storage module 140. In some examples, the model storage module 140 may be part of the model training module 130.

[0121] The evaluation submodule 134 can be used to evaluate the classification model. Specifically, it can determine the evaluation result for the trained classification model based on the second part of irrelevant data items in the irrelevant dataset and the label of each irrelevant data item in the second part. This evaluation result can be used to characterize the significance of data bias in the dataset to be processed.

[0122] As mentioned above, the set of irrelevant data items in the second part can be the test set, and correspondingly, the irrelevant data items in the second part can be the test data items.

[0123] As an example, the evaluation process may include: inputting a test data item into a trained classification model to obtain a prediction result for the test data item, and determining an evaluation result based on a comparison between the prediction result and the label of the test data item.

[0124] In this embodiment of the disclosure, the evaluation results may include at least one of the following: accuracy, precision, recall, F1 score, precision-recall (PR) curve, average precision (AP) index, false positive rate, false negative rate, etc.

[0125] Specifically, a confusion matrix can be constructed, which shows the number of positive and negative examples, their true values, and their predicted values.

[0126] Accuracy refers to the proportion of correctly classified samples out of the total samples. For example, if the number of test data items in the test set is N2, and the number of predictions that match the labels is N21, then the accuracy can be expressed as N21 / N2.

[0127] Precision refers to the proportion of samples predicted to be positive that are actually positive. For example, if the number of test data items in the test set whose prediction results are positive is N22, and the number of actual positive examples among these N22 test data items is N23, then the precision can be expressed as N23 / N22.

[0128] Recall is the proportion of samples that are actually positive that are predicted to be positive. For example, if the number of test data items in the test set that are labeled as positive is N31, and the number of these N31 positive examples that are predicted to be positive is N32, then the recall can be expressed as N32 / N31.

[0129] A PR curve defines recall on the horizontal axis and precision on the vertical axis. A point on the PR curve represents the recall and precision of the returned results at a certain threshold, where the model classifies results greater than the threshold as positive samples and results less than the threshold as negative samples. The entire PR curve is generated by shifting the threshold from high to low. The area near the origin represents the precision and recall of the model when the threshold is at its maximum.
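As an illustrative sketch only, the points of a PR curve can be generated by lowering the threshold step by step as follows. The sketch assumes that the thresholds are taken from the predicted scores themselves and that a score equal to the threshold counts as positive, so that every threshold yields at least one predicted positive; the function name and toy scores are assumptions introduced here.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points obtained by moving the threshold from high to low."""
    total_positive = sum(labels)          # assumed to be greater than zero
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((tp / total_positive, tp / (tp + fp)))
    return points

scores = [0.9, 0.8, 0.6, 0.3]
labels = [1, 0, 1, 0]
points = pr_curve(scores, labels)
```

Each tuple holds the horizontal-axis value (recall) first and the vertical-axis value (precision) second.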

[0130] The F1 score, also known as the F1 index, is the harmonic mean of precision and recall. For example, the F1 score can be calculated as the ratio of twice the product of precision and recall to the sum of precision and recall.
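As an illustrative sketch only, the accuracy, precision, recall, and F1 score defined above can be computed from the confusion-matrix counts as follows. The function name and the convention of returning 0.0 when a denominator is zero are assumptions introduced here.

```python
def classification_metrics(predictions, labels):
    """Compute accuracy, precision, recall, and F1 for binary predictions."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

predictions = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(predictions, labels)
```

In this toy case there are two true positives, one false positive, two true negatives, and one false negative, so all four values equal 2/3.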

[0131] In some embodiments of this disclosure, the evaluation results may include positive example representation values, such as a first precision and / or a first recall. The first precision represents the proportion of samples predicted as positive that are also actually positive. The first recall represents the proportion of samples that were actually positive that were predicted as positive. The evaluation results may also include negative example representation values, such as a second precision and / or a second recall. The second precision represents the proportion of samples predicted as negative that are also actually negative. The second recall represents the proportion of samples that were actually negative that were predicted as negative.

[0132] In some embodiments of this disclosure, the evaluation results may include a first predicted mean and / or a second predicted mean. The first predicted mean represents the average of the predicted values for samples that were actually positive. The second predicted mean represents the average of the predicted values for samples that were actually negative. The evaluation results may include a mean difference to represent the difference between the first predicted mean and the second predicted mean, such as the difference between the first predicted mean and the second predicted mean or the ratio of the first predicted mean to the second predicted mean.
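As an illustrative sketch only, the positive and negative example representation values and the mean difference can be computed as follows. The sketch assumes that the classification model outputs a score per item, that a score above 0.5 counts as a positive prediction, and that the mean difference is taken as a subtraction; the function and key names are assumptions introduced here.

```python
def class_wise_summary(scores, labels, threshold=0.5):
    """Precision and recall for each class, plus the predicted means and their difference."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    predicted = [1 if s > threshold else 0 for s in scores]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    negative_scores = [s for s, y in zip(scores, labels) if y == 0]
    first_mean = ratio(sum(positive_scores), len(positive_scores))
    second_mean = ratio(sum(negative_scores), len(negative_scores))
    return {
        "first_precision": ratio(tp, tp + fp),
        "first_recall": ratio(tp, tp + fn),
        "second_precision": ratio(tn, tn + fn),
        "second_recall": ratio(tn, tn + fp),
        "first_predicted_mean": first_mean,
        "second_predicted_mean": second_mean,
        "mean_difference": first_mean - second_mean,
    }

summary = class_wise_summary(scores=[0.9, 0.7, 0.4, 0.2], labels=[1, 0, 1, 0])
```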

[0133] It should be understood that the above are only some examples of evaluation results, and other characteristics can also be used as evaluation results, which will not be listed in this disclosure.

[0134] For example, the evaluation results can be presented to the user by the input / output module 110. This can be done, for instance, through a graphical user interface for easy viewing.

[0135] Thus, through the embodiments of this disclosure, the salience of bias in a dataset can be characterized quantitatively. This quantitative evaluation scheme provides users with a clear reference, facilitating adjustments and other processing of the dataset.

[0136] In scenarios where the input / output module 110 includes a graphical user interface, the input / output module 110 can also present the representation of dataset bias in a visual manner through the graphical user interface.

[0137] Specifically, by inputting target-irrelevant data items into a trained classification model, a Class Activation Map (CAM) is obtained. The CAM is then superimposed with the target-irrelevant data items to obtain the superimposed result, which is then displayed.

[0138] A class activation map is also known as a class activation heatmap. Using the CAM, embodiments of this disclosure can characterize the regions of interest of the classification model, that is, identify which regions lead to bias.

[0139] The embodiments disclosed herein do not limit the specific method for obtaining the CAM. As an example, a gradient-based CAM (Grad-CAM) method can be used to obtain the CAM. For instance, the output of the last convolutional layer of the classification model, i.e., the last layer feature map, can be extracted, and the extracted last layer feature map can be weighted and summed to obtain the CAM. Optionally, the weighted sum can also be processed by a Rectified Linear Unit (ReLU) activation function to obtain the CAM. The weights used for weighting here can be the weights of the top fully connected layer. As an example, the partial derivatives of the output of the last layer softmax of the classification model with respect to all pixels of the last layer feature map can be calculated, and then the global average across the width and height dimensions can be taken as the corresponding weights.

[0140] The embodiments of this disclosure do not limit the way the CAM is superimposed on target-irrelevant data items (such as background images). For example, a weighted summation method can be used for superposition. As an example, the weights of the CAM and the background image can be equal.
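As an illustrative sketch only, the Grad-CAM style weighting and the superposition with equal weights can be expressed with NumPy as follows. The sketch assumes that the last-layer feature maps and the partial derivatives with respect to them are already available as arrays of shape (C, H, W), and that the CAM and the background image share the same size and value range; the function names and toy arrays are assumptions introduced here.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Weight each feature map by its globally averaged gradient, sum, then apply ReLU."""
    channel_weights = gradients.mean(axis=(1, 2))               # average over height and width
    cam = np.tensordot(channel_weights, feature_maps, axes=1)   # weighted sum over channels
    cam = np.maximum(cam, 0.0)                                  # ReLU
    peak = cam.max()
    return cam / peak if peak > 0 else cam                      # scale to [0, 1]

def overlay(cam, background, cam_weight=0.5):
    """Superimpose the CAM on the background image by weighted summation."""
    return cam_weight * cam + (1.0 - cam_weight) * background

feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
cam = grad_cam(feature_maps, gradients)
result = overlay(cam, background=np.full((2, 2), 0.4))
```

In practice the gradients would be obtained from the automatic differentiation of the framework used to implement the classification model, and the CAM would be resized to the background image before superposition.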

[0141] Therefore, the embodiments of this disclosure provide a scheme for quantitatively assessing and visually representing dataset bias, thereby clearly characterizing the salience of dataset bias and visually presenting the specific locations where bias occurs. This allows users to understand the dataset bias situation more intuitively and comprehensively. This scheme requires minimal user intervention, can be automated, and improves processing efficiency while ensuring the accuracy of quantitative bias assessment.

[0142] The model training module 130 can also be used to adjust the dataset to be processed based on the classification model.

[0143] Specifically, the dataset to be processed may have an initial sample weight distribution. Correspondingly, the first dataset has a first sample weight distribution, and the second dataset has a second sample weight distribution. For example, assuming the initial sample weight of the target data item to be processed is 'a', then the sample weight of the irrelevant data items generated based on the target data item to be processed is also 'a'.

[0144] For example, the model training module 130 can be used to obtain a recommended sample weight distribution based on iterative training of the classification model, as described below with reference to Figure 3.

[0145] Figure 3 shows a schematic diagram of a process 300 by which the model training module 130 obtains the recommended sample weight distribution according to an embodiment of the present disclosure.

[0146] At 310, a first dataset with a first sample weight distribution and a second dataset with a second sample weight distribution are determined.

[0147] Specifically, an irrelevant dataset can be constructed based on the dataset to be processed, and the irrelevant dataset can be divided into a first dataset and a second dataset, as described in the example above.

[0148] For example, the data items to be processed in the dataset can have initial sample weights; that is, the dataset can have an initial sample weight distribution. As one example, the initial sample weights can be input by the user through the input / output module 110. As another example, the initial sample weights can be determined through an initialization process.

[0149] Sample weights can be used to indicate the sampling probability of a data item to be processed. For example, suppose the sample weight of the i-th data item to be processed is w_i. Then the sampling probability of the i-th data item to be processed is w_i / (w_1 + w_2 + ... + w_N), where N is the number of data items to be processed in the dataset.

[0150] As an example, the initial sample weight distribution can indicate that the sampling probability of each data item in the dataset to be processed is equal. Suppose that the dataset to be processed includes N data items to be processed, and the initial sample weight of each data item to be processed is 1, then the sampling probability of each data item to be processed is initialized to 1 / N.
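As an illustrative sketch only, converting sample weights into sampling probabilities and drawing items accordingly can be expressed as follows. The sketch assumes that the sampling probability of an item is its sample weight divided by the sum of all sample weights, which agrees with the 1 / N case above; the function names are assumptions introduced here.

```python
import random

def sampling_probabilities(sample_weights):
    """Normalize sample weights so that they sum to 1."""
    total = sum(sample_weights)
    if total <= 0:
        raise ValueError("the sample weights must have a positive sum")
    return [w / total for w in sample_weights]

def sample_items(items, sample_weights, k, seed=0):
    """Draw k items with replacement, with probability proportional to the weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=sample_weights, k=k)

uniform = sampling_probabilities([1, 1, 1, 1])
skewed = sampling_probabilities([1, 2, 1])
drawn = sample_items(["item_a", "item_b"], [0, 1], k=5)
```

With all weights equal to 1 every item receives probability 1 / N, and an item whose weight is 0 is never drawn.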

[0151] It can be understood that once the initial sample weight distribution is determined, the first sample weight distribution and the second sample weight distribution can be determined accordingly.

[0152] At 320, the first dataset is sampled based on the first sample weight distribution, and the classification model is trained iteratively.

[0153] At 330, the classification model trained at 320 is evaluated based on the second dataset, and the evaluation result is obtained.

[0154] For example, irrelevant data items in the second dataset can be input into the trained classification model to obtain prediction results for those irrelevant data items, and the evaluation result is determined based on the comparison between the prediction results and the labels of the irrelevant data items. The evaluation result can include at least one of the following: accuracy, precision, recall, F1 score, precision-recall (PR) curve, average precision (AP) index, false positive rate, false negative rate, etc. The evaluation results are described above and will not be repeated here.

[0155] At 340, it is determined whether the bias significance indicated by the evaluation result is high.

[0156] If it is determined at 340 that the bias significance indicated by the evaluation result is high (e.g., the evaluation result is greater than a preset threshold), the process can proceed to 350. Otherwise, if it is determined at 340 that the bias significance indicated by the evaluation result is not high (e.g., the evaluation result is not greater than the preset threshold), the process can proceed to 360.

[0157] The preset threshold can be set based on the processing accuracy of the dataset to be processed and the application scenario. The preset threshold can be related to the specific meaning of the evaluation result. For example, if the evaluation result includes accuracy, the preset threshold can be set to, for example, 30%, 50%, or other values.

[0158] At 350, the sample weight distribution is updated.

[0159] Referring to Figure 3, as shown by the dashed arrows, after 350 the process can return to 310 or 320 to continue execution.

[0160] In one example, execution can return to 310 and continue, meaning the first and second datasets are reconstructed. Thus, an irrelevant data item that might have belonged to the first dataset in the previous loop could belong to either the first or second dataset in the next loop.

[0161] In another example, execution can return to 320 to continue, meaning that the irrelevant data items in the first and second datasets have not changed, but the first sample weight distribution and / or the second sample weight distribution have been updated.

[0162] After 350, the first dataset can be resampled based on the updated first sample weight distribution, and the classification model can be retrained iteratively. The retrained classification model can then be evaluated based on the second dataset to obtain a new evaluation result.

[0163] Operations 310 to 350, or 320 to 350, can be performed iteratively until the evaluation result indicates that the bias is not significant (e.g., the evaluation result is not greater than the preset threshold).
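As an illustrative sketch only, the loop over 320 to 350 and the exit to 360 can be expressed as follows. The training, evaluation, and update steps are passed in as functions; the toy versions below (a per-background majority vote, a weighted accuracy computed on the same items rather than on a separate second dataset, and a rule that adds 1 to the weight of each misclassified item) as well as the `max_rounds` limit are assumptions introduced here to keep the sketch self-contained.

```python
def recommend_sample_weights(weights, train, evaluate, update, threshold, max_rounds=50):
    """Repeat train -> evaluate -> update until the score is not greater than the threshold."""
    history = []
    for _ in range(max_rounds):
        model = train(weights)                # 320: train with the current weights
        score = evaluate(model, weights)      # 330: obtain the evaluation result
        history.append(score)
        if score <= threshold:                # 340: bias significance is not high
            break                             # 360: current weights are recommended
        weights = update(model, weights)      # 350: update the sample weight distribution
    return weights, history

# Toy irrelevant data items: (background feature, label) pairs.
items = [(0, 0)] * 3 + [(0, 1)] + [(1, 1)] * 3 + [(1, 0)]

def train(weights):
    # Toy model: for each background value, predict the label with the larger total weight.
    model = {}
    for background in (0, 1):
        votes = {0: 0.0, 1: 0.0}
        for (feature, label), w in zip(items, weights):
            if feature == background:
                votes[label] += w
        model[background] = max(votes, key=votes.get)
    return model

def evaluate(model, weights):
    correct = sum(w for (f, y), w in zip(items, weights) if model[f] == y)
    return correct / sum(weights)             # weighted accuracy

def update(model, weights):
    return [w + 1 if model[f] != y else w for (f, y), w in zip(items, weights)]

recommended, history = recommend_sample_weights(
    [1] * len(items), train, evaluate, update, threshold=0.65)
```

In this toy run the weighted accuracy of predicting the label from the background falls from 0.75 to 0.6 after one update, at which point the loop stops.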

[0164] The embodiments disclosed herein do not limit the specific implementation method of updating the sample weight distribution.

[0165] As an example, the sample weight distribution can be updated randomly. For instance, the sample weights of some data items to be processed can be updated randomly, such as changing the sample weight of one data item from 1 to 2, and changing the sample weight of another data item from 1 to 3, and so on. Understandably, this random approach has uncertainty, which may make the process of obtaining the recommended sample weight distribution time-consuming.

[0166] As another example, the sample weight distribution can be updated using predetermined rules. For instance, the second sample weight distribution can be updated. For example, if the evaluation result indicates that the classification model's prediction for an irrelevant data item in the second dataset differs from the label of that irrelevant data item, the sample weight of that irrelevant data item can be increased. For example, the sample weight of that irrelevant data item can be updated from a1 to a1 + 1, to 2 × a1, or to another value. In this example, the first sample weight distribution can remain unchanged or can be updated in other ways. Optionally, in this example, after updating the sample weight distribution, the second dataset can be swapped with the first dataset before proceeding to the next iteration. For example, in the next iteration, the classification model will be trained based on the second dataset from the previous iteration and the updated second sample weight distribution.
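As an illustrative sketch only, the predetermined rule of increasing the sample weight of each misclassified irrelevant data item can be expressed as follows. The function name and the `mode` and `amount` parameters are assumptions introduced here.

```python
def increase_misclassified(weights, predictions, labels, mode="add", amount=1):
    """Raise the weight of every item whose prediction differs from its label."""
    updated = []
    for w, p, y in zip(weights, predictions, labels):
        if p == y:
            updated.append(w)             # correctly predicted: weight unchanged
        elif mode == "add":
            updated.append(w + amount)    # e.g. a1 -> a1 + 1
        elif mode == "double":
            updated.append(2 * w)         # e.g. a1 -> 2 * a1
        else:
            raise ValueError(f"unknown mode: {mode}")
    return updated

weights = [1, 1, 3]
predictions = [0, 1, 1]
labels = [0, 0, 0]
added = increase_misclassified(weights, predictions, labels, mode="add")
doubled = increase_misclassified(weights, predictions, labels, mode="double")
```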

[0167] As another example, the sample weight distribution can be optimized and updated using a genetic algorithm. For instance, the sample weight distribution can be used as the initial values of the genes in the genetic algorithm. An objective function can be constructed based on the evaluation results obtained at 330, allowing the genetic algorithm to optimize the sample weight distribution. The optimized sample weight distribution is the updated sample weight distribution. The embodiments of this disclosure do not limit the construction method of the objective function of the genetic algorithm. For example, if the evaluation results include the mean difference between positive and negative samples and the accuracy, then the sum of the mean difference and the accuracy can be used as the objective function. It is understood that other methods can also be used to construct the objective function, which will not be listed here.
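As an illustrative sketch only, a small genetic algorithm over the sample weights can be expressed as follows. The sketch assumes that the objective is to be minimized, that retraining is replaced by a fixed toy predictor that guesses the label from the background, and that the absolute value of the mean difference is used; the function names, the mutation range, the lower bound of 0.1 on a weight, and the population settings are assumptions introduced here.

```python
import random

def genetic_search(initial, objective, generations=30, population_size=20,
                   mutation_rate=0.2, seed=0):
    """Minimize objective(weights) with elitism, one-point crossover, and mutation."""
    rng = random.Random(seed)

    def mutate(genes):
        return [max(0.1, g + rng.uniform(-0.5, 0.5)) if rng.random() < mutation_rate else g
                for g in genes]

    # The initial sample weight distribution seeds the population.
    population = [list(initial)] + [mutate(initial) for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=objective)
        parents = population[:population_size // 2]     # keep the better half unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(initial))        # one-point crossover
            children.append(mutate(a[:cut] + b[cut:]))
        population = parents + children
    return min(population, key=objective)

# Toy items: (prediction made from the background, label) pairs.
items = [(0, 0)] * 3 + [(0, 1)] + [(1, 1)] * 3 + [(1, 0)]

def objective(weights):
    total = sum(weights)
    accuracy = sum(w for (p, y), w in zip(items, weights) if p == y) / total
    pos = [(p, w) for (p, y), w in zip(items, weights) if y == 1]
    neg = [(p, w) for (p, y), w in zip(items, weights) if y == 0]
    mean_pos = sum(p * w for p, w in pos) / sum(w for _, w in pos)
    mean_neg = sum(p * w for p, w in neg) / sum(w for _, w in neg)
    return accuracy + abs(mean_pos - mean_neg)          # sum of accuracy and mean difference

initial = [1.0] * len(items)
best = genetic_search(initial, objective)
```

Because the better half of each generation is carried over unchanged, the returned weights never score worse than the initial sample weight distribution under this objective.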

[0168] Thus, embodiments of this disclosure can update the sample weight distribution of the dataset to be processed based on a trained classification model, thereby obtaining a recommended sample weight distribution. This process does not require user intervention and is highly automated.

[0169] As another example, user modifications to the sample weight distribution can be obtained to update the sample weight distribution. For instance, a user can refer to the evaluation results and/or the displayed overlay results (as described above), infer from experience what modifications to the sample weight distribution should be made, and then input the modifications through the input/output module 110 to update the sample weight distribution.

[0170] In this way, the method can fully consider user needs and update the sample weight distribution based on user modifications, so that the resulting recommended sample weight distribution can better meet user expectations and improve user satisfaction.

[0171] At 360, the recommended sample weight distribution is obtained.

[0172] If it is determined at 340 that the evaluation result indicates the bias is not significant, for example, if the evaluation result is not greater than the preset threshold, then the sample weight distribution corresponding to the current evaluation result can be used as the recommended sample weight distribution.

[0173] Thus, embodiments of this disclosure can update the sample weight distribution based on iteratively training a classification model, and can observe the changes in dataset bias as the sample weight distribution is updated. This allows for iterative detection of the dataset to be processed, resulting in an effective and highly reliable recommended sample weight distribution.

[0174] The input/output module 110 can also present the recommended sample weight distribution for the user to use as a reference for further adjustments to the dataset to be processed. For example, the recommended sample weight distribution can be presented visually through a graphical user interface.

[0175] For example, the dataset processing module 120 can add or delete data in the dataset to be processed based on the obtained recommended sample weight distribution to construct an unbiased dataset.

[0176] As an example, the dataset processing module 120 can copy data items with high recommended sample weights to increase the number of data items in the dataset. The dataset processing module 120 can also delete data items with low recommended sample weights to reduce the number of data items in the dataset.
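
The copy and delete operations described above can be sketched as follows. The function name `rebalance`, the cutoff `low`, and the choice to copy an item roughly as many times as its weight are illustrative assumptions; the disclosure does not specify what counts as a high or low weight.

```python
def rebalance(items, recommended_weights, low=0.5):
    """Build an adjusted dataset from recommended sample weights.

    Items whose weight is below `low` are deleted; the remaining items are
    copied so that each appears about round(weight) times (at least once).
    Both thresholds are assumptions made for this sketch.
    """
    adjusted = []
    for item, weight in zip(items, recommended_weights):
        if weight < low:
            continue  # delete a data item with a low recommended weight
        copies = max(1, round(weight))
        adjusted.extend([item] * copies)  # copy a data item with a high weight
    return adjusted


rebalanced = rebalance(["a", "b", "c"], [3, 1, 0])
```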

[0177] As an example, the input/output module 110 can receive a user's deletion instruction for a portion of the data items to be processed, thereby deleting that portion of the data items. The input/output module 110 can also receive other data items input by the user, adding them to the current dataset to be processed.

[0178] For example, users can add or delete data from the dataset to be processed based on the recommended sample weight distribution. For instance, a user can find other samples similar to the data items to be processed that have high recommended sample weights and add them as new data items to the dataset, thus supplementing the dataset. As an example, similar other samples could be other images captured by the same image acquisition device (or a device of the same model) in similar environments (such as lighting conditions).

[0179] Thus, in the embodiments of this disclosure, data can be added to or deleted from the dataset to be processed based on the recommended sample weight distribution, thereby constructing an unbiased dataset. Furthermore, this unbiased dataset can be used to train more robust and unbiased models for specific tasks.

[0180] It can be understood that the system 100 shown in Figure 1 can be a system capable of interacting with a user, and the system 100 can be a software system, a hardware system, or a combination of software and hardware.

[0181] In some examples, the system 100 may be implemented as a computing device or part of a computing device, wherein the computing device includes, but is not limited to, desktop computers, mobile terminals, wearable devices, servers, cloud servers, etc.

[0182] It can be understood that the system 100 shown in Figure 1 can be implemented as an artificial intelligence (AI) platform. An AI platform provides AI developers and users with a convenient AI development environment and convenient development tools. The AI platform can have various built-in AI models or AI sub-models for solving different problems, and can build suitable AI models based on requirements input by the user. That is, the user only needs to define their requirements in the AI platform, prepare the dataset according to the prompts, and upload it to the AI platform; the AI platform can then train an AI model that meets the user's requirements. The AI model in this embodiment can be used to evaluate data bias in the dataset to be processed that is input by the user.

[0183] Figure 4 shows a schematic diagram of scenario 400, in which system 100 according to an embodiment of the present disclosure is deployed in a cloud environment. In scenario 400, system 100 is deployed entirely in cloud environment 410.

[0184] Cloud environment 410 is an entity that provides cloud services to users using basic resources under the cloud computing model. Cloud environment 410 includes cloud data center 412 and cloud service platform 414. Cloud data center 412 includes a large number of basic resources (including computing resources, storage resources, and network resources) owned by the cloud service provider. The computing resources included in cloud data center 412 can be a large number of computing devices (such as servers). System 100 can be deployed independently on a server or virtual machine within cloud data center 412, or system 100 can be deployed in a distributed manner on multiple servers within cloud data center 412, on multiple virtual machines within cloud data center 412, or on servers and virtual machines within cloud data center 412.

[0185] As shown in Figure 4, system 100 can be abstracted into an AI development cloud service 424 by a cloud service provider on a cloud service platform 414 and provided to users. After users purchase this cloud service on the cloud service platform 414 (pre-payment is possible, with settlement based on final resource usage), the cloud environment 410 uses system 100 deployed in cloud data center 412 to provide the AI development cloud service 424 to users. When using the AI development cloud service 424, users can upload datasets to be processed through an application program interface (API) or a GUI. System 100 in cloud environment 410 receives the datasets uploaded by users and can perform operations such as dataset processing, model training, and dataset adjustment. System 100 can return model evaluation results and the recommended sample weight distribution to users through the API or GUI.

[0186] In another embodiment of this application, when the system 100 under the cloud environment 410 is abstracted into an AI development cloud service 424 and provided to users, it can be divided into two parts, such as a dataset bias assessment cloud service and a dataset adjustment cloud service. Users can purchase only the dataset bias assessment cloud service on the cloud service platform 414. This cloud service platform 414 can construct an irrelevant dataset based on the user-uploaded dataset to be processed, obtain a classification model through training, and return the evaluation results of the classification model to the user so that the user can know the significance of the bias in the dataset to be processed. Users can also further purchase the dataset adjustment cloud service on the cloud service platform 414. This cloud service platform 414 can iteratively train the classification model based on the sample weight distribution, update the sample weight distribution, and return a recommended sample weight distribution to the user so that the user can refer to the recommended sample weight distribution to add or delete data in the dataset to be processed to construct an unbiased dataset.

[0187] Figure 5 shows a schematic diagram of scenario 500, in which system 100 according to an embodiment of the present disclosure is deployed in different environments. In scenario 500, system 100 is deployed in a distributed manner in different environments, which may include, but are not limited to, at least two of cloud environment 510, edge environment 520, and terminal computing device 530.

[0188] System 100 can be logically divided into multiple parts, each with a different function. For example, as shown in Figure 1, system 100 includes an input/output module 110, a dataset processing module 120, a model training module 130, a model storage module 140, and a data storage module 150. The parts of system 100 can be deployed in any two or three of the following environments: a terminal computing device 530, an edge environment 520, and a cloud environment 510. The parts of system 100 deployed in different environments work together to provide users with various functions. For example, in one scenario, the input/output module 110 and data storage module 150 of system 100 are deployed in the terminal computing device 530, the dataset processing module 120 of system 100 is deployed in an edge computing device of the edge environment 520, and the model training module 130 and model storage module 140 of system 100 are deployed in the cloud environment 510. The user sends the dataset to be processed to the input/output module 110 in the terminal computing device 530, and the terminal computing device 530 stores the dataset to be processed in the data storage module 150. The dataset processing module 120 in the edge computing device of the edge environment 520 constructs an irrelevant dataset based on the dataset to be processed from the terminal computing device 530. The model training module 130 in the cloud environment 510 trains a classification model based on the irrelevant dataset from the edge environment 520. The cloud environment 510 can also store the trained classification model in the model storage module 140. It should be understood that this application does not restrict which parts of the system 100 are deployed in which environment. In actual applications, the deployment can be adapted according to the computing power of the terminal computing device 530, the resource availability of the edge environment 520 and the cloud environment 510, or specific application requirements.

[0189] Edge environment 520 is an environment that includes a collection of edge computing devices located close to terminal computing device 530. Edge computing devices include, but are not limited to, edge servers and edge stations with computing capabilities. It can be understood that system 100 can be deployed alone on a single edge server in edge environment 520, or it can be distributed across multiple edge servers in edge environment 520.

[0190] The terminal computing device 530 includes, but is not limited to, terminal servers, smartphones, laptops, tablets, personal desktop computers, and smart cameras. It is understood that the system 100 can be deployed on a single terminal computing device 530, or it can be distributed across multiple terminal computing devices 530.

[0191] Figure 6 shows a schematic diagram of the structure of a computing device 600 according to an embodiment of the present disclosure. The computing device 600 in Figure 6 can be implemented as a device in the cloud environment 510, a device in the edge environment 520, or a terminal computing device 530 in Figure 5. It should be understood that the computing device 600 shown in Figure 6 can also be regarded as a cluster of computing devices, that is, the computing device 600 includes one or more of the devices in the aforementioned cloud environment 510, edge environment 520, and terminal computing device 530.

[0192] As shown in Figure 6, the computing device 600 includes a memory 610, a processor 620, a communication interface 630, and a bus 640, wherein the bus 640 is used for communication between the various components of the computing device 600.

[0193] The memory 610 may be a read-only memory (ROM), a random access memory (RAM), a hard disk, flash memory, or any combination thereof. The memory 610 may store programs, and when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to execute the processes that the various modules in the system 100 described above can perform. It should be understood that the processor 620 and the communication interface 630 may also be used to execute some or all of the contents of the data processing method embodiments described below in this specification. The memory may also store datasets and classification models. For example, a portion of the storage resources in the memory 610 may be divided into a data storage module for storing datasets, such as datasets to be processed, irrelevant datasets, etc., and a portion of the storage resources in the memory 610 may be divided into a model storage module for storing classification models.

[0194] Processor 620 may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or any combination thereof. Processor 620 may include one or more chips. Processor 620 may include an accelerator, such as a Neural Processing Unit (NPU).

[0195] The communication interface 630 uses a transceiver module, such as a transceiver, to enable communication between the computing device 600 and other devices or communication networks. For example, data can be acquired through the communication interface 630.

[0196] Bus 640 may include a pathway for transmitting information between various components of computing device 600 (e.g., memory 610, processor 620, communication interface 630).

[0197] Figure 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present disclosure. The method 700 shown in Figure 7 can be executed by system 100.

[0198] As shown in Figure 7, in box 710, an irrelevant dataset is constructed based on the dataset to be processed. The irrelevant dataset includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed.

[0199] For example, the dataset to be processed includes multiple data items to be processed, each with a label. Each data item to be processed may include portions related to the label and portions unrelated to the label.

[0200] In some embodiments, the portion associated with the label of the target data item can be removed from the target data item in the dataset to be processed, resulting in the remaining portion of the target data item. This remaining portion is then used to construct an irrelevant data item in the irrelevant dataset, where the label of the irrelevant data item corresponds to the label of the target data item.

[0201] In some embodiments, the dataset to be processed is an image dataset, that is, the data items to be processed are images. Image segmentation can then be performed on the target data item in the dataset to obtain a background image corresponding to the target data item. This background image is then used to construct an irrelevant data item in the irrelevant dataset.

[0202] Specifically, the part of the image associated with the label is the foreground region, and the other regions in the image besides the foreground region are the background regions. Therefore, irrelevant data items can be determined based solely on the background region through foreground-background separation.
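
A minimal sketch of the foreground-background separation step is given below. It assumes that a foreground mask has already been produced by some image segmentation method, which the disclosure does not specify; the function name `extract_background` and the choice to fill removed pixels with a constant are illustrative assumptions.

```python
import numpy as np


def extract_background(image, foreground_mask, fill_value=0):
    """Keep only the background region of an image.

    image:           array of shape (H, W) or (H, W, C).
    foreground_mask: boolean array of shape (H, W), True where the pixel
                     belongs to the label-related foreground (assumed to
                     come from a separate segmentation step).
    """
    image = np.asarray(image)
    mask = np.asarray(foreground_mask, dtype=bool)
    background = image.copy()
    background[mask] = fill_value  # blank out the label-related region
    return background


image = np.arange(12).reshape(2, 2, 3)
mask = np.array([[True, False], [False, False]])
background_image = extract_background(image, mask)
```

The resulting background image would carry the same label as the original image, since the irrelevant data item inherits the label of the data item it was derived from.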

[0203] In some embodiments, the data item to be processed in the dataset is a video sequence. Then, a binary image of the video sequence can be determined based on the gradient information between one frame and the previous frame. A background image of the video sequence is then generated based on the binary image. Subsequently, an irrelevant data item in the irrelevant dataset is constructed using the background image of the video sequence.

[0204] Figure 8 shows a schematic flowchart of a process 800 for constructing irrelevant data items according to an embodiment of the present disclosure. Specifically, Figure 8 shows the process of constructing irrelevant data items based on data items to be processed that are video sequences.

[0205] As shown in Figure 8, in box 810, gradient information between two adjacent frames in the target video sequence is calculated.

[0206] For example, the gradient of the feature vectors of two frames along the time dimension can be calculated to obtain gradient information. In this way, static background parts in the video sequence, such as image borders, can be obtained.

[0207] In box 820, a gradient overlay map is obtained based on the overlay of gradient information.

[0208] For example, the gradient information obtained in box 810 can be weighted and summed, or its maximum or minimum value can be calculated, to complete the superposition and obtain the gradient overlay map.

[0209] In box 830, the gradient overlay map is thresholded to obtain the initial binary image.

[0210] In box 840, morphological processing is performed on the initial binary image to obtain a binary image.

[0211] For example, the initial binary image is subjected to several morphological dilations, and then the same number of morphological erosions are performed to obtain a binary image.

[0212] In box 850, a background image is obtained based on the binary image, and the background image is used as an irrelevant data item corresponding to the video sequence.

[0213] For example, a matting operation can be performed on a binary image, such as by matrix dot product, to obtain the background image.

[0214] Thus, considering the similarity between the frames in the video sequence and the fact that the background remains basically unchanged in the video sequence, the background image corresponding to the video sequence can be obtained.
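
Boxes 810 to 850 can be sketched with NumPy as follows. This is only an illustration under several assumptions that the disclosure leaves open: frames are grayscale arrays, the superposition in box 820 takes the maximum, the binary image marks moving (foreground) pixels, morphology uses a 3x3 neighborhood, and the matting in box 850 is an element-wise product with the first frame. All function names are hypothetical.

```python
import numpy as np


def _dilate(mask):
    """3x3 binary dilation built from shifted copies of the mask."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    height, width = mask.shape
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + height, dx:dx + width]
    return out


def _erode(mask):
    """3x3 binary erosion, computed as the complement of dilating the
    complement (pixels outside the image are treated as set)."""
    return ~_dilate(~mask)


def video_background(frames, threshold=10.0, iterations=1):
    frames = np.asarray(frames, dtype=float)      # shape (T, H, W)
    gradients = np.abs(np.diff(frames, axis=0))   # box 810: temporal gradient
    overlay = gradients.max(axis=0)               # box 820: superposition (max)
    moving = overlay > threshold                  # box 830: initial binary image
    for _ in range(iterations):                   # box 840: dilations ...
        moving = _dilate(moving)
    for _ in range(iterations):                   # ... then as many erosions
        moving = _erode(moving)
    return frames[0] * ~moving                    # box 850: element-wise matting


# Static scene of value 50 with one bright pixel moving across three frames.
frames = np.full((3, 10, 10), 50.0)
frames[0, 2, 2] = 200.0
frames[1, 2, 3] = 200.0
frames[2, 3, 3] = 200.0
background = video_background(frames)
```

Pixels that change between frames are zeroed in the result, while the static part of the scene is kept as the background image.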

[0215] Furthermore, the label of irrelevant data items is determined based on the label of the data item to be processed. Specifically, if the target data item to be processed has label A, and an irrelevant data item is obtained by processing the target data item (such as image segmentation), then the label of the irrelevant data item is also label A.

[0216] In box 720, the irrelevant dataset is divided into a first dataset and a second dataset. The first dataset has a first sample weight distribution, and the second dataset has a second sample weight distribution. The first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the dataset to be processed.

[0217] The sample weight of irrelevant data items is determined based on the sample weight of the data items to be processed. Specifically, if the target data item to be processed has a sample weight w, and target irrelevant data items are obtained by processing the target data item (such as image segmentation), then the sample weight of the target irrelevant data item is also the sample weight w.

[0218] The method of dividing the irrelevant dataset into the first and second datasets is not limited in this disclosure. For example, the division can use a 9:1 ratio, such that the ratio of the number of irrelevant data items in the first dataset to the number of irrelevant data items in the second dataset is approximately 9:1. Alternatively, the division can use a 1:1 ratio, such that the ratio of the number of irrelevant data items in the first dataset to the number of irrelevant data items in the second dataset is approximately 1:1. Furthermore, the first dataset can be further divided into a first subset and a second subset, for example, such that the ratio of the number of irrelevant data items in the first subset to the number of irrelevant data items in the second subset is approximately 7:2. It is understood that the ratios listed herein are merely illustrative and do not constitute a limitation on the embodiments of this disclosure.
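
The division in box 720 can be sketched as below, with each irrelevant data item carrying the sample weight inherited from its source data item. The function name `split_irrelevant_dataset`, the random shuffle, and the default 9:1 ratio are assumptions for illustration only.

```python
import random


def split_irrelevant_dataset(items, weights, first_fraction=0.9, seed=0):
    """Split items into a first and a second dataset, keeping each item
    paired with its sample weight.

    Returns two lists of (item, weight) tuples; the weights in each list
    form the first and second sample weight distributions.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # assumed: random assignment
    cut = int(round(len(indices) * first_fraction))
    first = [(items[i], weights[i]) for i in indices[:cut]]
    second = [(items[i], weights[i]) for i in indices[cut:]]
    return first, second


items = [f"bg{i}" for i in range(10)]
weights = [1.0] * 10
first_dataset, second_dataset = split_irrelevant_dataset(items, weights)
```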

[0219] In box 730, the classification model is trained based on the first dataset and the first sample weight distribution.

[0220] Specifically, the first dataset can be sampled based on the first sample weight distribution, and the classification model can be trained based on the labels of irrelevant data items in the first dataset.
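
One possible way to sample the first dataset according to its sample weight distribution is to draw items with probability proportional to their weights, as in the sketch below. The disclosure does not state the sampling scheme; sampling with replacement and the name `sample_training_batch` are assumptions.

```python
import numpy as np


def sample_training_batch(items, weights, batch_size, seed=0):
    """Draw a batch in which each item is chosen with probability
    proportional to its sample weight (sampling with replacement)."""
    weights = np.asarray(weights, dtype=float)
    probabilities = weights / weights.sum()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(items), size=batch_size, replace=True, p=probabilities)
    return [items[i] for i in chosen]


# An item with weight 0 is never drawn; heavier items are drawn more often.
batch = sample_training_batch(["bg0", "bg1", "bg2"], [0.0, 1.0, 3.0], batch_size=8)
```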

[0221] In other words, the first dataset can be used as the training set to train the classification model. Optionally, the first dataset can be preprocessed before training, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising, etc.

[0222] The embodiments disclosed herein do not limit the specific structure of the classification model. For example, it can be a convolutional neural network, which includes at least convolutional layers and fully connected layers.

[0223] In box 740, the classification model is evaluated based on the second dataset and the second sample weight distribution to obtain an evaluation result that indicates the significance of bias in the dataset to be processed with the sample weight distribution.

[0224] In other words, the second dataset can be used as a test set to obtain the evaluation results. Specifically, the evaluation results can be obtained by comparing the prediction results of the classification model for irrelevant data items in the second dataset with the labels of the irrelevant data items in the second dataset.

[0225] As an example, the evaluation results may include the first accuracy for positive samples in the second dataset and the second accuracy for negative samples in the second dataset.
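
The two accuracies mentioned above can be computed as in the following sketch. How the second sample weight distribution enters the evaluation is not specified in the disclosure; here each item is assumed to contribute to the accuracy in proportion to its weight. The function name is hypothetical.

```python
def weighted_class_accuracies(predictions, labels, weights):
    """Return (first_accuracy, second_accuracy): the weighted accuracy on
    positive samples (label 1) and on negative samples (label 0)."""
    total = {1: 0.0, 0: 0.0}
    correct = {1: 0.0, 0: 0.0}
    for prediction, label, weight in zip(predictions, labels, weights):
        total[label] += weight
        if prediction == label:
            correct[label] += weight
    first = correct[1] / total[1] if total[1] else 0.0
    second = correct[0] / total[0] if total[0] else 0.0
    return first, second


first_accuracy, second_accuracy = weighted_class_accuracies(
    predictions=[1, 0, 0, 1], labels=[1, 1, 0, 0], weights=[2.0, 1.0, 1.0, 1.0]
)
```

Because the irrelevant data items contain no label-related content, accuracies well above chance suggest that the labels can be predicted from irrelevant content, which is the bias being measured.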

[0226] Thus, by constructing an irrelevant dataset and training and evaluating a classification model on it, the embodiments of this disclosure can obtain a quantitative representation of the bias significance of the dataset to be processed. This provides a quantitative bias reference, facilitating further adjustments to the dataset to be processed.

[0227] For example, if the evaluation results obtained in box 740 indicate that the bias is significant (or that there is a significant bias), then the sample weight distribution of the dataset to be processed can be updated.

[0228] In some embodiments, if the evaluation result is greater than a preset threshold, the sample weight distribution of the dataset to be processed is updated. Further, after this, the process can return to 720 to obtain the first and second datasets again, and repeat steps 730 and 740 until the evaluation result obtained in step 740 indicates that the bias is not significant (or that there is no significant bias), for example, the evaluation result is not greater than the preset threshold. Subsequently, the sample weight distribution when the evaluation result is not greater than the preset threshold can be used as the recommended sample weight distribution, and this recommended sample weight distribution is output.

[0229] The specific method for updating the sample weight distribution in this embodiment is not limited. For example, at least one of the following methods can be used to update the sample weight distribution: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution using a random method, obtaining user modifications to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.

[0230] In some implementations of this disclosure, updating the sample weight distribution can update the first sample weight distribution of the first dataset. Thus, when returning to execute 720, the first sample weight distribution of the first dataset in the re-executed 720 is updated, and consequently the classification model trained at 730 is also updated.

[0231] In another implementation of this disclosure, updating the sample weight distribution can update the first sample weight distribution of the first dataset and update the second sample weight distribution of the second dataset. As one example, the sample weight distribution of the dataset to be processed can be updated, and the irrelevant dataset can be re-split. As another example, the sample weight distribution of the dataset to be processed can be updated, thereby adaptively updating the first and second sample weight distributions, while the irrelevant data items in the first and second datasets remain unchanged. Thus, when returning to execute 720, the first dataset in the re-executed 720 is updated, or the first sample weight distribution of the first dataset is updated, and consequently, the classification model trained at 730 is also updated.

[0232] In another implementation of this disclosure, updating the sample weight distribution can update the second sample weight distribution of the second dataset. Optionally, the first sample weight distribution may remain unchanged. As an example, in this implementation, when returning to execution 720, the first and second datasets from the previous execution 720 can be swapped. Thus, the first dataset when returning to execution 730 is the second dataset from the previous execution. This allows for a more comprehensive consideration of the datasets to be processed, making the classification model's evaluation of bias significance more accurate.

[0233] Figure 9 shows a schematic diagram of a process 900 for updating the sample weight distribution of a dataset to be processed according to an embodiment of the present disclosure.

[0234] As shown in Figure 9, in box 910, an irrelevant dataset is constructed based on the dataset to be processed. The irrelevant dataset includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed.

[0235] In box 920, the irrelevant dataset is divided into a first dataset and a second dataset. The first dataset has a first sample weight distribution, and the second dataset has a second sample weight distribution. The first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the dataset to be processed.

[0236] In box 930, the classification model is trained based on the first dataset and the first sample weight distribution.

[0237] In box 940, the classification model is evaluated based on the second dataset and the second sample weight distribution to obtain an evaluation result that indicates the significance of bias in the dataset to be processed with the sample weight distribution.

[0238] For boxes 910 to 940 in Figure 9, reference may be made to the descriptions of boxes 710 to 740 in Figure 7 above; for the sake of brevity, they are not repeated here.

[0239] In box 950 of Figure 9, it is determined whether the evaluation result is greater than a preset threshold. If the evaluation result is determined to be greater than the preset threshold, the process proceeds to box 960. If the evaluation result is determined not to be greater than the preset threshold, the process proceeds to box 980.

[0240] In box 960, the second sample weight distribution of the second dataset is updated.

[0241] As examples, the sample weights of all irrelevant data items in the second dataset can be updated, or the sample weights of some irrelevant data items in the second dataset can be updated.

[0242] As an example, the second sample weight distribution can be updated based on the predictions made in box 940 by the classification model for irrelevant data items in the second dataset.

[0243] Specifically, the sample weights of irrelevant data items in the second dataset that are correctly predicted can be increased, or the sample weights of irrelevant data items in the second dataset that are incorrectly predicted can be decreased. For example, suppose the sample weight of a first irrelevant data item in the second dataset is 2, and the prediction result obtained by inputting the first irrelevant data item into the classification model is consistent with its label; then the sample weight of the first irrelevant data item can be increased, for example, from 2 to 3, 4, or another value. Conversely, suppose the sample weight of a second irrelevant data item in the second dataset is 2, and the prediction result obtained by inputting the second irrelevant data item into the classification model is inconsistent with its label; then the sample weight of the second irrelevant data item can be decreased, for example, from 2 to 1.

[0244] In box 970, the first dataset with a first sample weight distribution is swapped with the second dataset with an updated second sample weight distribution.

[0245] It can be understood that the first dataset after the swap is the second dataset in box 920, and the first sample weight distribution of the first dataset after the swap is the second sample weight distribution updated in box 960. The second dataset after the swap is the first dataset in box 920, and the second sample weight distribution of the second dataset after the swap is the first sample weight distribution in box 920.

[0246] After box 970, the process returns to box 930. That is, the classification model is retrained using the first dataset obtained after the swap in box 970.

[0247] In box 980, the recommended sample weight distribution is output.

[0248] For example, the sample weight distribution when the evaluation result is no greater than a preset threshold is used as the recommended sample weight distribution. Specifically, the recommended sample weight distribution can be determined based on the first sample weight distribution and the second sample weight distribution.
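
The control flow of boxes 930 to 980 can be sketched as below. The training and evaluation steps are passed in as callables because the disclosure does not fix the model; `train_fn`, `evaluate_fn`, `max_iters`, the weight step of 1, and the floor of 0 on decreased weights are all assumptions made for this illustration. The toy callables at the end are stand-ins that only exercise the loop and do not train anything.

```python
def update_second_weights(second, predictions, step=1.0):
    """Box 960: raise the weight of correctly predicted items and lower the
    weight of incorrectly predicted ones (never below 0, an assumption)."""
    updated = []
    for (item_id, label, weight), prediction in zip(second, predictions):
        if prediction == label:
            weight = weight + step
        else:
            weight = max(weight - step, 0.0)
        updated.append((item_id, label, weight))
    return updated


def recommend_weights(first, second, train_fn, evaluate_fn, threshold, max_iters=10):
    """first/second: lists of (item_id, label, weight) tuples."""
    for _ in range(max_iters):
        model = train_fn(first)                              # box 930
        score, predictions = evaluate_fn(model, second)      # box 940
        if score <= threshold:                               # box 950
            break
        second = update_second_weights(second, predictions)  # box 960
        first, second = second, first                        # box 970
    # Box 980: combine both weight distributions into the recommendation.
    return {item_id: weight for item_id, _, weight in first + second}


# Stand-in callables for demonstration only.
def toy_train(data):
    return None


def toy_evaluate(model, data):
    predictions = [label for _, label, _ in data]       # pretend all are correct
    score = 1.0 / sum(weight for _, _, weight in data)  # arbitrary shrinking score
    return score, predictions


recommended = recommend_weights(
    [("a", 1, 1.0)], [("b", 0, 1.0)], toy_train, toy_evaluate, threshold=0.4
)
```

The `max_iters` cap is a practical safeguard added in this sketch so that the loop ends even if the threshold is never reached.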

[0249] In some embodiments of this disclosure, the regions that the classification model attends to when dataset bias is present can be visualized. Specifically, a target irrelevant data item can be input into the trained classification model to obtain a class activation map. The class activation map is then overlaid on the target irrelevant data item to obtain an overlay result, which is then displayed. As an example, this overlay result can be obtained by a weighted summation of the heatmap and the data item. By displaying the overlay result, it is possible to see intuitively which regions the classification model focuses on, and thus which regions are significant factors contributing to the bias.
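
The overlay step can be sketched as a weighted sum of a normalized heatmap and the image. The sketch assumes the class activation map has already been computed by the classification model and has the same height and width as the image; the function name `overlay_heatmap` and the blending weight `alpha` are assumptions.

```python
import numpy as np


def overlay_heatmap(image, class_activation_map, alpha=0.4):
    """Blend a class activation map onto an image of shape (H, W, C).

    The map is min-max normalized to [0, 255] and combined with the image
    as alpha * heatmap + (1 - alpha) * image.
    """
    image = np.asarray(image, dtype=float)
    cam = np.asarray(class_activation_map, dtype=float)
    span = cam.max() - cam.min()
    normalized = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)
    heatmap = normalized[..., None] * 255.0       # broadcast over channels
    blended = alpha * heatmap + (1.0 - alpha) * image
    return np.clip(blended, 0, 255).astype(np.uint8)


overlay = overlay_heatmap(np.zeros((2, 2, 3)), [[0.0, 1.0], [0.0, 0.0]], alpha=0.5)
```

In practice the heatmap is often mapped to a color scale before blending; a single-intensity heatmap is used here to keep the sketch short.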

[0250] In some embodiments of this disclosure, after obtaining the recommended sample weight distribution, the dataset to be processed may optionally be adjusted based on the recommended sample weight distribution to obtain an unbiased dataset.

[0251] For example, an unbiased dataset can be constructed by adding or deleting data from the dataset to be processed.

[0252] As an example, data items with high recommended sample weights can be copied to increase the number of data items in the dataset. Conversely, data items with low recommended sample weights can be deleted to reduce the number of data items in the dataset.

[0253] As an example, a user's deletion command for some data items can be obtained to delete those items. Other data items input by the user can be obtained and added to the current dataset.

[0254] For example, users can add data to or delete data from the dataset to be processed based on the recommended sample weight distribution. For instance, a user can find other samples similar to the data items to be processed that have high recommended sample weights and add them to the dataset as new data items, thus supplementing the dataset. As an example, similar samples could be other images captured by the same image acquisition device (or a device of the same model) in similar environments (such as similar lighting conditions).

[0255] Thus, in the embodiments of this disclosure, data can be added to or deleted from the dataset to be processed based on the recommended sample weight distribution, thereby constructing an unbiased dataset. Furthermore, this unbiased dataset can be used to train more robust and unbiased models for specific tasks.

[0256] It is understood that, in the embodiments of this disclosure, the processes shown in Figures 7 to 9 can be read with reference to the functions of the modules described above in conjunction with Figures 1 to 6. For the sake of brevity, they are not repeated here.

[0257] Figure 10 shows a schematic block diagram of a data processing apparatus 1000 according to an embodiment of the present disclosure. The apparatus 1000 can be implemented by software, hardware, or a combination of both. In some embodiments, the apparatus 1000 can be implemented as a software or hardware device providing some or all of the functions of the system 100 shown in Figure 1.

[0258] As shown in Figure 10, the apparatus 1000 includes a construction unit 1010, a partitioning unit 1020, a training unit 1030, and an evaluation unit 1040.

[0259] The construction unit 1010 is configured to construct an irrelevant dataset based on the dataset to be processed. The irrelevant dataset includes irrelevant data items with labels, and the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed.

[0260] The partitioning unit 1020 is configured to partition an irrelevant dataset into a first dataset and a second dataset. The first dataset has a first sample weight distribution, and the second dataset has a second sample weight distribution. The first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the dataset to be processed.
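A minimal sketch of such a partition is shown below, assuming each irrelevant data item simply inherits the sample weight of the data item it was built from. The function name `partition`, the 80/20 split controlled by `train_fraction`, and the fixed `seed` are assumptions for illustration only.

```python
import random


def partition(irrelevant_items, sample_weights, train_fraction=0.8, seed=0):
    """Split items into a first and a second dataset, each with its own weights.

    sample_weights[i] is assumed to be the weight of the data item to be
    processed from which irrelevant_items[i] was constructed.
    """
    if len(irrelevant_items) != len(sample_weights):
        raise ValueError("items and weights must have the same length")
    indices = list(range(len(irrelevant_items)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    first_idx, second_idx = indices[:cut], indices[cut:]
    first = ([irrelevant_items[i] for i in first_idx],
             [sample_weights[i] for i in first_idx])
    second = ([irrelevant_items[i] for i in second_idx],
              [sample_weights[i] for i in second_idx])
    return first, second
```

Each returned pair holds a dataset and the weight distribution that travels with it, so a later change to the weights of the original data items can be propagated to both parts.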

[0261] The training unit 1030 is configured to train the classification model based on the first dataset and the first sample weight distribution.

[0262] The evaluation unit 1040 is configured to evaluate the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result that indicates the significance of bias in the dataset to be processed with the sample weight distribution.
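To make the role of the two weight distributions concrete, the sketch below trains and evaluates a deliberately tiny stand-in model. The weighted nearest-centroid classifier and the weighted-accuracy metric are assumptions chosen for brevity; the disclosure does not prescribe a particular model or metric, and all function names here are hypothetical.

```python
def train_centroids(features, labels, weights):
    """Fit a weighted nearest-centroid classifier (stand-in for the real model)."""
    sums, totals = {}, {}
    for x, y, w in zip(features, labels, weights):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += w * v          # weighted sum of feature values per class
        totals[y] = totals.get(y, 0.0) + w
    return {y: [s / totals[y] for s in sums[y]] for y in sums if totals[y] > 0}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))


def weighted_accuracy(centroids, features, labels, weights):
    """Evaluation result: share of total sample weight that is classified correctly."""
    total = sum(weights)
    correct = sum(w for x, y, w in zip(features, labels, weights)
                  if predict(centroids, x) == y)
    return correct / total if total else 0.0
```

Under this reading, a high weighted accuracy on background-only data suggests that background features alone predict the label, which would indicate significant bias.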

[0263] In some embodiments, the device 1000 may further include an update unit 1050, an adjustment unit 1060, and a display unit 1070.

[0264] The update unit 1050 is configured to update the sample weight distribution of the dataset to be processed if the evaluation result obtained by the evaluation unit 1040 is greater than a preset threshold.

[0265] As an example, the update unit 1050 can be configured to update a portion of the sample weight distribution, such that the second sample weight distribution is updated without updating the first sample weight distribution.

[0266] In some embodiments, the update unit 1050 may be configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule, updating the sample weight distribution in a random manner, obtaining user modifications to the sample weight distribution to update the sample weight distribution, or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.
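As one hedged illustration of the genetic-algorithm option, the sketch below performs a single generation over candidate sample weight vectors. The selection scheme, single-point crossover, Gaussian mutation, the `mutation_scale` value, and the convention that a lower evaluation result is better are all assumptions; the disclosure only states that a genetic algorithm may be used.

```python
import random


def evolve_weights(population, evaluate, rng, mutation_scale=0.1):
    """Run one generation of a simple genetic algorithm over weight vectors.

    population: list of at least two weight vectors, each of length >= 2.
    evaluate: callable returning the evaluation result (lower is assumed better).
    rng: a random.Random instance, passed in for reproducibility.
    """
    ranked = sorted(population, key=evaluate)
    parents = ranked[: max(2, len(ranked) // 2)]   # keep the best candidates
    children = []
    while len(parents) + len(children) < len(population):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))             # single-point crossover
        child = a[:cut] + b[cut:]
        # Mutation: small Gaussian perturbation, clipped to stay non-negative.
        child = [max(0.0, w + rng.gauss(0.0, mutation_scale)) for w in child]
        children.append(child)
    return parents + children
```

A caller would supply an `evaluate` function that trains and evaluates the classification model under a candidate weight vector, then call `evolve_weights` repeatedly until some candidate meets the threshold.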

[0267] In some embodiments, the update unit 1050 can be configured to use the sample weight distribution when the evaluation result is not greater than a preset threshold as the recommended sample weight distribution.
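The interplay between evaluation, threshold check, and update can be sketched as a simple loop. The function name `find_recommended_weights`, the callback parameters, and the `max_rounds` safety limit are assumptions added for illustration; the text above only describes repeating until the evaluation result is no greater than the preset threshold.

```python
def find_recommended_weights(weights, train_and_evaluate, update, threshold,
                             max_rounds=100):
    """Iterate until the evaluation result is no greater than the threshold.

    train_and_evaluate: callable mapping a weight distribution to an
        evaluation result (hypothetical interface).
    update: callable returning a new weight distribution.
    Returns (recommended_weights, result), or (None, last_result) if the
    assumed round limit is reached first.
    """
    result = None
    for _ in range(max_rounds):
        result = train_and_evaluate(weights)
        if result <= threshold:
            return weights, result   # recommended sample weight distribution
        weights = update(weights)
    return None, result
```

Any of the update strategies listed in the preceding paragraph could be passed as `update`.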

[0268] The adjustment unit 1060 is configured to add data to or delete data from the dataset to be processed based on the recommended sample weight distribution in order to construct an unbiased dataset.

[0269] The update unit 1050 is also configured to: obtain a class activation map by inputting target irrelevant data items into the trained classification model; and obtain an overlay result by overlaying the class activation map on the target irrelevant data items.

[0270] The display unit 1070 is configured to display the recommended sample weight distribution and/or the overlay result.

[0271] In some embodiments, the construction unit 1010 may be configured to remove the portion associated with the label of the target data item from the target data item in the dataset to be processed, so as to obtain the remaining portion in the target data item; and to use the remaining portion to construct an irrelevant data item in the irrelevant dataset, wherein the label of the irrelevant data item corresponds to the label of the target data item.

[0272] In some embodiments, the dataset to be processed is an image dataset, and the construction unit 1010 can be configured to perform image segmentation on a target data item to be processed in the dataset to obtain a background image corresponding to the target data item; and to use the background image to construct an irrelevant data item in the irrelevant dataset.
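Assuming a foreground mask is already available from some segmentation step (not shown, and not specified here), building a background-only data item can be sketched as follows. The function names, the `fill` value, and the dictionary layout of the result are hypothetical.

```python
def background_from_mask(image, foreground_mask, fill=0):
    """Keep background pixels and replace foreground pixels with a fill value.

    image: 2-D list of pixel values.
    foreground_mask: 2-D list of the same shape, truthy where the labeled
        object is (assumed output of an external segmentation step).
    """
    return [[fill if m else p for p, m in zip(pixel_row, mask_row)]
            for pixel_row, mask_row in zip(image, foreground_mask)]


def build_irrelevant_item(image, foreground_mask, label):
    """Pair the background image with the label of the original data item."""
    return {"data": background_from_mask(image, foreground_mask), "label": label}
```

The resulting item keeps the original label while containing no pixels from the labeled object, which is what lets a classifier trained on such items expose label-background correlations.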

[0273] In some embodiments, the data item to be processed in the dataset is a video sequence. The construction unit 1010 can be configured to determine a binary image of the video sequence based on the gradient information between a frame of the video sequence and the previous frame of the video sequence; generate a background image of the video sequence based on the binary image; and use the background image of the video sequence to construct an irrelevant data item in the irrelevant dataset.
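One possible reading of this step uses frame differencing as the gradient between consecutive frames. In the sketch below, a pixel is marked in the binary image if it changes by more than a threshold between any two consecutive frames, and the background image is the first frame with those pixels blanked out. The `threshold` and `fill` values, the use of the first frame, and both function names are assumptions.

```python
def motion_binary_image(frames, threshold=10):
    """Mark pixels that change by more than `threshold` between consecutive frames."""
    height, width = len(frames[0]), len(frames[0][0])
    binary = [[0] * width for _ in range(height)]
    for prev, cur in zip(frames, frames[1:]):
        for r in range(height):
            for c in range(width):
                if abs(cur[r][c] - prev[r][c]) > threshold:
                    binary[r][c] = 1   # pixel belongs to a moving (foreground) region
    return binary


def video_background(frames, threshold=10, fill=0):
    """Blank out moving pixels of the first frame to approximate the background."""
    binary = motion_binary_image(frames, threshold)
    return [[fill if b else p for p, b in zip(pixel_row, binary_row)]
            for pixel_row, binary_row in zip(frames[0], binary)]
```

Other background estimates, such as averaging static pixels over time, would fit the same description; this version is chosen only because it is short.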

[0274] The unit division in the embodiments of this disclosure is illustrative and only represents one logical functional division. In actual implementation, other division methods may be used. Furthermore, the functional units in the disclosed embodiments can be integrated into a single processor, can each exist as a separate physical unit, or two or more units can be integrated into a single unit. The integrated units described above can be implemented in hardware or as software functional units.

[0275] The data processing apparatus 1000 shown in Figure 10 can be used to implement the data processing procedures described above in conjunction with Figures 7 to 9.

[0276] This disclosure can also be implemented as a computer program product. A computer program product may include computer-readable program instructions for performing various aspects of this disclosure. This disclosure can also be implemented as a computer-readable storage medium having computer-readable program instructions stored thereon, which, when executed by a processor, cause the processor to perform the aforementioned data processing procedures.

[0277] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof. A computer-readable storage medium as used herein is not to be construed as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through fiber optic cables), or electrical signals transmitted through wires.

[0278] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.

[0279] Computer-readable program instructions for performing the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as "C" or similar languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized using state information of the computer-readable program instructions. These electronic circuits can execute the computer-readable program instructions to implement various aspects of this disclosure.

[0280] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0281] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium and cause a computer, programmable data processing apparatus, and/or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0282] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0283] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, which contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions noted in the blocks may occur in a different order from that noted in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer-readable program instructions.

Claims

1. A data processing method, comprising: constructing an irrelevant dataset based on a dataset to be processed, wherein the irrelevant dataset includes irrelevant data items with labels, the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed, the dataset to be processed includes multiple data items to be processed, the data items to be processed in the dataset to be processed are images or video sequences, the labels are associated with the foreground region of the image or video sequence, and the irrelevant data items in the irrelevant dataset are constructed based on the background image of the corresponding image or video sequence; dividing the irrelevant dataset into a first dataset and a second dataset, wherein the first dataset has a first sample weight distribution, the second dataset has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the dataset to be processed; training a classification model based on the first dataset and the first sample weight distribution; and evaluating the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates the significance of bias in the dataset to be processed with the sample weight distribution, and the bias represents the relationship between features of the background image and labels associated with the foreground region.

2. The method according to claim 1, further comprising: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the dataset to be processed; and repeating the training and the evaluating based on the updated sample weight distribution until the evaluation result is no greater than the preset threshold.

3. The method according to claim 2, wherein updating the sample weight distribution includes: updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.

4. The method according to claim 2 or 3, further comprising: using, as a recommended sample weight distribution, the sample weight distribution for which the evaluation result is not greater than the preset threshold.

5. The method according to claim 4, further comprising: adding data to or deleting data from the dataset to be processed based on the recommended sample weight distribution, to construct an unbiased dataset.

6. The method according to claim 2 or 3, wherein updating the sample weight distribution comprises at least one of the following: updating the sample weight distribution using a predetermined rule; updating the sample weight distribution randomly; obtaining a user's modification of the sample weight distribution to update the sample weight distribution; or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.

7. The method according to any one of claims 1 to 3, wherein constructing the irrelevant dataset based on the dataset to be processed comprises: removing, from a target data item to be processed in the dataset to be processed, the portion associated with the label of the target data item to be processed, so as to obtain the remaining portion of the target data item to be processed; and using the remaining portion to construct an irrelevant data item in the irrelevant dataset, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.

8. The method according to any one of claims 1 to 3, wherein the dataset to be processed is an image dataset, and wherein constructing the irrelevant dataset based on the dataset to be processed comprises: performing image segmentation on a target data item to be processed in the dataset to be processed to obtain a background image corresponding to the target data item to be processed; and using the background image to construct an irrelevant data item in the irrelevant dataset.

9. The method according to any one of claims 1 to 3, wherein the data items to be processed in the dataset to be processed are video sequences, and wherein constructing the irrelevant dataset based on the dataset to be processed comprises: determining a binary image of the video sequence based on gradient information between a frame in the video sequence and the previous frame; generating a background image of the video sequence based on the binary image; and constructing an irrelevant data item in the irrelevant dataset using the background image of the video sequence.

10. The method according to any one of claims 1 to 3, further comprising: obtaining a class activation map (CAM) by inputting target irrelevant data items into the trained classification model; obtaining an overlay result by overlaying the CAM on the target irrelevant data items; and displaying the overlay result.

11. A data processing apparatus, comprising: a construction unit configured to construct an irrelevant dataset based on a dataset to be processed, wherein the irrelevant dataset includes irrelevant data items with labels, the labels of the irrelevant data items are determined based on the labels of the data items to be processed in the dataset to be processed, the dataset to be processed includes multiple data items to be processed, the data items to be processed in the dataset to be processed are images or video sequences, the labels are associated with the foreground region of the image or the video sequence, and the irrelevant data items in the irrelevant dataset are constructed based on the background image of the corresponding image or video sequence; a partitioning unit configured to partition the irrelevant dataset into a first dataset and a second dataset, the first dataset having a first sample weight distribution and the second dataset having a second sample weight distribution, the first sample weight distribution and the second sample weight distribution being determined based on the sample weights of the data items to be processed in the dataset to be processed; a training unit configured to train a classification model based on the first dataset and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second dataset and the second sample weight distribution to obtain an evaluation result indicating the significance of bias in the dataset to be processed with the sample weight distribution, wherein the bias represents the relationship between features of the background image and labels associated with the foreground region.

12. The apparatus of claim 11, further comprising an updating unit configured to: update the sample weight distribution of the dataset to be processed if the evaluation result is greater than a preset threshold.

13. The apparatus of claim 12, wherein the updating unit is configured to: update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.

14. The apparatus of claim 12 or 13, wherein the updating unit is configured to: use, as a recommended sample weight distribution, the sample weight distribution for which the evaluation result is not greater than the preset threshold.

15. The apparatus of claim 14, further comprising an adjustment unit configured to: add data to or delete data from the dataset to be processed based on the recommended sample weight distribution, to construct an unbiased dataset.

16. The apparatus of claim 12 or 13, wherein the updating unit is configured to update the sample weight distribution by at least one of the following: updating the sample weight distribution using a predetermined rule; updating the sample weight distribution randomly; obtaining a user's modification of the sample weight distribution to update the sample weight distribution; or optimizing the sample weight distribution using a genetic algorithm to update the sample weight distribution.

17. The apparatus according to any one of claims 11 to 13, wherein the construction unit is configured to: remove, from a target data item to be processed in the dataset to be processed, the portion associated with the label of the target data item to be processed, so as to obtain the remaining portion of the target data item to be processed; and use the remaining portion to construct an irrelevant data item in the irrelevant dataset, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.

18. The apparatus according to any one of claims 11 to 13, wherein the dataset to be processed is an image dataset, and wherein the construction unit is configured to: perform image segmentation on a target data item to be processed in the dataset to be processed to obtain a background image corresponding to the target data item to be processed; and use the background image to construct an irrelevant data item in the irrelevant dataset.

19. The apparatus according to any one of claims 11 to 13, wherein the data items to be processed in the dataset to be processed are video sequences, and wherein the construction unit is configured to: determine a binary image of the video sequence based on gradient information between a frame in the video sequence and the previous frame; generate a background image of the video sequence based on the binary image; and construct an irrelevant data item in the irrelevant dataset using the background image of the video sequence.

20. The apparatus according to any one of claims 11 to 13, further comprising: an updating unit configured to obtain a class activation map (CAM) by inputting target irrelevant data items into the trained classification model, and to obtain an overlay result by overlaying the CAM on the target irrelevant data items; and a display unit configured to display the overlay result.

21. A computing device, characterized in that the device includes a processor and a memory, the processor reading and executing a computer program stored in the memory, causing the computing device to perform the method according to any one of claims 1 to 10.

22. A computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method according to any one of claims 1 to 10.