A data mining method, device and electronic equipment
By extracting features and determining categories from unlabeled image samples and adjusting the sample selection amount, the problem of difficulty in identifying rare samples in active learning methods is solved, and the discovery and accurate prediction of difficult cases are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2022-09-28
- Publication Date
- 2026-06-23
AI Technical Summary
Existing active learning methods struggle to identify rare and difficult examples, making it difficult to train deep learning models that can accurately predict difficult examples.
By extracting features from unlabeled image samples to determine their categories, and adjusting the selection based on the category and sample selection quantity, the corresponding unlabeled image samples are selected, thus achieving the mining of difficult cases.
Even if there are unlabeled rare samples, they can be discovered, achieving the discovery of hard examples, which can then be used to train deep learning models to correctly predict rare image categories.
Figure CN115578582B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a data mining method, apparatus, and electronic device.
Background Technology
[0002] To ensure the reliability of deep learning models trained for tasks such as image classification (e.g., vehicle brand classification), a large number of manually labeled training samples are currently required. However, manually labeling training samples consumes significant human resources. To avoid this waste, active learning methods are often used to discover valuable samples, achieving better learning performance with less manual labeling.
[0003] However, existing active learning methods struggle to identify rare samples (collectively referred to as hard examples), making it difficult to train deep learning models that can accurately predict hard examples.
Summary of the Invention
[0004] In view of this, embodiments of this application provide a data mining method, apparatus, and electronic device to achieve the mining of difficult cases.
[0005] According to a first aspect of the embodiments of this application, a data mining method is provided, the method being applied to an electronic device, comprising the following steps:
[0006] Feature extraction is performed on the unlabeled image samples to obtain the image features of the unlabeled image samples;
[0007] Based on the image features of each unlabeled image sample, determine the category corresponding to each unlabeled image sample;
[0008] If at least two categories have been determined and they are in the same target category group, then for each category in the target category group that does not meet the following condition, the sample selection amount corresponding to that category, and the sample selection amounts of the other categories in the target category group that meet the condition, are adjusted according to the sample selection amount corresponding to that category and the number of unlabeled image samples belonging to that category. The condition is: the number of unlabeled image samples belonging to the category is less than the sample selection amount corresponding to the category.
[0009] Based on the adjusted sample selection quantity corresponding to each category in the target category group, select the target unlabeled image sample corresponding to the sample selection quantity from all unlabeled image samples belonging to that category.
[0010] According to a second aspect of the embodiments of this application, a data mining apparatus is provided, comprising:
[0011] The feature extraction module is used to extract features from unlabeled image samples to obtain the image features of the unlabeled image samples;
[0012] The category determination module is used to determine the category of each unlabeled image sample based on its image features.
[0013] The sample selection quantity adjustment module is used to, if at least two categories have been determined and they are in the same target category group, adjust the sample selection quantity of each category in the target category group that does not meet the following condition, together with the sample selection quantities of the other categories in the target category group that meet the condition. The condition is: the number of unlabeled image samples belonging to the category is less than the sample selection quantity corresponding to the category.
[0014] The mining module is used to select target unlabeled image samples corresponding to the adjusted sample selection quantity from all unlabeled image samples belonging to the target category group.
[0015] According to a third aspect of the embodiments of this application, an electronic device is provided, the electronic device including a machine-readable storage medium and a processor; the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; the processor is configured to read the machine-executable instructions to implement the steps of the data mining method as described in the first aspect or any optional embodiment of the first aspect.
[0016] The technical solutions provided in this application embodiment may include the following beneficial effects:
[0017] In this embodiment of the application, based on the category corresponding to each unlabeled image sample and the sample selection amount corresponding to each category, the sample selection amount of each category in the target category group that does not meet the condition, and the sample selection amounts of the other categories in the target category group that meet the condition, are adjusted. Target unlabeled image samples are then selected, according to the adjusted sample selection amount of each category in the target category group, from all unlabeled image samples belonging to that category. In this way, even unlabeled rare samples (collectively referred to as hard examples) can be mined.
[0018] Furthermore, since hard examples can be mined according to the method provided in the embodiments of this application, the mined hard examples can be used to train a deep learning model to correctly predict hard examples, such as rare image categories.
Attached Figure Description
[0019] Figure 1 is a flowchart of the method provided in the embodiments of this application.
[0020] Figure 2 is a feature extraction flowchart provided in an embodiment of this application.
[0021] Figure 3 is a flowchart of the category determination provided in the embodiments of this application.
[0022] Figure 4 is another category determination flowchart provided in the embodiments of this application.
[0023] Figure 5 is a flowchart of the sample selection quantity adjustment provided in the embodiments of this application.
[0024] Figure 6 is a flowchart of the classification model optimization provided in the embodiments of this application.
[0025] Figure 7 is a diagram of the apparatus provided in the embodiments of this application.
[0026] Figure 8 is a schematic diagram of the hardware structure of the device according to an embodiment of this application.
Detailed Implementation
[0027] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0028] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
[0029] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
[0030] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, and to make the above-mentioned objectives, features and advantages of the embodiments of this application more apparent and understandable, the technical solutions in the embodiments of this application will be further described in detail below with reference to the accompanying drawings.
[0031] The data mining method provided in this embodiment can proactively discover rare class samples and other difficult examples, thereby solving the problem that existing active learning methods struggle to identify rare samples and other difficult examples. The method provided in this embodiment is described below:
[0032] See Figure 1, which is a flowchart of the method provided in an embodiment of this application. The method is applied to an electronic device; as an example, the electronic device may be a terminal, a server, etc., and this embodiment does not specifically limit this.
[0033] As shown in Figure 1, the process may include the following steps:
[0034] S110: Extract features from unlabeled image samples to obtain image features of unlabeled image samples.
[0035] For example, the unlabeled image samples mentioned above can be vehicle image samples, home appliance image samples, etc. The embodiments of this application do not specifically limit the unlabeled image samples mentioned above.
[0036] For example, in this embodiment, feature extraction of unlabeled image samples in step S110 can be performed using a neural network model. How to use a neural network model for feature extraction is illustrated below with the example of Figure 2 and is not elaborated here.
[0037] For example, in this embodiment, the feature extraction of unlabeled image samples in step S110 can also be performed using principal component analysis, linear discriminant analysis, locality preserving projection, etc., and this embodiment is not specifically limited in this respect. Principal component analysis, linear discriminant analysis, and locality preserving projection are all conventional techniques and are not described in detail here.
[0038] S120: Determine the category corresponding to each unlabeled image sample based on its image features.
[0039] For example, in this embodiment, the image feature is the image feature corresponding to an object in an unlabeled image sample, which may specifically include: color features, texture features, frame features, etc. This application embodiment does not specifically limit this. For vehicle classification scenarios, the image feature may include: vehicle taillight position, vehicle color features, vehicle brand features, etc.
[0040] For example, in this embodiment, suppose that only the first category group has been obtained before step S120 is executed. The first category group may include at least one category corresponding to the labeled image samples, where the labeled image samples are the samples used to train the trained classification model. For example, for vehicle brand classification, the categories corresponding to the labeled image samples could be the vehicle brands of the image samples obtained so far, such as the common brands Volkswagen and Audi. Under this premise, how step S120 determines the category corresponding to each unlabeled image sample based on its image features is described in the example of Figure 3 below and is not elaborated here. The process of training a classification model using labeled image samples is likewise described in an example below.
[0041] For example, in this embodiment, suppose that a first category group and a second category group have both been obtained before step S120 is executed. The second category group may contain at least one category other than the categories corresponding to the labeled image samples. Taking vehicle brand classification as an example again, the categories in the second category group may be categories not present in the first category group, such as newly launched vehicle brands. How to determine the category of each unlabeled image sample based on its image features, given the obtained first and second category groups, is described in the example of Figure 4 below and is not elaborated here.
[0042] S130: If at least two categories have been identified and are in the same target category group, then for each category in the target category group that does not meet the following condition: the number of unlabeled image samples belonging to the category is less than the sample selection amount corresponding to the category, adjust the sample selection amount corresponding to the category and the sample selection amount of other categories in the target category group that meet the condition based on the sample selection amount corresponding to the category and the number of unlabeled image samples belonging to the category.
[0043] For example, in this embodiment, the target category group includes at least one category. Here, the target category group can be the first category group or the second category group, and this embodiment does not specifically limit it. The first category group includes at least one category corresponding to the labeled image samples, where the labeled image samples are samples used by the trained classification model during training. The second category group includes at least one category other than the categories corresponding to the labeled image samples.
[0044] For example, in this embodiment, it is necessary to determine the sample selection quantity corresponding to each category in the target category group in advance based on the total sample selection quantity. The total sample selection quantity is less than or equal to the total number of unlabeled image samples. A method to determine the sample selection quantity corresponding to each category in the target category group can be to distribute the total sample selection quantity evenly among the categories in the target category group. For example, when the total sample selection quantity is 200 and there are 4 categories in the target category group, then the sample selection quantity corresponding to each category in the target category group is 50.
[0045] For example, in this embodiment, the method for determining the sample selection quantity corresponding to each category in the target category group can also be as follows: the total sample selection quantity is allocated to each category in the target category group according to the proportion of the number of unlabeled image samples belonging to each category in the target category group. For example, when the total sample selection quantity is 300 and there are 4 categories in the target category group, the above 4 categories are category A, category B, category C and category D. Among them, the number of unlabeled image samples belonging to category A is 100, the number of unlabeled image samples belonging to category B is 150, the number of unlabeled image samples belonging to category C is 150, and the number of unlabeled image samples belonging to category D is 200. Then the proportion of the number of unlabeled image samples belonging to each category is 2:3:3:4. Then the sample selection quantity corresponding to category A in the target category group is 50, the sample selection quantity corresponding to category B is 75, the sample selection quantity corresponding to category C is 75, and the sample selection quantity corresponding to category D is 100.
[0046] It should be noted that when the sample selection quantity computed for a category is not an integer, an integer value (for example, one obtained by rounding) can be used as the sample selection quantity for that category.
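As an illustrative sketch only, the two allocation strategies described above (even distribution and distribution in proportion to the number of unlabeled image samples per category) might be written as follows. The function names are hypothetical, and the use of floor division and round() to obtain integers is an assumption, since the embodiment does not prescribe how the integer is obtained.

```python
def allocate_evenly(total, categories):
    """Split `total` evenly across the categories (floor division, an assumption)."""
    share = total // len(categories)
    return {c: share for c in categories}


def allocate_proportionally(total, counts):
    """Split `total` in proportion to each category's unlabeled-sample count."""
    n_all = sum(counts.values())
    # round() is one possible way to obtain an integer quota (an assumption).
    return {c: round(total * n / n_all) for c, n in counts.items()}


# The two worked examples from the text.
even = allocate_evenly(200, ["A", "B", "C", "D"])
prop = allocate_proportionally(300, {"A": 100, "B": 150, "C": 150, "D": 200})
print(even)
print(prop)
```

With the figures used in the text, the first call gives 50 per category and the second gives 50, 75, 75 and 100 for categories A to D. Note that rounding each quota independently may, for other inputs, make the quotas sum to slightly more or less than the total.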
[0047] In step S130, for each category in the target category group that does not meet the condition, the sample selection quantity for that category, and the sample selection quantities for the other categories in the target category group that meet the condition, are adjusted based on the sample selection quantity for that category and the number of unlabeled image samples belonging to that category. This adjustment is described in the example of Figure 5 below and is not elaborated here.
[0048] S140: Based on the adjusted sample selection quantity corresponding to each category in the target category group, select the target unlabeled image sample corresponding to the sample selection quantity from all unlabeled image samples belonging to that category.
[0049] For example, when the target category group is the first category group, the target unlabeled image samples corresponding to the sample selection quantity are mined in this step S140. This can be done through random mining, mining based on the sample uncertainty set for each unlabeled image sample in each category, etc., and this embodiment is not specifically limited to these methods. The specific process of mining based on the sample uncertainty set for each unlabeled image sample in each category is described in the following embodiments and will not be repeated here. Here, the sample uncertainty is used to represent the classification performance of the classification model on the unlabeled image samples.
[0050] For example, in this embodiment, when the target category group is the second category group, the target unlabeled image samples corresponding to the sample selection quantity are mined in this step S140. This can be done through random mining, mining based on the kmeans++ algorithm, etc., and this embodiment is not specifically limited. Here, mining based on the kmeans++ algorithm is a conventional technique and will not be described in detail.
[0051] For example, in this embodiment, when the target category group is the second category group, the target unlabeled image samples corresponding to the sample selection quantity in step S140 can also be mined using a custom mining method. For details on how to use a custom mining method, please refer to the following embodiments, which will not be elaborated here.
[0052] This completes the process shown in Figure 1.
[0053] As can be seen from the process shown in Figure 1, in this embodiment, based on the category corresponding to each unlabeled image sample and the sample selection amount corresponding to each category, the sample selection amount of each category in the target category group that does not meet the condition, and the sample selection amounts of the other categories in the target category group that meet the condition, are adjusted. Target unlabeled image samples are then selected, according to the adjusted sample selection amount of each category in the target category group, from all unlabeled image samples belonging to that category. In this way, even unlabeled rare samples (collectively referred to as hard examples) can be discovered.
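As a minimal sketch of the random-mining option mentioned for step S140, the selection might look as follows, assuming the per-category sample selection quantities have already been adjusted. The function name, the sample identifiers and the fixed seed are hypothetical and serve illustration only.

```python
import random


def select_randomly(samples_by_category, quotas, seed=0):
    """Randomly pick quotas[c] unlabeled samples from each category c."""
    rng = random.Random(seed)
    selected = {}
    for category, samples in samples_by_category.items():
        # min() is only a safety net for this sketch, in case a quota
        # exceeds the number of samples available in the category.
        k = min(quotas[category], len(samples))
        selected[category] = rng.sample(samples, k)
    return selected


# Hypothetical pool: 10 samples in category A, 3 samples in category B.
pool = {"A": [f"a{i}" for i in range(10)], "B": [f"b{i}" for i in range(3)]}
picked = select_randomly(pool, {"A": 4, "B": 2})
print({c: len(v) for c, v in picked.items()})
```

Mining based on sample uncertainty or on the kmeans++ algorithm would replace the rng.sample call with a different ranking or clustering step.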
[0054] The following describes how, in step S110, features are extracted from unlabeled image samples based on a neural network model (taking a classification model as an example) to obtain the image features of the unlabeled image samples:
[0055] See Figure 2, which is a flowchart of the feature extraction process provided in an embodiment of this application. As shown in Figure 2, the process may include the following steps:
[0056] S210: Input the unlabeled image samples into the trained classification model so that the feature extraction layer in the classification model can extract features from the unlabeled image samples and output the image features of the unlabeled image samples.
[0057] S220: Obtain the image features of the unlabeled image samples output by the classification model.
[0058] For example, in this embodiment, the classification model may include a feature extraction layer, which is used to extract features from the input unlabeled image samples. In this embodiment, the feature extraction layer can be implemented through a multilayer perceptron mechanism, a self-attention mechanism, etc., and this embodiment does not specifically limit the above-mentioned feature extraction layer.
[0059] Optionally, the above classification model, in addition to the feature extraction layer, also includes a classification layer. This classification layer is used to determine the confidence level of an unlabeled image sample belonging to each category based on the image features of the unlabeled image samples extracted by the feature extraction layer. In this embodiment, the classification layer can be implemented using a fully connected layer, Arcface, etc., and this application embodiment does not specifically limit the classification layer. How to train the classification model will be described below and will not be elaborated here.
[0060] This completes the process shown in Figure 2.
[0061] The process shown in Figure 2 demonstrates how to extract features from each unlabeled image sample to obtain its image features.
[0062] The following describes how to train a classification model:
[0063] In this embodiment, the training method for the above classification model is as follows:
[0064] Step a: Obtain labeled image samples.
[0065] For example, the above-mentioned labeled image samples can be data from an existing dataset (e.g., the ImageNet dataset), or they can be images collected according to a specific task and then labeled. This application embodiment does not specifically limit the above-mentioned labeled image samples.
[0066] For example, for a vehicle classification task, the existing dataset mentioned above could be the BIT vehicle dataset. For images collected according to the specific task, the images captured from the camera are manually labeled and stored locally, and can be directly retrieved from the local machine when needed.
[0067] As an optional implementation of this application, in order to further improve the model's generalization ability and reduce the computational cost of each training process, the labeled image samples can be randomly sampled. Specifically, at least two categories of labeled image samples can be randomly obtained each time, with each category including at least two labeled image samples. This application does not impose specific limitations on the number of randomly sampled categories or the number of labeled image samples in each category; these can be determined according to actual circumstances.
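The random sampling described above could be sketched as follows. This is an assumption-laden illustration rather than the applicant's implementation: the function name, the default counts, the placeholder file names and the hypothetical third brand "BrandC" are all introduced here for the example.

```python
import random


def sample_batch(labeled, n_categories=2, n_per_category=2, seed=0):
    """Randomly draw categories, then labeled samples within each drawn category."""
    rng = random.Random(seed)
    # Only categories holding enough labeled samples are eligible.
    eligible = [c for c, items in labeled.items() if len(items) >= n_per_category]
    chosen = rng.sample(eligible, n_categories)
    return {c: rng.sample(labeled[c], n_per_category) for c in chosen}


# Hypothetical labeled pool keyed by vehicle brand.
labeled = {
    "Volkswagen": ["vw_1.jpg", "vw_2.jpg", "vw_3.jpg"],
    "Audi": ["audi_1.jpg", "audi_2.jpg"],
    "BrandC": ["c_1.jpg", "c_2.jpg", "c_3.jpg", "c_4.jpg"],
}
batch = sample_batch(labeled)
print(batch)
```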
[0068] As another optional implementation of this application, in order to enrich the data, the labeled image samples obtained by random sampling can also be subjected to data augmentation. The data augmentation can include: upsampling, downsampling, color jittering, random noise, etc. The embodiments of this application do not specifically limit the above data augmentation.
[0069] In this embodiment of the application, in order to further enhance the model's ability to discover new category samples, data augmentation methods such as Mixup or Cutmix, which blend two labeled image samples, can be used. Specifically, the blending can be performed by multiplying the pixel value of each pixel in the first labeled image sample by 0.5, multiplying the pixel value of each pixel in the second labeled image sample by 0.5, and then adding the results to form a new image sample.
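The equal-weight blending just described can be sketched in a few lines. The example assumes two single-channel images of identical size represented as nested lists; the function name is hypothetical, and how the label of the blended sample is formed is not covered here.

```python
def blend_equal(image_a, image_b):
    """Blend two same-sized images pixel by pixel with weights 0.5 and 0.5."""
    same_rows = len(image_a) == len(image_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(image_a, image_b)):
        raise ValueError("images must have the same shape")
    return [
        [0.5 * pa + 0.5 * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


first = [[0, 100], [200, 50]]
second = [[100, 100], [0, 150]]
mixed = blend_equal(first, second)
print(mixed)  # [[50.0, 100.0], [100.0, 100.0]]
```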
[0070] Step b: Input the labeled image samples into the neural network model. The neural network model performs feature extraction and classification on the labeled image samples to obtain the image features of the labeled image samples and the confidence scores of the labeled image samples belonging to each category in the first category group. The first category group contains at least one category corresponding to the labeled image samples.
[0071] For example, the neural network model described above may include a feature extraction layer and a classification layer. The specific implementation of the feature extraction layer and classification layer can be found in the relevant descriptions of the feature extraction layer and classification layer in the above classification model, and will not be repeated here.
[0072] Step c: Calculate the loss value based on the confidence level of the labeled image sample belonging to each category in the first category group.
[0073] For example, the above loss value can be the cross-entropy loss value, and the embodiments of this application do not specifically limit the loss value.
[0074] When the loss value is the cross-entropy loss value, for each labeled image sample, the loss value of the labeled image sample is calculated based on the confidence level of the labeled image sample belonging to each category in the first category group. Specifically, it can be done as follows:
[0075] $H(X) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$
[0076] Where H(X) represents the loss value; p(x_i) represents the labeled (true) probability that the labeled image sample belongs to the i-th category; q(x_i) represents the confidence level, output by the neural network model, that the labeled image sample belongs to the i-th category; n represents the number of categories in the first category group.
[0077] The average of the loss values of multiple labeled image samples is used as the final loss value.
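As a hedged illustration, the per-sample cross-entropy and its average over several labeled image samples might be computed as below. The sketch assumes the standard cross-entropy form H = -sum_i p_i * log(q_i), with p taken as a one-hot label vector and q as the confidences output by the model; the function names and the small clamp eps are assumptions added for numerical safety.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """H = -sum_i p_i * log(q_i) for one sample; q_i is clamped to at least eps."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


def mean_cross_entropy(targets, predictions):
    """Average the per-sample loss values to obtain the final loss value."""
    losses = [cross_entropy(p, q) for p, q in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Two labeled samples, three categories in the first category group.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = mean_cross_entropy(targets, predictions)
print(loss)
```

With one-hot labels, each per-sample loss reduces to the negative logarithm of the confidence assigned to the true category.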
[0078] Step d: Adjust the parameters of the neural network model based on the loss value and using the backpropagation method until the set parameter conditions are met, and obtain the classification model.
[0079] For example, the above-mentioned adjustment of the parameters of the neural network model using the backpropagation method can specifically be to use optimizers such as SGD and Adam to update the parameters of the neural network model. The embodiments of this application do not specifically limit the optimizer, but the specific adjustment method is a conventional technology and will not be described in detail here.
[0080] The above-mentioned parameter setting conditions can be that the number of iterations reaches a preset number (e.g., 200 times) or that the loss value is less than a preset threshold (e.g., 0.08). This application embodiment does not specifically limit the above-mentioned parameter setting conditions, which can be determined according to actual circumstances.
[0081] As an optional implementation of this application, when mining category samples of the second category group, a contrastive loss value can also be calculated based on the image features of the labeled image samples. This contrastive loss value can be a triplet loss, specifically calculated as follows:
[0082] L=max(d(a,p)-d(a,n)+margin,0)
[0083] Where 'a' represents the anchor example, 'p' represents the positive example (i.e., a labeled image sample of the same category as 'a'), 'n' represents the negative example (i.e., a labeled image sample of a different category from 'a'), and 'margin' is a constant greater than 0. Similarity calculation between samples is achieved by optimizing the distances between the anchor example and the positive example, and between the anchor example and the negative example.
[0084] In this embodiment, for any labeled image sample, the labeled image sample is designated as an anchor example, labeled image samples of the same category as the labeled image sample are designated as positive examples, and labeled image samples of different categories are designated as negative examples. The contrastive loss value of the labeled image sample is calculated using the formula described above. The average of the contrastive loss values of multiple labeled image samples is taken as the final contrastive loss value. The parameters of the neural network model are optimized based on the aforementioned loss value and the contrastive loss value.
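The triplet loss L = max(d(a,p) - d(a,n) + margin, 0) and its averaging can be sketched as follows. The text does not specify the distance function d, so Euclidean distance between feature vectors is assumed here; the function names and the margin value 0.2 are likewise illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors (assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(d(a, p) - d(a, n) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def mean_triplet_loss(triplets, margin=0.2):
    """Average the contrastive loss values of several (a, p, n) triplets."""
    losses = [triplet_loss(a, p, n, margin) for a, p, n in triplets]
    return sum(losses) / len(losses)


# d(a, p) = 1 and d(a, n) = 5: the negative is already far enough away.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# d(a, p) = 5 and d(a, n) = 1: the triplet violates the margin.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
print(easy, hard)
```

The loss is zero once the negative example is farther from the anchor than the positive example by at least the margin, so only violating triplets contribute to the parameter update.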
[0085] The embodiments of this application optimize the parameters of the neural network model by using both loss value and contrastive loss value, which can simultaneously take into account existing categories and new categories, thereby improving the classification performance of the model.
[0086] As another optional implementation of this application, during the training process, the image features of each category in the first category group can also be updated. For example, the image features of each category in the first category group can be updated according to the exponential moving average (EMA) method. Specifically:
[0087] EMA=α*Pricetoday+(1-α)*EMAyesterday
[0088] Where EMA represents the updated image features; EMAyesterday represents the image features before the update; Pricetoday represents the image features output by the classification model; and α is a user-defined constant that can take values between 0.9 and 0.99.
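A minimal sketch of this update, applied element-wise to a feature vector, is given below. It follows the formula exactly as written, so α weights the newly output features; the function and variable names are hypothetical.

```python
def ema_update(previous, current, alpha):
    """EMA = alpha * current + (1 - alpha) * previous, applied element-wise."""
    return [alpha * c + (1 - alpha) * p for p, c in zip(previous, current)]


class_feature = [1.0, 2.0]   # image features of a category before the update
new_feature = [3.0, 4.0]     # image features output by the classification model
updated = ema_update(class_feature, new_feature, alpha=0.9)
print(updated)
```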
[0089] The above are merely examples of training classification models, and this application does not limit this.
[0090] The following describes, with reference to Figure 3 and Figure 4, how step S120 determines the category of each unlabeled image sample based on its image features:
[0091] When the electronic device has currently obtained only the first category group, the method for determining the category corresponding to each unlabeled image sample based on its image features is shown in Figure 3, which is a flowchart for determining the category of unlabeled image samples provided in the embodiments of this application.
[0092] As shown in Figure 3, the process may include the following steps:
[0093] S310: For each unlabeled image sample, based on the image features of the unlabeled image sample, determine the confidence level of the unlabeled image sample belonging to each category in the obtained first category group; the first category group includes at least one category corresponding to the labeled image sample, and the labeled image sample is the sample used by the trained classification model during the training process.
[0094] For example, the above-mentioned determination of the confidence level of an unlabeled image sample belonging to each category in the obtained first category group can specifically be to input the image features of the unlabeled image sample into the classification model, and the classification layer of the classification model determines the confidence level of the unlabeled image sample belonging to each category in the first category group.
[0095] S320: Based on the confidence level that the unlabeled image sample belongs to each category in the first category group, determine the category corresponding to the unlabeled image sample; the target category group is the first category group.
[0096] For example, after obtaining the confidence level of the unlabeled image sample belonging to each category in the first category group, the category with the maximum confidence level is determined as the category of the unlabeled image sample.
[0097] This concludes the description of the process shown in Figure 3.
[0098] The process shown in Figure 3 implements the determination of the category corresponding to each unlabeled image sample based on its image features when the electronic device has currently obtained only the first category group.
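Steps S310 and S320 reduce to picking the category with the highest confidence. The following minimal sketch assumes that the confidences output by the classification layer are already available as a dictionary; the category names and the function name `category_by_confidence` are hypothetical.

```python
def category_by_confidence(confidences):
    """Return the category whose confidence is the largest (step S320)."""
    if not confidences:
        raise ValueError("at least one category is required")
    return max(confidences, key=confidences.get)


# Hypothetical confidences for one unlabeled image sample over the first category group.
confidences = {"sedan": 0.15, "truck": 0.70, "bus": 0.15}
predicted = category_by_confidence(confidences)  # "truck"
```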
[0099] When the electronic device has currently obtained both the first category group and the second category group, the method for determining the category corresponding to each unlabeled image sample based on its image features is shown in Figure 4, which is another flowchart for determining the category of unlabeled image samples provided in an embodiment of this application.
[0100] As shown in Figure 4, the process may include the following steps:
[0101] S410: For each unlabeled image sample, calculate the distance between the image features of the unlabeled image sample and the image features of each category in the first category group, and calculate the distance between the image features of the unlabeled image sample and the image features of each category in the second category group; the first category group contains at least one category corresponding to the labeled image sample, and the second category group contains at least one category other than the category corresponding to the labeled image sample, and the labeled image sample is the sample used by the trained classification model during the training process.
[0102] For example, the second category group mentioned above can contain multiple categories, each corresponding to an image feature. The image feature of a category can be obtained by inputting an image sample example of that category into the classification model, where the feature extraction layer of the classification model extracts features from the image sample example of that new category.
[0103] For example, in this embodiment, the above distance can be cosine distance, Euclidean distance, Hamming distance, etc. This application embodiment does not specifically limit the distance, but determines it according to the actual situation.
[0104] In this embodiment of the application, to calculate the distance between the image features of the unlabeled image sample and the image features of each category in the first category group and in the second category group, the type of distance to be used is determined first, and the distance is then calculated using the method corresponding to that type. The calculation methods for cosine distance, Euclidean distance and Hamming distance are all conventional and will not be described in detail here.
[0105] S420: Determine the category corresponding to the unlabeled image sample based on the distance between the image features of the unlabeled image sample and the image features of each category in each category group.
[0106] For example, after obtaining the distances between the image features of the unlabeled image sample and the image features of each category in each category group, if the distance between the image features of the unlabeled image sample and the image features of a category in the second category group is less than the distance between the image features of the unlabeled image sample and the image features of each category in the first category group, then a category is selected from the second category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest. Otherwise, a category is selected from the first category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest, or the confidence level corresponding to the selected category is the largest.
[0107] This concludes the description of the process shown in Figure 4.
[0108] The process shown in Figure 4 implements the determination of the category corresponding to each unlabeled image sample based on its image features when the electronic device has currently obtained both the first category group and the second category group.
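Steps S410 and S420 can be sketched as a nearest-category lookup across both groups. The sketch below assumes Euclidean distance (one of the options listed above) and one stored feature vector per category; the function names, category names and feature values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_category(feature, first_group, second_group):
    """Return (group, category) for the category feature nearest to `feature`.

    A second-group category is chosen only if it is strictly closer than every
    first-group category; otherwise the nearest first-group category is chosen.
    """
    first_cat = min(first_group, key=lambda c: euclidean(feature, first_group[c]))
    second_cat = min(second_group, key=lambda c: euclidean(feature, second_group[c]))
    if euclidean(feature, second_group[second_cat]) < euclidean(feature, first_group[first_cat]):
        return "second", second_cat
    return "first", first_cat


# Hypothetical per-category features.
first_group = {"sedan": [0.0, 0.0], "truck": [4.0, 0.0]}
second_group = {"tractor": [0.0, 5.0]}

near_new = assign_category([0.5, 4.0], first_group, second_group)   # ("second", "tractor")
near_known = assign_category([3.0, 0.5], first_group, second_group)  # ("first", "truck")
```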
[0109] The following describes how, in step S130, the sample selection quantity corresponding to the category is adjusted based on the sample selection quantity corresponding to the category and the number of unlabeled image samples belonging to the category, and the sample selection quantity of other categories in the target category group that meet the conditions are also adjusted:
[0110] See Figure 5, which is a flowchart of the sample selection quantity adjustment process provided in an embodiment of this application. As shown in Figure 5, the process may include the following steps:
[0111] S510: For each category in the target category group that does not meet the conditions, determine the difference N between the number of samples selected for that category and the number of unlabeled image samples belonging to that category.
[0112] S520: Adjust the sample selection quantity for this category to the number of unlabeled image samples belonging to this category.
[0113] S530: Based on the difference N, determine the sample selection quantity increment to be allocated to each of the other categories in the target category group that meet the conditions, so as to increase the sample selection quantity of each such category by its newly allocated increment. The sum of the sample selection quantity increments of the other categories that meet the conditions is the difference N.
[0114] For example, for categories in the target category group that do not meet the above conditions, the sample selection quantity for that category is adjusted to the number of unlabeled image samples belonging to that category. For instance, if the number of unlabeled image samples belonging to that category is 5 and the sample selection quantity for that category is 10, the sample selection quantity for that category is adjusted to 5.
[0115] For example, for other categories that meet the conditions, the sample selection quantity increment for each other category is determined based on the difference N. The specific determination method can be to distribute the difference N equally to each other category, or to distribute the difference N according to the proportion of the number of unlabeled image samples belonging to each other category. The specific distribution method is the same as the distribution method of the total sample selection quantity mentioned above, and will not be repeated here.
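The two ways of distributing the difference N can be sketched as follows. The text does not say how to handle a difference that does not divide evenly, so the rounding rules here (the first categories absorb the remainder; largest fractional part first) are assumptions, as are the function and category names.

```python
def split_equally(n, categories):
    """Split n as evenly as possible; the first categories absorb the remainder."""
    if not categories:
        raise ValueError("no category can receive the increment")
    base, extra = divmod(n, len(categories))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(categories)}


def split_proportionally(n, counts):
    """Split n in proportion to `counts`, rounding by largest fractional part."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    exact = {c: n * counts[c] / total for c in counts}
    shares = {c: int(exact[c]) for c in counts}
    leftover = n - sum(shares.values())
    # Hand the units lost to rounding to the categories with the largest fractions.
    by_fraction = sorted(counts, key=lambda c: exact[c] - shares[c], reverse=True)
    for c in by_fraction[:leftover]:
        shares[c] += 1
    return shares


equal = split_equally(5, ["truck", "bus"])                        # {"truck": 3, "bus": 2}
proportional = split_proportionally(5, {"truck": 30, "bus": 10})  # {"truck": 4, "bus": 1}
```

In both cases the increments sum to N, which matches the requirement stated for step S530.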
[0116] Once the sample selection quantity increment for each other category is determined, the sum of this increment and the sample selection quantity previously allocated to that category is used as the new sample selection quantity for that category. Then, it is determined whether every category in the target category group meets the above conditions. If not, adjustments are made again according to steps S510-S530 until all categories in the target category group meet the above conditions. If so, the subsequent steps are executed.
[0117] This concludes the description of the process shown in Figure 5.
[0118] The process shown in Figure 5 implements the adjustment of the sample selection quantity corresponding to a category based on that sample selection quantity and the number of unlabeled image samples belonging to the category, together with the adjustment of the sample selection quantities of the other categories in the target category group that meet the conditions.
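The iteration over steps S510-S530 can be sketched as below. The sketch assumes that a category needs adjusting when it has fewer unlabeled image samples than its sample selection quantity, that the difference is split equally, and that only categories with spare samples receive an increment (a choice made here so that the loop is guaranteed to stop). The function name `adjust_quotas` and all numbers are hypothetical.

```python
def adjust_quotas(quotas, available):
    """Cap each quota at the available sample count and reassign the shortfall."""
    quotas = dict(quotas)  # work on a copy; the caller's dictionary is untouched
    while True:
        short = [c for c in quotas if available[c] < quotas[c]]
        if not short:
            return quotas
        surplus = 0
        for c in short:
            surplus += quotas[c] - available[c]  # S510: difference N
            quotas[c] = available[c]             # S520: cap at the available count
        # S530: only categories that still have spare samples can take more.
        receivers = [c for c in quotas if available[c] > quotas[c]]
        if not receivers:
            return quotas  # every category is already exhausted
        base, extra = divmod(surplus, len(receivers))
        for i, c in enumerate(receivers):
            quotas[c] += base + (1 if i < extra else 0)


quotas = {"sedan": 10, "truck": 10, "bus": 10}
available = {"sedan": 5, "truck": 12, "bus": 40}
adjusted = adjust_quotas(quotas, available)  # {"sedan": 5, "truck": 12, "bus": 13}
```

In this run the first pass caps "sedan" at 5 and hands 3 and 2 extra samples to "truck" and "bus"; "truck" then exceeds its 12 available samples, so a second pass caps it and moves the remaining unit to "bus". The total selection quantity stays at 30.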
[0119] The following describes how, in step S140, the target unlabeled image sample corresponding to the adjusted sample selection quantity for each category in the target category group is selected from all unlabeled image samples belonging to that category:
[0120] As an optional implementation of this application, when the target category group is the first category group, the specific process of mining unlabeled image samples based on the sample uncertainty set for each unlabeled image sample in each category is as follows:
[0121] First, for each category in the target category group, obtain the sample uncertainty set for each unlabeled image sample belonging to that category; the sample uncertainty of an unlabeled image sample indicates how accurately the trained classification model classifies that unlabeled image sample.
[0122] For example, before performing this step, the sample uncertainty can be set for each unlabeled image sample belonging to each category in the target category group. Here, the higher the sample uncertainty, the lower the accuracy of the trained classification model in classifying the unlabeled image sample; the lower the sample uncertainty, the higher that accuracy.
[0123] As an example, for a category within a target category group, the sample uncertainty of each unlabeled image sample in that category can be represented by the entropy of the confidence that each unlabeled image sample belongs to any category within the target category group. This confidence entropy can be expressed by the following formula:
[0124] $H(p) = -\sum_{x} p(x) \log p(x)$
[0125] Wherein, H(p) is the entropy of the confidence of the unlabeled image sample; p(x) is the confidence of the unlabeled image sample belonging to each category in the target category group.
[0126] As another embodiment, the sample uncertainty can also be the mutual information of the confidence that each unlabeled image sample in the category belongs to each category in the target category group, or it can be the difference between the maximum confidence and the second maximum confidence that each unlabeled image sample in the category belongs to each category in the target category group. The embodiments of this application do not specifically limit the above-mentioned method for determining the sample uncertainty, and those skilled in the art can determine it according to the actual situation.
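Two of the uncertainty measures mentioned above can be sketched as follows, assuming the confidences of one unlabeled image sample are given as a list that sums to 1. The function names are hypothetical. Note that the two measures run in opposite directions: a larger entropy indicates a more uncertain sample, whereas a smaller gap between the two largest confidences indicates a more ambiguous one.

```python
import math


def entropy(confidences):
    """H(p) = -sum_x p(x) log p(x); zero-confidence terms contribute nothing."""
    return -sum(p * math.log(p) for p in confidences if p > 0.0)


def margin(confidences):
    """Difference between the largest and the second-largest confidence."""
    top, second = sorted(confidences, reverse=True)[:2]
    return top - second


ambiguous = entropy([0.5, 0.5])   # log(2), the maximum for two categories
certain = entropy([1.0, 0.0])     # 0: the model is fully confident
gap = margin([0.7, 0.2, 0.1])     # approximately 0.5
```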
[0127] Secondly, based on the adjusted sample selection quantity corresponding to the category and the sample uncertainty of each unlabeled image sample, the target unlabeled image sample is selected from all unlabeled image samples belonging to the category; the number of target unlabeled image samples corresponds to the adjusted sample selection quantity corresponding to the category.
[0128] For example, after obtaining the sample uncertainty of each unlabeled image sample in the category, for each category in the target category group, target unlabeled image samples are selected from all unlabeled image samples belonging to the category based on the adjusted sample selection quantity corresponding to the category and the sample uncertainty set for each unlabeled image sample in the category. For instance, the unlabeled image samples are ranked in descending order of sample uncertainty, and the top-ranked samples are selected until the sample selection quantity is reached.
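Selecting by descending uncertainty can be sketched in a few lines. The sketch assumes that a higher score means a more uncertain sample (as with entropy); the sample identifiers, scores and function name are hypothetical.

```python
def select_most_uncertain(uncertainty_by_sample, quota):
    """Return the ids of the `quota` samples with the highest uncertainty."""
    ranked = sorted(uncertainty_by_sample, key=uncertainty_by_sample.get, reverse=True)
    return ranked[:quota]


# Hypothetical uncertainty scores for the unlabeled image samples of one category.
scores = {"img_01": 0.10, "img_02": 0.65, "img_03": 0.40, "img_04": 0.69}
selected = select_most_uncertain(scores, quota=2)  # ["img_04", "img_02"]
```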
[0129] As an optional implementation of this application, when the target category group is the second category group, the specific process of mining unlabeled image samples based on the custom mining method is as follows:
[0130] In this embodiment of the application, for each category in the second category group, the complete set of unlabeled image samples contained in the category is O, and the subset selected from the category is A. The unlabeled image samples selected from the unlabeled image samples contained in the category should cover as many features as possible of the category. Specifically:
[0131]
[0132] Where D represents the distance between i and j, i represents an image feature in the full set O, j represents an image feature in the subset A, and $\|i - j\|_2$ represents the 2-norm of $i - j$.
[0133] The objective of selecting a corresponding number of unlabeled image samples from all unlabeled image samples included in a category is to minimize D. To achieve this minimization, greedy algorithms, genetic algorithms, etc., can be employed. This application does not specifically limit the method for obtaining the minimum D; those skilled in the art can determine the appropriate method based on the actual situation. Here, greedy algorithms and genetic algorithms are commonly used algorithms and will not be elaborated upon further.
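A greedy selection of this kind can be sketched as follows. The formula for D is not reproduced in the text, so the sketch assumes one plausible coverage objective: keep the largest distance from any feature in O to its nearest selected feature in A as small as possible. Under that assumption, farthest-point greedy selection is a common heuristic. The function names, the choice of the first sample as seed, and the feature values are all hypothetical.

```python
import math


def l2(a, b):
    """2-norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_cover(features, k):
    """Pick up to k indices so that every feature lies near a picked one."""
    if not features or k <= 0:
        return []
    chosen = [0]  # assumption: seed with the first sample
    nearest = [l2(f, features[0]) for f in features]  # distance to nearest pick
    while len(chosen) < min(k, len(features)):
        far = max(range(len(features)), key=nearest.__getitem__)
        if nearest[far] == 0.0:
            break  # every remaining sample duplicates an already chosen one
        chosen.append(far)
        nearest = [min(d, l2(f, features[far])) for d, f in zip(nearest, features)]
    return chosen


# Hypothetical features forming three loose clusters.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
picked = greedy_cover(features, k=3)  # [0, 4, 2]: one sample per cluster
```

Each step adds the sample that is currently worst covered, so the selected subset spreads across the category's feature space rather than clustering around similar images.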
[0134] In this embodiment of the application, when mining unlabeled image samples of new categories, unlabeled image samples belonging to each new category can be obtained through examples of new categories. Then, the sample selection quantity corresponding to each new category is determined, ensuring that image samples can be mined from each new category. When mining image samples from each new category, the selected unlabeled image samples should cover all features of that category as much as possible, ensuring the diversity of sample selection and better learning effect of new categories.
[0135] As an optional implementation of this application, after mining the target unlabeled image samples corresponding to the sample selection quantity, the above classification model is optimized based on the mined target unlabeled image samples. The optimization process is described below:
[0136] See Figure 6, which is a flowchart of the classification model optimization process provided in an embodiment of this application. As shown in Figure 6, the process may include the following steps:
[0137] S610: When a target unlabeled image sample is detected to have been labeled, the target unlabeled image sample is identified as a new labeled image sample.
[0138] S620: Optimize the classification model based on the new labeled image samples.
[0139] For example, after discovering the target unlabeled image sample, the target unlabeled image sample can be labeled. The labeling method for the target unlabeled image sample can be manual labeling. When the electronic device detects that the target unlabeled image sample carries labeling information, the target unlabeled image sample is identified as a new labeled image sample. The classification model is optimized based on the new labeled image sample together with the previous labeled image samples, or the classification model is optimized based solely on the new labeled image sample. The specific optimization method is the same as the training method of the classification model mentioned above, and will not be repeated here.
[0140] This concludes the description of the process shown in Figure 6.
[0141] The process shown in Figure 6 optimizes the classification model, resulting in better classification performance.
[0142] The methods provided in the embodiments of this application have been described above. The apparatus provided in the embodiments of this application is described below:
[0143] See Figure 7, which is a structural diagram of a device provided in an embodiment of this application. The device may include:
[0144] The feature extraction module is used to extract features from unlabeled image samples to obtain the image features of the unlabeled image samples;
[0145] The category determination module is used to determine the category of each unlabeled image sample based on its image features.
[0146] The sample selection quantity adjustment module is used to, if at least two determined categories are in the same target category group, for each category in the target category group for which the number of unlabeled image samples belonging to the category is less than the sample selection quantity corresponding to the category, adjust the sample selection quantity corresponding to the category based on that sample selection quantity and the number of unlabeled image samples belonging to the category, and adjust the sample selection quantities of the other categories in the target category group that meet the conditions.
[0147] The mining module is used to select target unlabeled image samples from all unlabeled image samples belonging to a target category based on the adjusted sample selection quantity corresponding to each category in the target category group.
[0148] As an optional implementation of this application, the above-mentioned category determination module is specifically used to: for each unlabeled image sample, determine the confidence level of the unlabeled image sample belonging to each category in the obtained first category group based on the image features of the unlabeled image sample; the first category group includes at least one category corresponding to the labeled image sample, and determine the category corresponding to the unlabeled image sample based on the confidence level of the unlabeled image sample belonging to each category in the first category group, wherein the labeled image sample is the sample used by the trained classification model during the training process;
[0149] The target category group is the first category group.
[0150] As an optional implementation of this application, the above determination of the confidence level of the unlabeled image sample belonging to each category in the first category group is performed under the premise that the electronic device has not yet obtained the second category group; the second category group includes at least one category other than the category corresponding to the labeled image sample.
[0151] As an optional implementation of this application embodiment, the above-mentioned category determination module is further used for:
[0152] For each unlabeled image sample, the distance between the image features of the unlabeled image sample and the image features of each category in the first category group is calculated, and the distance between the image features of the unlabeled image sample and the image features of each category in the second category group is also calculated. The first category group contains at least one category corresponding to the labeled image sample, and the second category group contains at least one category other than the category corresponding to the labeled image sample. The labeled image sample is the sample used by the trained classification model during the training process.
[0153] The category corresponding to the unlabeled image sample is determined based on the distance between the image features of the unlabeled image sample and the image features of each category in each category group.
[0154] As an optional implementation of this application, the above-mentioned determination of the category corresponding to the unlabeled image sample based on the distance between the image features of the unlabeled image sample and the image features of each category in each category group includes:
[0155] If the distance between the image features of the unlabeled image sample and the image features of a category in the second category group is less than the distance between the image features of the unlabeled image sample and the image features of each category in the first category group, then a category is selected from the second category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest; otherwise, a category is selected from the first category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest.
[0156] The target category group is either the second category group or the first category group.
[0157] As an optional embodiment of this application, the target category group is the first category group; the above-mentioned mining module is specifically used for:
[0158] For each category in the target category group, obtain the sample uncertainty set for each unlabeled image sample belonging to that category; the sample uncertainty of an unlabeled image sample indicates how accurately the trained classification model classifies that unlabeled image sample.
[0159] Based on the adjusted sample selection quantity corresponding to the category and the sample uncertainty of each unlabeled image sample, the target unlabeled image sample is selected from all unlabeled image samples belonging to the category; the number of target unlabeled image samples corresponds to the adjusted sample selection quantity corresponding to the category.
[0160] As an optional implementation of this application, the above-mentioned sample selection quantity adjustment module is specifically used for:
[0161] For each category in the target category group that does not meet the conditions, determine the difference N between the number of samples selected for that category and the number of unlabeled image samples belonging to that category;
[0162] Adjust the sample selection quantity for this category to the number of unlabeled image samples belonging to this category;
[0163] Based on the difference N, determine the sample selection quantity increment to be allocated to each of the other categories in the target category group that meet the conditions, so as to increase the sample selection quantity of each such category by its newly allocated increment. The sum of the sample selection quantity increments of the other categories that meet the conditions is the difference N.
[0164] As an optional implementation of this application embodiment, the above-mentioned feature extraction module is specifically used for:
[0165] Unlabeled image samples are input into a trained classification model, and the feature extraction layer in the classification model extracts features from the unlabeled image samples and outputs the image features of the unlabeled image samples.
[0166] Obtain the image features of unlabeled image samples output by the classification model.
[0167] Optionally, the above data mining apparatus further includes:
[0168] The sample acquisition module is used to acquire labeled image samples;
[0169] The input module is used to input the labeled image samples into the neural network model, and the neural network model performs feature extraction and classification on the labeled image samples to obtain the image features of the labeled image samples and the confidence of the labeled image samples belonging to each category in the first category group. The first category group contains at least one category corresponding to the labeled image samples.
[0170] The loss calculation module is used to calculate the loss value based on the confidence level of the labeled image sample belonging to each category in the first category group;
[0171] The parameter tuning module is used to adjust the parameters of the neural network model based on the loss value and using the backpropagation method until the set parameter conditions are met, thus obtaining the classification model.
[0172] As an optional implementation of this application embodiment, the above-mentioned data mining apparatus further includes:
[0173] The detection module is used to identify unlabeled target image samples as new labeled image samples when they are detected to be labeled.
[0174] The optimization module is used to optimize the classification model based on new labeled image samples.
[0175] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0176] This concludes the structural description of the device shown in Figure 7.
[0177] Correspondingly, embodiments of this application also provide a hardware structure diagram of the device shown in Figure 7. As shown in Figure 8, the electronic device may be a device that implements the above-described method, and its hardware structure includes a processor and a memory.
[0178] Among them, the memory is used to store machine-executable instructions;
[0179] A processor is used to read and execute machine-executable instructions stored in memory to implement the corresponding data mining method embodiment shown above.
[0180] As one embodiment, the memory can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For example, the memory can be volatile memory, non-volatile memory, or similar storage media. Specifically, the memory can be RAM (Random Access Memory), flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.
[0181] This concludes the description of the electronic device shown in Figure 8.
[0182] The above are merely preferred embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A data mining method, characterized in that, The method is applied to an electronic device, and the method includes: Feature extraction is performed on the unlabeled image samples to obtain the image features of the unlabeled image samples; Based on the image features of each unlabeled image sample, determine the category corresponding to each unlabeled image sample; If at least two categories have been identified and are in the same target category group, then for each category in the target category group that meets the following condition: the number of unlabeled image samples belonging to the category is less than the sample selection amount corresponding to the category, the sample selection amount corresponding to the category is adjusted to the number of unlabeled image samples belonging to the category, and based on the difference between the sample selection amount corresponding to the category and the number, the sample selection amount increment allocated to other categories in the target category group that do not meet the condition is determined, so as to adjust the sample selection amount of other categories by the newly allocated sample selection amount increment; Based on the adjusted sample selection quantity corresponding to each category in the target category group, select the target unlabeled image sample corresponding to the sample selection quantity from all unlabeled image samples belonging to that category.
2. The method according to claim 1, characterized in that, The step of determining the category corresponding to each unlabeled image sample based on its image features includes: For each unlabeled image sample, the confidence level of the unlabeled image sample belonging to each category in the obtained first category group is determined based on the image features of the unlabeled image sample; the first category group includes at least one category corresponding to the labeled image sample, and the labeled image sample is the sample used by the trained classification model during the training process; Based on the confidence level that the unlabeled image sample belongs to each category in the first category group, the category corresponding to the unlabeled image sample is determined. The target category group is the first category group.
3. The method according to claim 2, characterized in that, The determination of the confidence level of the unlabeled image sample belonging to each category in the first category group is performed under the premise that the electronic device has not yet obtained the second category group; the second category group contains at least one category other than the category corresponding to the labeled image sample.
4. The method according to claim 1, characterized in that, The step of determining the category corresponding to each unlabeled image sample based on its image features includes: For each unlabeled image sample, the distance between the image features of the unlabeled image sample and the image features of each category in the first category group is calculated, and the distance between the image features of the unlabeled image sample and the image features of each category in the second category group is also calculated. The first category group contains at least one category corresponding to the labeled image sample, and the second category group contains at least one category other than the category corresponding to the labeled image sample. The labeled image sample is the sample used by the trained classification model during the training process. The category corresponding to the unlabeled image sample is determined based on the distance between the image features of the unlabeled image sample and the image features of each category in each category group.
5. The method according to claim 4, characterized in that, The step of determining the category corresponding to the unlabeled image sample based on the distance between the image features of the unlabeled image sample and the image features of each category in each category group includes: If the distance between the image features of the unlabeled image sample and the image features of a category in the second category group is less than the distance between the image features of the unlabeled image sample and the image features of each category in the first category group, then a category is selected from the second category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest; otherwise, a category is selected from the first category group as the category corresponding to the unlabeled image sample, wherein the distance between the image features corresponding to the selected category and the image features of the unlabeled image sample is the smallest. The target category group is either the second category group or the first category group.
6. The method according to any one of claims 2 to 5, characterized in that, The target category group is the first category group; The step of selecting target unlabeled image samples corresponding to the adjusted sample selection quantity for each category in the target category group, from all unlabeled image samples belonging to that category, includes: For each category in the target category group, the sample uncertainty of each unlabeled image sample belonging to that category is obtained; the sample uncertainty of the unlabeled image sample is used to indicate the accuracy of the classification model trained on the unlabeled image sample for classification. Based on the adjusted sample selection quantity corresponding to the category and the sample uncertainty of each unlabeled image sample, the target unlabeled image sample is selected from all unlabeled image samples belonging to the category; the number of the target unlabeled image samples corresponds to the adjusted sample selection quantity corresponding to the category.
7. The method according to claim 1, characterized in that the sum of the sample selection quantity increments allocated to the other categories equals the difference N.
8. The method according to claim 1, characterized in that the step of extracting features from unlabeled image samples to obtain image features of the unlabeled image samples includes: inputting the unlabeled image samples into a trained classification model, so that a feature extraction layer in the classification model extracts features from the unlabeled image samples and outputs the image features of the unlabeled image samples; and obtaining the image features of the unlabeled image samples output by the classification model.
9. The method according to claim 8, characterized in that the classification model is trained through the following steps: obtaining labeled image samples; inputting the labeled image samples into a neural network model, which performs feature extraction and classification on the labeled image samples to obtain the image features of the labeled image samples and the confidence of the labeled image samples belonging to each category in a first category group, wherein the first category group contains at least one category corresponding to the labeled image samples; calculating a loss value based on the confidence of the labeled image samples belonging to each category in the first category group; and adjusting the parameters of the neural network model based on the loss value by backpropagation until a set parameter condition is met, thereby obtaining the classification model.
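The training loop in claim 9 can be sketched in a few lines under strong simplifying assumptions. A single linear layer with softmax stands in for the neural network, the loss is assumed to be cross-entropy, the gradient is written out by hand rather than computed by a general backpropagation engine, and the "set parameter condition" is assumed to be a loss threshold or an epoch limit. None of these choices is stated in the claim, and the names `softmax` and `train` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into per-category confidences that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(samples, labels, num_classes, lr=0.5, max_epochs=200, loss_threshold=0.05):
    """Fit a linear softmax classifier; returns (weights, loss history)."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    history = []
    for _ in range(max_epochs):
        total_loss = 0.0
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, y in zip(samples, labels):
            logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
            conf = softmax(logits)
            # Cross-entropy loss from the confidence of the true category.
            total_loss += -math.log(conf[y])
            # Gradient of the loss with respect to each weight.
            for k in range(num_classes):
                err = conf[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    grads[k][j] += err * x[j]
        mean_loss = total_loss / len(samples)
        history.append(mean_loss)
        if mean_loss < loss_threshold:  # assumed stopping condition
            break
        # Gradient-descent update of the parameters.
        for k in range(num_classes):
            for j in range(dim):
                weights[k][j] -= lr * grads[k][j] / len(samples)
    return weights, history
```

In practice a deep learning framework would supply the feature extraction layers and automatic differentiation; the sketch only shows how confidence, loss, and parameter updates relate.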
10. The method according to claim 9, characterized in that, after selecting the target unlabeled image samples corresponding to the sample selection quantity, the method further includes: when a target unlabeled image sample is detected to have been labeled, determining that target unlabeled image sample as a new labeled image sample; and optimizing the classification model based on the new labeled image samples.
11. A data mining device, characterized in that the device includes: a feature extraction module, used to extract features from unlabeled image samples to obtain the image features of the unlabeled image samples; a category determination module, used to determine the category corresponding to each unlabeled image sample based on its image features; a sample selection quantity adjustment module, used to, if at least two of the determined categories are in the same target category group, for each category in the target category group: if the number of unlabeled image samples belonging to the category is less than the sample selection quantity of the category, adjust the sample selection quantity of the category to the number of unlabeled image samples belonging to the category; determine, based on the difference between the sample selection quantity of the category and the number of unlabeled image samples belonging to the category, the sample selection quantity increments to be allocated to the other categories in the target category group that do not meet this condition; and adjust the sample selection quantities of the other categories according to the newly allocated sample selection quantity increments; and a mining module, used to select target unlabeled image samples corresponding to the adjusted sample selection quantity from all unlabeled image samples belonging to the target category group.
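The adjustment performed by the sample selection quantity adjustment module can be sketched as follows. The claims state only that the increments allocated to the other categories sum to the difference N; the even split with the remainder going to the earliest categories is an assumption, as are the function name `adjust_quotas` and the dictionary inputs. The sketch does not cap an increased quota at the number of samples available in the receiving category.

```python
def adjust_quotas(quotas, available):
    """Adjust per-category selection quantities within one target group.

    quotas:    dict category -> initial sample selection quantity.
    available: dict category -> number of unlabeled samples in that category.
    Returns a new dict with the adjusted quantities.
    """
    adjusted = dict(quotas)
    # Categories that have fewer unlabeled samples than their quota.
    short = [c for c in quotas if available[c] < quotas[c]]
    # Categories that do not meet that condition receive the increments.
    others = [c for c in quotas if c not in short]
    for cat in short:
        n = quotas[cat] - available[cat]  # the difference N for this category
        adjusted[cat] = available[cat]
        if not others:
            continue  # no category can absorb the difference
        # Assumed allocation rule: split N evenly, remainder to the first ones.
        base, extra = divmod(n, len(others))
        for i, other in enumerate(others):
            adjusted[other] += base + (1 if i < extra else 0)
    return adjusted
```

Because each difference N is fully redistributed whenever at least one other category exists, the total number of selected samples across the group stays the same, so rare categories contribute everything they have while the labeling budget is preserved.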
12. An electronic device, characterized in that the electronic device includes a machine-readable storage medium and a processor; the machine-readable storage medium stores machine-executable instructions that can be executed by the processor; and the processor is used to read the machine-executable instructions to implement the steps of the data mining method according to any one of claims 1-10.