Similar Data Cleaning Method and Apparatus
By using a weighted average of multiple similarity calculation methods and training the EncoderQ feature encoder network, the high latency and adaptability issues of traditional methods in autonomous driving data cleaning are solved. This achieves efficient and robust similar data cleaning, adapts to diverse scenarios, and eliminates the need for manual labeling.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- COWA TECHNOLOGY CO LTD
- Filing Date
- 2023-03-11
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional similarity calculation methods suffer from high latency and poor adaptability of single algorithms in large-scale data cleaning for autonomous driving data closed loops, making them unsuitable for diverse scenarios.
A weighted average of multiple similarity calculation methods is used, combined with the EncoderQ feature encoder network for training. The EncoderQ feature encoder network encodes the image data, and similar data is removed using a similarity threshold to form a preliminary cleaned dataset, which is then cleaned a second time in the database.
It achieves high-precision, high-robustness, high-concurrency, and low-latency similar data cleaning, adapts to diverse scenarios, is easy to operate, and requires no manual labeling.
Figure CN116091872B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of autonomous driving technology, and specifically relates to a similar data cleaning method and apparatus.
Background Technology
[0002] With the advent of the data era, large-scale data cleaning is required in the closed-loop data processing of autonomous driving in order to extract effective image data. This involves cleaning the image data and removing duplicate and highly similar image data. However, traditional cleaning methods have the following problems.
[0003] First, traditional similarity calculation methods, such as SSIM, histograms, mutual information, and perceptual hashing algorithms, each suffer from high latency and poor adaptability when used alone. This is especially true in large-scale data cleaning for autonomous driving data closed loops, where computing the similarity between images across massive amounts of imported images incurs extremely high latency.
[0004] Second, a traditional single algorithm cannot adapt to diverse scenarios.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention proposes a method and apparatus for cleaning similar data. This invention can be applied in the field of autonomous driving technology to efficiently remove similar images in the closed-loop data of autonomous driving.
[0006] To achieve the above objectives, the technical solution of the present invention is as follows:
[0007] On one hand, this invention discloses a method for cleaning similar data, comprising the following steps:
[0008] S1: Collect multiple data points to form the original dataset;
[0009] S2: Input the original dataset and train the feature encoder network EncoderQ;
[0010] S3: Input all the data from the original dataset into the trained feature encoder network EncoderQ to obtain the similarity between any two data points;
[0011] S4: For any data in the original dataset, determine whether its similarity with any other data in the original dataset exceeds a threshold. If it does, remove the data; otherwise, retain the data.
[0012] After evaluating all data in the original dataset, data with high similarity are removed, resulting in a preliminary cleaned dataset.
[0013] Based on the above technical solution, the following improvements can be made:
[0014] As a preferred option, S1 also includes preprocessing the data.
[0015] As a preferred embodiment, S2 includes:
[0016] S2.1: Perform at least two different preprocessing operations on each data point in the original dataset to form a training dataset;
[0017] S2.2: Calculate the similarity between any data in the training dataset and other data using multiple similarity calculation methods, and obtain the similarity label Label of the data by weighted averaging.
[0018] S2.3: Input the training dataset into the feature encoder network EncoderQ to obtain the encoded features, and calculate the similarity S between any encoded feature and other encoded features;
[0019] S2.4: Calculate the SmoothL1 loss between similarity S and the corresponding similarity label Label, and backpropagate to update the parameters of the feature encoder network EncoderQ until the feature encoder network EncoderQ is trained.
[0020] As a preferred approach, the EncoderQ feature encoder network integrates multiple different similarity calculation methods.
[0021] As a preferred embodiment, S4 also includes:
[0022] S4.1: Input all the data from the preliminary cleaned dataset and the data from the database into the trained feature encoder network EncoderQ;
[0023] S4.2: For any data in the preliminary cleaned dataset, determine whether its similarity with any data in the database exceeds a threshold. If it does, remove the data; otherwise, retain the data.
[0024] After all the data in the preliminary cleaned dataset has been judged, data with high similarity is removed, and the remaining data is saved into the database.
[0025] On the other hand, the present invention also discloses a similar data cleaning apparatus, comprising:
[0026] Data acquisition module: The data acquisition module is used to collect multiple data points to form the original dataset;
[0027] Pre-training module: The pre-training module is used to train the feature encoder network EncoderQ using the original dataset as input;
[0028] Preliminary similarity calculation module: The preliminary similarity calculation module is used to input all data in the original dataset into the trained feature encoder network EncoderQ to obtain the similarity between any two data.
[0029] Cleaning module: The cleaning module is used to determine whether the similarity between any data in the original dataset and any other data in the original dataset exceeds a threshold. If it does, the data is removed; otherwise, the data is retained.
[0030] After evaluating all data in the original dataset, data with high similarity are removed, resulting in a preliminary cleaned dataset.
[0031] As a preferred option, the data acquisition module also includes data preprocessing.
[0032] As a preferred approach, the pre-training module includes:
[0033] Data preprocessing unit: The data preprocessing unit is used to perform at least two different preprocessing operations on each data point in the original dataset to form a training dataset;
[0034] Similarity Calculation Unit: The similarity calculation unit is used to calculate the similarity between any data in the training dataset and other data using multiple similarity calculation methods, and obtain the similarity label of the data by weighted averaging.
[0035] Feature acquisition unit: The feature acquisition unit is used to input the training dataset into the feature encoder network EncoderQ to obtain the encoded features and calculate the similarity S between any encoded feature and other encoded features;
[0036] Training Unit: The training unit is used to calculate the SmoothL1 loss between similarity S and the corresponding similarity label, and backpropagate to update the parameters of the feature encoder network EncoderQ until the feature encoder network EncoderQ is trained.
[0037] As a preferred approach, the EncoderQ feature encoder network integrates multiple different similarity calculation methods.
[0038] As a preferred embodiment, the cleaning module also includes:
[0039] Data Import Unit: The data import unit is used to input all the data from the preliminary cleaned dataset and the data from the database into the trained feature encoder network EncoderQ;
[0040] Secondary similarity calculation unit: The secondary similarity calculation unit is used to determine whether the similarity between any data in the preliminary cleaned dataset and any data in the database exceeds a threshold. If it does, the data is removed; otherwise, the data is retained.
[0041] After all the data in the preliminary cleaned dataset has been judged, data with high similarity is removed, and the remaining data is saved into the database.
[0042] This invention discloses a similar data cleaning method and apparatus, which has the following beneficial effects:
[0043] First, this invention integrates multiple similarity calculation methods through a weighted average, making it adaptable to diverse scenarios and more robust.
[0044] Secondly, this invention eliminates the need for manual labeling, making it easy to operate.
[0045] Third, the similar data cleaning method of the present invention has the characteristics of high precision, high robustness, high concurrency, and low latency.
Attached Figure Description
[0046] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0047] Figure 1 is a flowchart of the similar data cleaning method provided in an embodiment of the present invention.
[0048] Figure 2 is a diagram of the Embedding encoder network model provided in an embodiment of the present invention.
[0049] Figure 3 is a flowchart for training the feature encoder network EncoderQ provided in an embodiment of the present invention.
[0050] Figure 4 is a flowchart of the method for cleaning similar images provided in an embodiment of the present invention.
Detailed Implementation
[0051] The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0052] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0053] Furthermore, the expression "includes" is an "open-ended" expression, which means only that there is a corresponding component or step, and should not be interpreted as excluding additional components or steps.
[0054] To achieve the objectives of this invention, in some embodiments of the similar data cleaning method and apparatus, as shown in Figure 1, the similar data cleaning method includes the following steps:
[0055] S1: Collect multiple data points to form the original dataset;
[0056] S2: Input the original dataset and train the feature encoder network EncoderQ;
[0057] S3: Input all the data from the original dataset into the trained feature encoder network EncoderQ to obtain the similarity between any two data points;
[0058] S4: For any data in the original dataset, determine whether its similarity with any other data in the original dataset exceeds a threshold. If it does, remove the data; otherwise, retain the data.
[0059] After evaluating all data in the original dataset, data with high similarity are removed, resulting in a preliminary cleaned dataset.
[0060] This invention is based on the EncoderQ feature encoder network, which encodes image data into high-dimensional feature vectors. Then, the similarity between images is calculated using the high-dimensional feature vectors. By setting a similarity threshold, similar data can be filtered and cleaned.
[0061] To further optimize the implementation effect of the present invention, in some other embodiments, the remaining technical features are the same, except that S1 further includes preprocessing the data. Preprocessing includes cropping, scaling, flipping, color transformation, etc. of the data.
[0062] To further optimize the implementation effect of the present invention, in some other embodiments, the remaining technical features are the same, except that S4 further includes:
[0063] S4.1: Input all the data from the preliminary cleaned dataset and the data from the encoding database into the trained feature encoder network EncoderQ;
[0064] S4.2: For any data in the preliminary cleaned dataset, determine whether its similarity with any data in the encoding database exceeds the threshold. If it does, remove the data;
[0065] Otherwise, retain the data;
[0066] After all the data in the preliminary cleaned dataset has been judged, data with high similarity is removed, and the remaining data is saved into the encoding database.
[0067] S4 further uses the trained feature encoder network EncoderQ to encode the features of the image data and calculates the feature similarity with the encoding database. By setting a threshold, it determines whether to remove the images to be processed, thus achieving the purpose of data cleaning.
[0068] To further optimize the implementation effect of the present invention, in some other embodiments, the remaining technical features are the same, except that S2 includes:
[0069] S2.1: Perform at least two different preprocessing operations on each data point in the original dataset to form a training dataset;
[0070] S2.2: Calculate the similarity between any data in the training dataset and other data using multiple similarity calculation methods, and obtain the similarity label Label of the data by weighted averaging.
[0071] S2.3: Input the training dataset into the feature encoder network EncoderQ to obtain the encoded features, and calculate the similarity S between any encoded feature and other encoded features;
[0072] S2.4: Calculate the SmoothL1 loss between similarity S and the corresponding similarity label Label, and backpropagate to update the parameters of the feature encoder network EncoderQ until the feature encoder network EncoderQ is trained.
[0073] The aforementioned feature encoder network EncoderQ integrates multiple similarity calculation methods, including SSIM, histogram, mutual information, and perceptual hashing algorithms.
[0074] In some specific embodiments, the present invention can be applied to, but is not limited to, the problem of efficiently removing similar images in autonomous driving data closed loops, where the data is images.
[0075] During training, for each input image X to be cleaned, the positive sample similarity $S_{pos} = f(X, Pos)$ and the negative sample similarity $S_{neg} = f(X, Neg)$ are calculated using various similarity calculation methods, where f represents a traditional similarity calculation method and the similarity value lies in the range 0 to 1, with 0 representing dissimilarity and 1 representing similarity. The positive sample Pos and X are differently preprocessed images derived from the same image, and the negative sample Neg is a random image from a different source. The similarity values generated by the different traditional algorithms are weighted and averaged to obtain a pair of positive / negative similarity labels for the image.
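As an illustration only, the following minimal sketch shows two simplified stand-ins for the traditional similarity function f, each returning a value between 0 and 1. The function names, the 8-bin histogram, and the tiny 2x2 grayscale images are assumptions made for this example and are not taken from the invention; the average hash here skips the usual downscaling step.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values 0..255


def histogram_similarity(a: Image, b: Image, bins: int = 8) -> float:
    """Histogram intersection of normalized grayscale histograms, in [0, 1]."""
    def hist(img: Image) -> List[float]:
        counts = [0] * bins
        pixels = [p for row in img for p in row]
        for p in pixels:
            counts[min(p * bins // 256, bins - 1)] += 1
        return [c / len(pixels) for c in counts]

    return sum(min(x, y) for x, y in zip(hist(a), hist(b)))


def average_hash_similarity(a: Image, b: Image) -> float:
    """1 minus the normalized Hamming distance between average-hash bits."""
    def bits(img: Image) -> List[int]:
        pixels = [p for row in img for p in row]
        mean = sum(pixels) / len(pixels)
        return [1 if p > mean else 0 for p in pixels]

    bits_a, bits_b = bits(a), bits(b)
    differing = sum(x != y for x, y in zip(bits_a, bits_b))
    return 1.0 - differing / len(bits_a)


x = [[10, 200], [30, 220]]
pos = [[200, 10], [220, 30]]    # horizontally flipped copy of x
neg = [[128, 128], [128, 128]]  # unrelated flat image

print(histogram_similarity(x, pos), average_hash_similarity(x, pos))
print(histogram_similarity(x, neg), average_hash_similarity(x, neg))
```

In this toy case the flipped copy scores 1.0 on the histogram measure but 0.0 on the hash measure, which illustrates why a single method may not suit every scenario and why the scores are subsequently combined.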
[0076] The aforementioned feature encoder network EncoderQ is an Embedding encoder whose Backbone adopts the lightweight MobileNetV2 network model, as shown in Figure 2. It can extract high-dimensional semantic features from input images, with a small number of model parameters and fast running speed. The output layer is replaced with a fully connected neural network to implement feature mapping and embedding encoding.
[0077] As shown in Figure 3, each training step is described in detail below.
[0078] S2.1: Perform at least two different preprocessing operations on each data point in the original dataset to form the training dataset. Details are as follows:
[0079] Randomly sample batches of data X and X_, where $X, X\_ \in \mathbb{R}^{B \times H \times W \times C}$, B is the number of images in a sampling batch, and H, W, and C are the height, width, and number of channels of the image, respectively;
[0080] Preprocess the data X, where preprocessing includes cropping, scaling, flipping, and color transformation;
[0081] X_1 and X_2 are obtained by randomly applying two different preprocessing methods to X;
[0082] X_3 is obtained by applying one preprocessing method to X_.
[0083] For any i, X_1i and X_2i are two differently preprocessed versions of the same image data, while X_1i and X_3i are differently preprocessed versions of different image data. In this way, each image data X_1i has a pair of positive and negative samples (X_2i and X_3i) participating in the learning process. Using different preprocessing methods for different images increases the distinction between positive and negative samples, which benefits the subsequent training of the encoder.
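The construction of X_1, X_2, and X_3 can be sketched as follows. This is a minimal illustration, assuming images are small nested lists and using two flips and a color inversion as stand-ins for the cropping, scaling, flipping, and color transformation named above; the helper names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values 0..255


def hflip(img: Image) -> Image:
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    return [row[:] for row in img[::-1]]


def invert(img: Image) -> Image:
    return [[255 - p for p in row] for row in img]


# Hypothetical stand-ins for the preprocessing operations.
AUGMENTATIONS: List[Callable[[Image], Image]] = [hflip, vflip, invert]


def build_training_views(
    batch_x: List[Image], batch_other: List[Image], rng: random.Random
) -> Tuple[List[Image], List[Image], List[Image]]:
    """Return (X_1, X_2, X_3): anchor views, positives, and negatives."""
    x1, x2, x3 = [], [], []
    for img, other in zip(batch_x, batch_other):
        aug_a, aug_b = rng.sample(AUGMENTATIONS, 2)  # two different operations
        x1.append(aug_a(img))                        # anchor view of img
        x2.append(aug_b(img))                        # positive: same source image
        x3.append(rng.choice(AUGMENTATIONS)(other))  # negative: different image
    return x1, x2, x3


batch_x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
batch_other = [[[9, 10], [11, 12]], [[13, 14], [15, 16]]]
x1, x2, x3 = build_training_views(batch_x, batch_other, random.Random(0))
print(len(x1), len(x2), len(x3))
```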
[0084] S2.2: Calculate the similarity between any data point in the training dataset and other data points using multiple similarity calculation methods, and obtain the similarity label (Label) for that data point through a weighted average. Details are as follows:
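[0085] The similarity values $f(X_1, X_2)$ and $f(X_1, X_3)$ between X_1 and (X_2, X_3) are calculated using various similarity calculation methods, where $f(X_1, X_2), f(X_1, X_3) \in \mathbb{R}^{B \times 1}$. Combining them yields $f \in \mathbb{R}^{B \times 2}$. The similarity label Label is obtained using a weighted average:
[0086] $\mathrm{Label} = \sum_{i} W_i f_i$;
[0087] Where: $f_i$ is the similarity obtained by the i-th similarity calculation method, and $W_i$ is the corresponding weight.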
[0088] Traditional single similarity calculation methods suffer from high latency and poor adaptability. In contrast, this invention integrates multiple similarity calculation methods through a weighted average, making it more robust and adaptable to diverse scenarios.
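A minimal sketch of the weighted-average fusion follows. The function name, the example scores, and the equal weights are assumptions for illustration; the sketch also divides by the sum of the weights so that they need not sum to 1, which is a choice made here rather than something stated in the text.

```python
from typing import List, Sequence


def weighted_labels(
    per_method: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[List[float]]:
    """Fuse per-method similarities into a B x 2 label matrix.

    per_method[m][b] holds [similarity to positive, similarity to negative]
    for sample b as scored by method m.
    """
    total = sum(weights)
    batch_size = len(per_method[0])
    labels = []
    for b in range(batch_size):
        row = []
        for col in range(2):  # column 0: positive, column 1: negative
            fused = sum(w * sims[b][col] for w, sims in zip(weights, per_method))
            row.append(fused / total)
        labels.append(row)
    return labels


# Hypothetical scores from two methods for a batch of two images.
histogram_scores = [[1.0, 0.0], [0.8, 0.2]]
hash_scores = [[0.0, 0.5], [0.6, 0.4]]
labels = weighted_labels([histogram_scores, hash_scores], weights=[0.5, 0.5])
print(labels)
```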
[0089] S2.3: Input the training dataset into the feature encoder network EncoderQ to obtain encoded features, and calculate the similarity S between any encoded feature and other encoded features. The details are described below:
[0090] Data X_1 is encoded by the Embedding encoder Q (whose network parameters are learnable) to obtain the feature $Y_q \in \mathbb{R}^{B \times 1 \times d}$, where d is 128;
[0091] Data X_2 is encoded by the Embedding encoder K (whose network parameters are not learnable and are dynamically updated from encoder Q) to obtain the feature $Y_{k1} \in \mathbb{R}^{B \times 1 \times d}$;
[0092] X_3 is encoded by the Embedding encoder K to obtain the feature $Y_{k2} \in \mathbb{R}^{B \times 1 \times d}$;
[0093] $Y_{k1}$ and $Y_{k2}$ are merged into $Y_k = [Y_{k1}, Y_{k2}] \in \mathbb{R}^{B \times 2 \times d}$. Keeping the batch dimension of $Y_q$ and $Y_k$ fixed, the dot product over the remaining two dimensions is calculated, which is a batched matrix multiplication, to obtain the batch-wise similarity S = torch.matmul(Y_q, Y_k), with $S \in \mathbb{R}^{B \times 2}$ (the last two dimensions of $Y_k$ must be transposed for the shapes to align).
[0094] The above formula yields a B×2 matrix S.
[0095] Where: the first column represents the similarity between X_1 and the positive sample X_2, and the second column represents the similarity between X_1 and the negative sample X_3.
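The per-sample dot products can be sketched in plain Python as below. This is an illustration only: the feature vectors use d = 2 instead of 128, the helper names are hypothetical, and the L2 normalization is an assumption made here so that identical features score 1.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale v to unit length (assumed here; not stated in the text)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def pairwise_similarity(
    y_q: List[Vector], y_k1: List[Vector], y_k2: List[Vector]
) -> List[List[float]]:
    """Return S with shape B x 2: [Y_q . Y_k1, Y_q . Y_k2] for each sample."""
    return [[dot(q, k1), dot(q, k2)] for q, k1, k2 in zip(y_q, y_k1, y_k2)]


y_q = [normalize([3.0, 4.0])]
y_k1 = [normalize([3.0, 4.0])]   # feature of the positive sample
y_k2 = [normalize([4.0, -3.0])]  # feature of the negative sample
print(pairwise_similarity(y_q, y_k1, y_k2))
```

Each row of the result corresponds to one image in the batch, matching a batched product of a (B, 1, d) tensor with a (B, d, 2) tensor followed by removal of the singleton dimension.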
[0096] S2.4: Calculate the SmoothL1 loss between the similarity S and the corresponding similarity label Label, and backpropagate to update the parameters of the feature encoder network EncoderQ until the feature encoder network EncoderQ is trained. The details are as follows:
[0097] Calculate the SmoothL1 loss between similarity S and the corresponding similarity label Label, and backpropagate to update the model parameters.
[0098] ;
[0099] ;
[0100] After training, the trained feature encoder network EncoderQ is obtained.
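For reference, a minimal sketch of the SmoothL1 loss between a predicted similarity matrix S and its Label matrix is given below. It assumes the commonly used definition with a transition point beta = 1 and mean reduction over all entries; the text does not specify these details, and the example values are hypothetical.

```python
from typing import List


def smooth_l1_loss(
    pred: List[List[float]], target: List[List[float]], beta: float = 1.0
) -> float:
    """Mean SmoothL1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            diff = abs(p - t)
            if diff < beta:
                total += 0.5 * diff * diff / beta
            else:
                total += diff - 0.5 * beta
            count += 1
    return total / count


s = [[0.5, 0.0]]      # predicted similarities (positive, negative)
label = [[1.0, 0.0]]  # fused similarity labels
print(smooth_l1_loss(s, label))
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of large errors.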
[0101] The Embedding encoders Q and K described above have the same structure. During training, the parameters of Embedding encoder Q are updated by gradients, while the parameters of Embedding encoder K are updated by momentum from the parameters of Embedding encoder Q. The loss function is the SmoothL1 loss between the model's predicted similarity and the similarity label.
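One common way to realize such a momentum update is an exponential moving average of the parameters. The sketch below assumes that form and a momentum coefficient m; both the formula and the value of m are assumptions for illustration, since the text states only that encoder K is updated by momentum from encoder Q.

```python
from typing import List


def momentum_update(
    params_k: List[float], params_q: List[float], m: float = 0.999
) -> List[float]:
    """Return new K parameters: m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for k, q in zip(params_k, params_q)]


theta_k = [1.0, -2.0]
theta_q = [0.0, 2.0]
theta_k = momentum_update(theta_k, theta_q, m=0.9)
print(theta_k)
```

With m close to 1, encoder K changes slowly and follows a smoothed version of encoder Q, so no gradient is needed for K.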
[0102] Based on the above description, and as shown in Figure 4, the cleaning of similar images is described in detail below.
[0103] 1) Collect multiple data to form the original dataset, then initialize and shuffle it; initialize the encoding database E and the source image database I;
[0104] 2) Data preprocessing: Randomly select one batch of images as the input $X \in \mathbb{R}^{B \times H \times W \times C}$. The enhanced input X_ is obtained through data preprocessing.
[0105] 3) Input X_ into the trained Embedding encoder Q to calculate the encoded features $F \in \mathbb{R}^{B \times 1 \times d}$.
[0106] 4) Intra-batch similarity calculation: Calculate the similarity of the encoded features F with themselves along the batch dimension to obtain the similarity $S1 \in \mathbb{R}^{B \times B}$ between the images in X. For the i-th image $X_i$ and the j-th image $X_j$ in X, S1[i,j] represents the similarity between image $X_i$ and image $X_j$. The diagonal values S1[i,i] represent the similarity between image i and itself, which is 1.
[0107] 5) Intra-batch image cleaning: For any $X_i \in X$, if there exists $X_j \in X$ (j ≠ i) such that S1[i,j] > K, the current batch X contains an image similar to $X_i$, and $X_i$ should be removed. Otherwise, $X_i$ is not similar to any other image in X; add $F_i$ to F_ and record the index i of $F_i$;
[0108] Traverse X to obtain the pre-cleaned encoded features $F\_ \in \mathbb{R}^{B\_ \times 1 \times d}$, with B_ ≤ B.
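Steps 4) and 5) can be sketched as follows. This is a minimal illustration that follows the rule literally as written, using unit-length d = 2 feature vectors and hypothetical function names; it is not presented as the actual implementation.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def clean_within_batch(features: List[Vector], k: float = 0.85) -> List[int]:
    """Return indices i whose feature is not similar to any other in the batch."""
    kept = []
    for i, f_i in enumerate(features):
        has_similar = any(
            dot(f_i, f_j) > k for j, f_j in enumerate(features) if j != i
        )
        if not has_similar:
            kept.append(i)  # F_i joins F_ and its index is recorded
    return kept


# Unit-length example features; images 0 and 3 are identical.
batch_features = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [1.0, 0.0]]
print(clean_within_batch(batch_features))  # prints [1, 2]
```

Read literally, the rule drops every member of a group of mutually similar images, as with images 0 and 3 here. Comparing each image only against those already kept would retain one representative per group; that variant is a design alternative and is not described in the text.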
[0109] 6) Clean the remaining data: Calculate the feature similarity $S2 \in \mathbb{R}^{B\_ \times N}$ between the encoded features F_ and the encoded features $E \in \mathbb{R}^{N \times 1 \times d}$ in the encoding database. If there exists j such that S2[i,j] > K, the i-th image in X is considered similar to an image in the existing database, and $X_i$ is removed. Otherwise, image $X_i$ is considered dissimilar to all images in the database, and an insertion operation is performed: the encoded feature $F_i$ is added to the encoding database E, and image $X_i$ is added to the image database I.
[0110] Repeat the above process until all images in the dataset are traversed to obtain the final source image database I. The similarity between any two images in I is less than the threshold K, thus achieving the purpose of data cleaning.
[0111] The similarity threshold K mentioned above can be, but is not limited to, 0.85.
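The second cleaning pass against the encoding database can be sketched as below, using the example threshold K = 0.85. The in-memory lists standing in for the databases E and I, the file names, and the function names are assumptions for illustration. The sketch inserts each retained feature immediately, so later candidates are also compared against features inserted earlier in the same call.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def clean_against_database(
    features: List[Vector],
    images: List[str],
    encoding_db: List[Vector],
    image_db: List[str],
    k: float = 0.85,
) -> List[int]:
    """Insert candidates that are dissimilar to every database entry.

    Returns the positions of the candidates that were inserted.
    """
    inserted = []
    for i, (feature, image) in enumerate(zip(features, images)):
        if any(dot(feature, stored) > k for stored in encoding_db):
            continue  # similar to an existing entry: remove the image
        encoding_db.append(feature)  # add F_i to encoding database E
        image_db.append(image)       # add X_i to image database I
        inserted.append(i)
    return inserted


encoding_db = [[1.0, 0.0]]
image_db = ["existing.png"]
inserted = clean_against_database(
    [[1.0, 0.0], [0.0, 1.0]], ["a.png", "b.png"], encoding_db, image_db
)
print(inserted, image_db)  # prints [1] ['existing.png', 'b.png']
```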
[0112] This invention can be applied to the field of autonomous driving technology, providing a method for training an efficient image encoder using a semi-supervised learning approach within a data closed loop. In the data cleaning stage, the image encoder encodes images into embeddings, then calculates the similarity between these embeddings and existing image embeddings in the database. Finally, based on a data similarity threshold, it determines whether the image should be retained or removed from the database.
[0113] Furthermore, embodiments of the present invention also disclose a similar data cleaning apparatus, comprising:
[0114] Data acquisition module: The data acquisition module is used to collect multiple data points to form the original dataset;
[0115] Pre-training module: The pre-training module is used to train the feature encoder network EncoderQ using the original dataset as input;
[0116] Preliminary similarity calculation module: The preliminary similarity calculation module is used to input all data in the original dataset into the trained feature encoder network EncoderQ to obtain the similarity between any two data.
[0117] Cleaning module: The cleaning module is used to determine whether the similarity between any data in the original dataset and any other data in the original dataset exceeds a threshold. If it does, the data is removed; otherwise, the data is retained.
[0118] After evaluating all data in the original dataset, data with high similarity are removed, resulting in a preliminary cleaned dataset.
[0119] Furthermore, the data acquisition module also includes data preprocessing.
[0120] Furthermore, the pre-training module includes:
[0121] Data preprocessing unit: The data preprocessing unit is used to perform at least two different preprocessing operations on each data point in the original dataset to form a training dataset;
[0122] Similarity Calculation Unit: The similarity calculation unit is used to calculate the similarity between any data in the training dataset and other data using multiple similarity calculation methods, and obtain the similarity label of the data by weighted averaging.
[0123] Feature acquisition unit: The feature acquisition unit is used to input the training dataset into the feature encoder network EncoderQ to obtain the encoded features and calculate the similarity S between any encoded feature and other encoded features;
[0124] Training Unit: The training unit is used to calculate the SmoothL1 loss between similarity S and the corresponding similarity label, and backpropagate to update the parameters of the feature encoder network EncoderQ until the feature encoder network EncoderQ is trained.
[0125] Furthermore, the feature encoder network EncoderQ integrates multiple different similarity calculation methods.
[0126] Furthermore, the cleaning module also includes:
[0127] Data Import Unit: The data import unit is used to input all the data from the preliminary cleaned dataset and the data from the database into the trained feature encoder network EncoderQ;
[0128] Secondary similarity calculation unit: The secondary similarity calculation unit is used to determine whether the similarity between any data in the preliminary cleaned dataset and any data in the database exceeds a threshold. If it does, the data is removed; otherwise, the data is retained.
[0129] After all the data in the preliminary cleaned dataset has been judged, data with high similarity is removed, and the remaining data is saved into the database.
[0130] The specific embodiments of the similar data cleaning apparatus disclosed in this invention are similar to those of the method and will not be described again here.
[0131] This invention discloses a similar data cleaning method and apparatus, which has the following beneficial effects:
[0132] First, this invention integrates multiple similarity calculation methods through a weighted average, making it adaptable to diverse scenarios and more robust.
[0133] Secondly, this invention eliminates the need for manual labeling, making it easy to operate.
[0134] Third, the similar data cleaning method of the present invention has the characteristics of high precision, high robustness, high concurrency, and low latency.
[0135] It should be understood that the various techniques described herein can be implemented in hardware, in software, or in a combination thereof. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embedded in a tangible medium, such as a floppy disk, CD-ROM, hard disk, or any other machine-readable storage medium, wherein when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the present invention.
[0136] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "setting," "connection," "fixing," "screw connection," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal connection of two components or the interaction between two components. Unless otherwise explicitly limited, those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0137] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the present invention. Various changes and modifications can be made to the present invention without departing from its spirit and scope. All such changes and modifications fall within the scope of the present invention as claimed, which is defined by the appended claims and their equivalents.
[0138] The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it. They should not be used to limit the scope of protection of the present invention. All equivalent changes or modifications made in accordance with the spirit and essence of the present invention should be covered within the scope of protection of the present invention.
Claims
1. A method for cleaning similar data, characterized in that it includes the following steps:
S1: Collect multiple autonomous driving image data to form an original dataset;
S2: Input the original dataset and train the feature encoder network; the feature encoder network includes encoder Q and encoder K, which have the same network structure; the training process is as follows:
S2.1: Perform at least two different preprocessing operations on each image data in the original dataset to form a training dataset; the training dataset includes several training sample pairs, each training sample pair including original image data and the corresponding positive and negative samples;
S2.2: For each training sample pair in the training dataset, calculate the similarity between the original image data and its corresponding positive and negative samples using multiple similarity calculation methods, and fuse the similarities obtained by the multiple algorithms by weighted averaging to generate the similarity label Label of the original image data;
S2.3: For each training sample pair in the training dataset, perform feature extraction using encoder Q and encoder K respectively, where the parameters of encoder Q are learnable, while the parameters of encoder K are not directly learnable and are dynamically updated by momentum from the parameters of encoder Q; the feature $Y_q$ is obtained by encoding the original image data using encoder Q; the features $Y_{k1}$ and $Y_{k2}$ are obtained by encoding the positive and negative samples respectively using encoder K; $Y_{k1}$ and $Y_{k2}$ are merged into $Y_k$; the dot product of $Y_q$ and $Y_k$ is calculated to obtain the feature similarity S between the encoded features of the original image data and the encoded features of the positive and negative samples;
S2.4: Calculate the SmoothL1 loss between the feature similarity S and the corresponding similarity label Label, and backpropagate to update the parameters of encoder Q; the parameters of encoder K are updated synchronously by momentum from encoder Q until training of the feature encoder network is completed;
S3: Input all image data in the autonomous driving dataset to be cleaned into the trained encoder Q, extract the feature codes of all image data, and calculate the feature similarity between any two image data in the dataset to be cleaned based on the feature codes;
S4: For each image data in the autonomous driving dataset to be cleaned, determine whether its feature similarity with other image data in the dataset exceeds a preset threshold. If it does, the image data is classified as similar data and removed; otherwise, the image data is retained. After all the image data in the autonomous driving dataset to be cleaned has been judged, the image data with high similarity is removed to form a preliminary cleaned dataset.
2. The similar data cleaning method according to claim 1, characterized in that S1 further includes preprocessing the image data.
3. The similar data cleaning method according to claim 1 or 2, characterized in that the feature encoder network integrates multiple different similarity calculation methods.
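As an illustration of how several similarity calculation methods might be fused into a single label by weighted averaging, the sketch below combines per-method scores for one image pair. The method names follow the traditional algorithms listed in the background section (SSIM, histogram, mutual information, perceptual hashing); the score values and weights are hypothetical, and the scores are assumed to be already normalized to the range [0, 1].

```python
def fuse_similarity(scores, weights):
    """Weighted average of per-method similarity scores.

    scores and weights are dicts keyed by method name; the weights need
    not sum to 1 because the result is divided by their total.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same methods")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical scores for one (original, positive sample) pair.
scores = {"ssim": 0.9, "histogram": 0.8, "mutual_information": 0.7, "phash": 1.0}
# Hypothetical weights; the claims do not specify them.
weights = {"ssim": 0.4, "histogram": 0.2, "mutual_information": 0.2, "phash": 0.2}

label = fuse_similarity(scores, weights)
```

Because the label is computed by existing algorithms rather than by annotators, the encoder can be trained to approximate the fused score without manual labeling.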
4. The similar data cleaning method according to claim 1, characterized in that S4 further includes:
S4.1: Input all image data in the preliminary cleaned dataset and the image data in the autonomous driving database into the trained encoder Q;
S4.2: For each image data in the preliminary cleaned dataset, determine whether its similarity with any image data in the autonomous driving database exceeds a preset threshold; if it does, remove the image data; otherwise, retain the image data; after all image data in the preliminary cleaned dataset have been judged, the image data with high similarity are removed and the remaining image data are saved into the autonomous driving database.
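The two cleaning passes (S3 and S4 within the incoming data, then S4.1 and S4.2 against the database) can be sketched as follows. This is a minimal illustration under stated assumptions: feature codes are taken to be L2-normalized so that the dot product behaves like cosine similarity, the threshold value is hypothetical, and the first pass keeps the first image of each similar group, a choice the claims do not specify.

```python
import math


def normalize(v):
    """Scale a feature vector to unit length (assumed preprocessing)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def similarity(u, v):
    """Dot product of two feature codes."""
    return sum(a * b for a, b in zip(u, v))


def clean_batch(features, threshold):
    """First pass: return indices of images kept within the incoming batch.

    An image is dropped if it is more similar than the threshold to an
    image that has already been kept.
    """
    kept = []
    for idx, feat in enumerate(features):
        if all(similarity(feat, features[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept


def clean_against_database(features, kept, db_features, threshold):
    """Second pass: drop kept images that are too similar to database entries."""
    return [
        i for i in kept
        if all(similarity(features[i], d) <= threshold for d in db_features)
    ]


# Hypothetical feature codes produced by the trained encoder Q.
batch = [normalize(v) for v in ([1.0, 0.0], [0.99, 0.1], [0.0, 1.0])]
database = [normalize([0.0, 1.0])]
threshold = 0.95  # hypothetical preset threshold

preliminary = clean_batch(batch, threshold)
final = clean_against_database(batch, preliminary, database, threshold)
```

Only the images indexed by `final` would be saved into the database. At production scale the pairwise loops would normally be replaced by batched matrix products or an approximate nearest-neighbor index.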
5. A similar data cleaning apparatus, characterized in that it includes:
a data acquisition module, used to collect multiple autonomous driving image data to form an original dataset;
a pre-training module, used to train the feature encoder network by taking the original dataset as input; the feature encoder network includes encoder Q and encoder K, which have the same network structure; the pre-training module specifically includes:
a data preprocessing unit, used to perform at least two different preprocessing operations on each image data in the original dataset to form a training dataset; the training dataset includes several training sample pairs, each training sample pair including original image data and the corresponding positive and negative samples;
a similarity calculation unit, used to calculate, for each training sample pair in the training dataset, the similarity between the original image data and its corresponding positive and negative samples using multiple similarity calculation methods, and to fuse the similarities obtained by the multiple methods by weighted averaging to generate the similarity label Label of the original image data;
a feature acquisition unit, used to perform feature extraction for each training sample pair in the training dataset using encoder Q and encoder K respectively; the parameters of encoder Q are learnable, while the parameters of encoder K are not directly learnable and are dynamically updated by momentum from the parameters of encoder Q; the original image data is encoded by encoder Q to obtain the feature Y_q; the positive and negative samples are encoded by encoder K to obtain the features Y_k1 and Y_k2 respectively; Y_k1 and Y_k2 are merged to obtain Y_k; the dot product of Y_q and Y_k is calculated to obtain the feature similarity S between the encoded features of the original image data and the encoded features of the positive and negative samples;
a training unit, used to calculate the SmoothL1 loss between the feature similarity S and the corresponding similarity label Label, and to backpropagate to update the parameters of encoder Q; the parameters of encoder K are updated synchronously by momentum from encoder Q until training of the feature encoder network is completed;
a preliminary similarity calculation module, used to input all image data in the autonomous driving dataset to be cleaned into the trained encoder Q, extract the feature codes of all image data, and calculate the feature similarity between any two image data in the dataset to be cleaned based on the feature codes;
a cleaning module, used to determine, for each image data in the autonomous driving dataset to be cleaned, whether its feature similarity with other image data in the dataset to be cleaned exceeds a preset threshold; if it does, the image data is classified as similar data and removed; otherwise, the image data is retained; after all image data in the dataset to be cleaned have been judged, the image data with high similarity are removed to form a preliminary cleaned dataset.
6. The similar data cleaning apparatus according to claim 5, characterized in that the data acquisition module is further used to preprocess the image data.
7. The similar data cleaning apparatus according to claim 5 or 6, characterized in that the feature encoder network integrates multiple different similarity calculation methods.
8. The similar data cleaning apparatus according to claim 5, characterized in that the cleaning module further includes:
a data import unit, used to input all image data in the preliminary cleaned dataset and the image data in the autonomous driving database into the trained encoder Q;
a secondary similarity calculation unit, used to determine, for each image data in the preliminary cleaned dataset, whether its similarity with any image data in the autonomous driving database exceeds a preset threshold; if it does, the image data is removed; otherwise, the image data is retained; after all image data in the preliminary cleaned dataset have been judged, the image data with high similarity are removed and the remaining image data are saved into the autonomous driving database.