Data labeling method and device, electronic equipment, medium and program product
By clustering and label vector processing of multiple manually labeled results, cluster centers are formed, which solves the problem of large label differences in manual labeling methods, realizes intelligent and accurate data labeling, and ensures the uniformity of labeling standards.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date
- 2023-02-27
- Publication Date
- 2026-06-16
AI Technical Summary
Existing manual labeling methods are highly subjective, leading to significant differences in labels and making it difficult to establish unified classification standards, which affects the training effect of machine learning models.
By obtaining multiple manual annotation results, the label vectors of the data are constructed and clustered to form cluster centers. The annotation categories of the data are determined based on the cluster centers, thus realizing an automated and standardized annotation process.
It improves the intelligence and accuracy of data annotation, reduces inconsistencies in labels caused by differences in human cognition, and ensures the uniformity and comprehensiveness of annotation standards.
Figure CN116049698B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a data annotation method, apparatus, electronic device, medium, and computer program product.
Background Technology
[0002] For machine learning-based classification problems, to achieve better classification results, the model needs a dataset with high-reliability classification labels. Currently, when training the initial model, dataset acquisition still mainly relies on manual annotation. The machine needs to be told a classification standard by the user before subsequent learning and training can proceed. However, manual annotation is highly subjective; for the same classification problem, each person's understanding differs, leading to significant variations in labels depending on the annotator, and making it difficult to establish a unified classification standard.
Summary of the Invention
[0003] In view of this, the present disclosure provides a data annotation method, apparatus, electronic device, computer-readable storage medium, and computer program product that are highly intelligent and provide accurate and comprehensive annotation.
[0004] One aspect of this disclosure provides a data annotation method, comprising: obtaining annotation results produced by s individuals manually annotating each of n data points, wherein each manual annotation result is one of m annotation categories, s, m, and n are all integers greater than or equal to 1, and n is greater than m; processing the annotation results of each data point to obtain a label vector for each data point; clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding one-to-one with the m clusters, wherein each cluster center is a label vector from which the annotation category of the Ti data points of its cluster can be determined, Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m; determining, based on the cluster center, which annotation category the Ti data points of the cluster belong to; and using the annotation category to which the Ti data points belong as the annotation label of the Ti data points.
[0005] According to the data annotation method of this disclosure, by processing multiple manual annotation results, the manual annotation results can be converted into data label vectors. Based on the label vector of each data point, n data points can be clustered to form m clusters and m cluster centers corresponding one-to-one with the m clusters. Based on the cluster centers, it can be determined which annotation category each of the Ti data points belongs to. Thus, the annotation category to which the Ti data points belong is used as the annotation label for the Ti data points. Compared with traditional manual annotation methods, the data annotation method of this disclosure has a high degree of intelligence. It does not use the manual annotation results as direct data labels, but processes and clusters them, avoiding the problem that annotation labels will vary greatly depending on the annotator due to differences in human cognition and subjectivity. This makes the classification standard of annotation labels more uniform. Furthermore, processing multiple manual annotation results provides more basic data, making the annotation labels more accurate and comprehensive.
[0006] In some embodiments, processing the annotation results of each data point to obtain a label vector for each data point includes: calculating, for each data point, the proportion of each of the m annotation categories among that data point's annotation results; and using the proportions of the m annotation categories as vector elements to construct the label vector for each data point.
[0007] In some embodiments, determining which label category the Ti data points of a cluster belong to based on the cluster center includes: sorting the vector elements of the cluster center; and taking the label category corresponding to the first or last sorted vector element as the label category of the Ti data points of that cluster.
[0008] In some embodiments, the step of clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding one-to-one with the m clusters includes: Operation S41: according to the set number of clusters m, clustering the n data points using a clustering model to form m initial clusters, and randomly selecting an initial cluster center for each initial cluster; Operation S42: calculating, for each cluster, the sum of the Euclidean distances between the cluster center and each label vector in the cluster other than the cluster center, as the total cost of the cluster center; Operation S43: randomly selecting an auxiliary cluster center for the cluster, and calculating the sum of the Euclidean distances between the auxiliary cluster center and each label vector in the cluster other than the auxiliary cluster center, as the total cost of the auxiliary cluster center, wherein the auxiliary cluster center and the cluster center are not the same label vector; Operation S44: comparing the total cost of the cluster center of each of the m clusters with the total cost of its auxiliary cluster center; Operation S45: when, for k of the m clusters, the total cost of the cluster center is greater than the total cost of the auxiliary cluster center, taking the auxiliary cluster center of each of the k clusters as its new cluster center, while the cluster centers of the remaining clusters remain unchanged; Operation S46: based on the m cluster centers re-determined in Operation S45, clustering the n data points to form m clusters, and repeating Operations S42 to S44; and Operation S47: when, for each of the m clusters, the total cost of the cluster center is less than the total cost of the auxiliary cluster center, determining the cluster center of each of the m clusters as the cluster center of that cluster.
[0009] In some embodiments, the clustering model includes: a K-means algorithm model, a hierarchical clustering model, or a Gaussian mixture model.
[0010] In some embodiments, the step of clustering the n data points into m clusters based on the m cluster centers re-determined in operation S45 includes: using the m cluster centers as m target label vectors, calculating the similarity between each label vector in the remaining label vectors of the n data points (excluding the m target label vectors) and each target label vector; and classifying the label vector into the same cluster as the target label vector with the first or last similarity ranking based on the similarity ranking between each of the remaining label vectors and each target label vector.
[0011] In some embodiments, the similarity includes one of Jaccard similarity coefficient, cosine similarity, Euclidean distance, and Pearson correlation coefficient.
[0012] Another aspect of this disclosure provides a data annotation apparatus, comprising: an acquisition module, configured to acquire annotation results of s individuals manually annotating n data points, wherein each individual's annotation result is one of m annotation categories, and s, m, and n are all integers greater than or equal to 1; a processing module, configured to process the annotation result of each data point to obtain a label vector for each data point; a clustering module, configured to cluster the n data points according to the label vector of each data point, forming m clusters and m cluster centers corresponding one-to-one with the m clusters, wherein each cluster center is a label vector that can determine which annotation category each Ti data point of that cluster belongs to, where Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m; a first determining module, configured to determine which annotation category each Ti data point of that cluster belongs to based on the cluster centers; and a second determining module, configured to use the annotation category to which the Ti data points belong as the label of the Ti data points.
[0013] Another aspect of this disclosure provides an electronic device including one or more processors and one or more memories, wherein the memories are used to store executable instructions that, when executed by the processor, implement the method described above.
[0014] Another aspect of this disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are used to implement the method described above.
[0015] Another aspect of this disclosure provides a computer program product including a computer program comprising computer executable instructions that, when executed, implement the method described above.
Attached Figure Description
[0016] The above and other objects, features and advantages of this disclosure will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:
[0017] Figure 1 schematically shows an exemplary system architecture to which the methods and apparatus according to embodiments of the present disclosure can be applied;
[0018] Figure 2 schematically shows a flowchart of a data annotation method according to an embodiment of the present disclosure;
[0019] Figure 3 schematically shows a flowchart of processing the annotation results of each data point to obtain a label vector for each data point, according to an embodiment of the present disclosure;
[0020] Figure 4 schematically shows a flowchart of clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding one-to-one with the m clusters, according to an embodiment of the present disclosure;
[0021] Figure 5 schematically shows a flowchart of clustering the n data points into m clusters based on the m cluster centers re-determined in operation S45, according to an embodiment of the present disclosure;
[0022] Figure 6 schematically shows a flowchart of determining, based on the cluster center, which label category the Ti data points of a cluster belong to, according to an embodiment of the present disclosure;
[0023] Figure 7 schematically shows a block diagram of a data annotation apparatus according to an embodiment of the present disclosure; and
[0024] Figure 8 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0025] The embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of the disclosure. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure for ease of explanation. However, it will be apparent that one or more embodiments may be practiced without these specific details. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concepts of the present disclosure.
[0026] In the technical solution disclosed herein, the acquisition, storage, and application of user personal information all comply with relevant laws and regulations, necessary confidentiality measures have been taken, and there is no violation of public order and good morals. In the technical solution disclosed herein, the acquisition, collection, storage, use, processing, transmission, provision, disclosure, and application of data all comply with relevant laws and regulations, necessary confidentiality measures have been taken, and there is no violation of public order and good morals.
[0027] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0028] When using expressions such as "at least one of A, B, or C," it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, or C" should include, but is not limited to, systems having A alone, having B alone, having C alone, having A and B, having A and C, having B and C, and / or having A, B, and C, etc.). The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, features defined with "first" or "second" may explicitly or implicitly include one or more of the stated features.
[0029] For machine learning-based classification problems, to achieve better classification results, the model needs a dataset with high-reliability classification labels. Currently, when training the initial model, dataset acquisition still mainly relies on manual annotation. The machine needs to be told a classification standard by the user before subsequent learning and training can proceed. However, manual annotation is highly subjective; for the same classification problem, each person's understanding differs, leading to significant variations in labels depending on the annotator, and making it difficult to establish a unified classification standard.
[0030] Embodiments of this disclosure provide a data annotation method, apparatus, electronic device, computer-readable storage medium, and computer program product. The data annotation method includes: obtaining annotation results from *s* individuals for *n* data points, where each individual annotation result represents one of *m* annotation categories, and *s*, *m*, and *n* are integers greater than or equal to 1; processing the annotation result for each data point to obtain a label vector for each data point; clustering the *n* data points based on their label vectors to obtain cluster centers, where each cluster center is a label vector that can determine which annotation category *Ti* data points belong to in that cluster, *Ti* being an integer greater than or equal to 0 and less than *n*, and *i* being an integer greater than or equal to 1 and less than or equal to *m*; determining which annotation category *Ti* data points belong to in that cluster based on the cluster centers; and using the annotation category to which the *Ti* data points belong as the annotation labels for the *Ti* data points.
[0031] It should be noted that the data annotation methods, devices, electronic devices, computer-readable storage media and computer program products disclosed herein can be used in the field of artificial intelligence technology, or in any field other than artificial intelligence technology, such as the financial field. The field of this disclosure is not limited here.
[0032] Figure 1 schematically shows an exemplary system architecture 100 to which the data annotation methods, apparatuses, electronic devices, computer-readable storage media, and computer program products of embodiments of the present disclosure can be applied. It should be noted that Figure 1 shows merely an example of a system architecture to which the embodiments of this disclosure can be applied, in order to help those skilled in the art understand the technical content of this disclosure; it does not mean that the embodiments of this disclosure cannot be used in other devices, systems, environments or scenarios.
[0033] As shown in Figure 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc.
[0034] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, social media platform software, etc. (for example only).
[0035] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.
[0036] Server 105 can be a server that provides various services, such as a backend management server that supports websites browsed by users using terminal devices 101, 102, and 103 (for example only). The backend management server can analyze and process data such as received user requests, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices.
[0037] It should be noted that the data annotation method provided in this embodiment can generally be executed by server 105. Correspondingly, the data annotation apparatus provided in this embodiment can generally be located in server 105. The data annotation method provided in this embodiment can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105. Correspondingly, the data annotation apparatus provided in this embodiment can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105.
[0038] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0039] Based on the scenario described in Figure 1, the data annotation method of the present disclosure is described in detail below with reference to Figures 2-6.
[0040] Figure 2 schematically shows a flowchart of a data annotation method according to an embodiment of the present disclosure.
[0041] As shown in Figure 2, the data annotation method in this embodiment includes operations S210 to S250.
[0042] In operation S210, the annotation results of s individuals on n data points are obtained. Each individual's annotation result is one of m annotation categories, where s, m, and n are all integers greater than or equal to 1, and n is greater than m. For example, the annotation results of s individuals on n data points can be shown in Table 1.
[0043] Table 1
[0044]
        | Individual 1 | Individual 2 | ... | Individual i | ... | Individual s
Data 1  | A11          | A12          | ... | A1i          | ... | A1s
Data 2  | A21          | A22          | ... | A2i          | ... | A2s
...     | ...          | ...          | ... | ...          | ... | ...
Data j  | Aj1          | Aj2          | ... | Aji          | ... | Ajs
...     | ...          | ...          | ... | ...          | ... | ...
Data n  | An1          | An2          | ... | Ani          | ... | Ans
[0045] Here, A11 to Ans are the annotation results, and each annotation result is one of the m annotation categories.
[0046] In operation S220, the annotation results of each data point are processed to obtain the label vector for each data point.
[0047] As one feasible approach, as shown in Figure 3, operation S220 of processing the labeling results of each data point to obtain the label vector for each data point includes operations S221 and S222.
[0048] In operation S221, the proportion of each of the m label categories among the labeling results of each data point is calculated. Taking the data in Table 1 as an example, in the labeling results A11 to A1s of data 1, assume that label category r occurs Kr times, where Kr is an integer greater than or equal to 0, and r is an integer greater than or equal to 1 and less than or equal to m. The proportion of label category r in data 1 is then Kr / s. The proportions for data 2 to data n are calculated in the same way as for data 1 and are not repeated here.
[0049] In operation S222, the proportions of the m label categories are used as vector elements to construct the label vector for each data point. Following operation S221, assuming that in data i the proportion of label category 1 is Pi1, the proportion of label category 2 is Pi2, the proportion of label category 3 is Pi3, ..., and the proportion of label category m is Pim, the label vector for data i is obtained by using Pi1, Pi2, Pi3, ..., Pim as vector elements: Vi = {Pi1, Pi2, Pi3, ..., Pim}, where Pi1 + Pi2 + Pi3 + ... + Pim = 1. Operations S221 and S222 facilitate the processing of the labeling results for each data point to obtain the label vector for each data point.
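Operations S221 and S222 can be illustrated with a minimal Python sketch. The function name `label_vector`, the category names, and the sample annotations are hypothetical and are not part of the disclosure; the sketch only assumes that each data point arrives as a list of s annotation results.

```python
from collections import Counter


def label_vector(annotations, categories):
    """Return the proportion Kr / s of each category for one data point."""
    counts = Counter(annotations)   # Kr for each category r
    s = len(annotations)            # number of annotators
    return [counts[c] / s for c in categories]


categories = ["A", "B", "C"]         # hypothetical m = 3 label categories
annotations = ["A", "B", "A", "A"]   # hypothetical results from s = 4 individuals
vec = label_vector(annotations, categories)
print(vec)  # [0.75, 0.25, 0.0]
```

The elements of the resulting vector sum to 1, matching the constraint stated for Vi.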
[0050] In operation S230, the n data are clustered according to the label vector of each data, forming m clusters and m cluster centers corresponding to the m clusters. The cluster center is a label vector that can determine which label category each Ti data belongs to. Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m.
[0051] As one feasible approach, as shown in Figure 4, operation S230 of clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding to the m clusters includes operations S41 to S47.
[0052] In operation S41: Based on the set number of clusters m, the clustering model is used to cluster n data to form m initial clusters, and the initial cluster center of each initial cluster is randomly selected.
[0053] In some examples, the clustering model may include: K-means algorithm, hierarchical clustering, or Gaussian mixture model. Therefore, using K-means algorithm, hierarchical clustering, or Gaussian mixture model facilitates the clustering of n data points into m initial clusters, ensuring that the initial clusters and their initial cluster centers are close to the final clusters and their centers. This reduces the repetitive execution of operations S42 and S47, saving computational resources. Each initial cluster may include Ti data points, and the label vector corresponding to one of the Ti data points can be randomly selected as the initial cluster center of the initial cluster containing that Ti data point.
[0054] In operation S42: Calculate the sum of similarities between the cluster centers and every other label vector in each cluster group, which is used as the total cost of the cluster centers. It should be noted that when operation S42 is executed for the first time, the cluster centers and clusters in operation S42 are the initial cluster centers and clusters; when operation S42 is executed again, the cluster centers and clusters are the newly formed clusters. Calculating the sum of similarities between the cluster centers and every other label vector in each cluster group can be understood as calculating the sum of the Euclidean distances between the cluster centers and every other label vector in each cluster group.
[0055] In operation S43: Randomly select auxiliary cluster centers for the cluster group, and calculate the sum of similarities between the auxiliary cluster centers and every other label vector in the cluster group (excluding the auxiliary cluster centers). This sum is used as the total cost of the auxiliary cluster centers, where the auxiliary cluster centers and the cluster centers have different label vectors. Calculating the sum of similarities between the auxiliary cluster centers and every other label vector in the cluster group can be understood as calculating the sum of the Euclidean distances between the auxiliary cluster centers and every other label vector in the cluster group (excluding the auxiliary cluster centers).
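The total cost used in operations S42 and S43 can be sketched as follows, assuming label vectors are plain Python lists and Euclidean distance is the measure. The names `total_cost`, `cluster`, `current`, and `auxiliary` are illustrative only.

```python
import math


def total_cost(center, cluster):
    """Sum of Euclidean distances from `center` to the label vectors in `cluster`.

    The center itself contributes a distance of 0, so including it in
    `cluster` gives the same sum as excluding it.
    """
    return sum(math.dist(center, v) for v in cluster)


# Hypothetical cluster of three label vectors over m = 2 categories.
cluster = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
current = cluster[0]     # current cluster center
auxiliary = cluster[2]   # randomly chosen auxiliary cluster center

print(total_cost(current, cluster) > total_cost(auxiliary, cluster))  # True
```

In this example the auxiliary center has the lower total cost, so under operation S45 it would replace the current center.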
[0056] In operation S44: compare the total cost of the cluster centers of each of the m clusters with the total cost of the auxiliary cluster centers.
[0057] In operation S45: when the total cost of the cluster centers of k out of m clusters is greater than the total cost of the auxiliary cluster centers, the auxiliary cluster centers of each of the k clusters are taken as the new cluster centers, and the cluster centers of each of the remaining clusters remain unchanged.
[0058] In operation S46: Based on the m cluster centers redefined in operation S45, cluster the n data to form m clusters, and repeat operations S42 to S44.
[0059] As one feasible approach, as shown in Figure 5, operation S46 of clustering the n data points based on the m cluster centers redefined in operation S45 to form m clusters includes operations S461 and S462.
[0060] In operation S461, taking m cluster centers as m target label vectors, calculate the similarity between each label vector and each target label vector in the remaining label vectors of n data (excluding the m target label vectors).
[0061] In some examples, similarity can include one of the following: Jaccard similarity coefficient, cosine similarity, Euclidean distance, and Pearson correlation coefficient. Thus, the similarity between each label vector and each target label vector in the remaining label vectors (excluding the m target label vectors) out of n data can be obtained by calculating the Jaccard similarity coefficient, cosine similarity, Euclidean distance, or Pearson correlation coefficient.
[0062] In operation S462, based on the similarity ranking between each of the remaining label vectors and each target label vector, the label vector is assigned to the same cluster as the target label vector with the highest or lowest similarity ranking. This means that when ranking similarity from highest to lowest, the label vector is assigned to the same cluster as the target label vector with the highest similarity ranking; conversely, when ranking similarity from lowest to highest, the label vector is assigned to the same cluster as the target label vector with the lowest similarity ranking. Therefore, operations S461 and S462 facilitate clustering of n data points based on the m cluster centers redefined in operation S45.
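A minimal sketch of operations S461 and S462, assuming cosine similarity is the chosen measure, so each label vector joins the target label vector with the highest similarity. The function names and sample vectors are hypothetical. With Euclidean distance, the smallest value would be selected instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two label vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_to_centers(vectors, centers):
    """For each label vector, return the index of the most similar center."""
    return [
        max(range(len(centers)), key=lambda k: cosine_similarity(v, centers[k]))
        for v in vectors
    ]


centers = [[0.8, 0.2], [0.1, 0.9]]               # hypothetical target label vectors
vectors = [[1.0, 0.0], [0.3, 0.7], [0.6, 0.4]]   # hypothetical remaining label vectors
assignment = assign_to_centers(vectors, centers)
print(assignment)  # [0, 1, 0]
```

If two centers are equally similar to a vector, `max` returns the lower index; the disclosure does not specify how such ties are resolved.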
[0063] In operation S47: when the total cost of the cluster centers of each of the m clusters is less than the total cost of the auxiliary cluster centers, the cluster center of each of the m clusters is determined as the cluster center of that cluster. Operations S41 to S47 facilitate the clustering of n data points based on the label vector of each data point to obtain the cluster centers.
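Operations S41 to S47 can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the disclosed implementation: the initial centers are drawn at random rather than produced by a clustering model as in operation S41, reassignment uses Euclidean distance, and the loop stops after the first round in which no auxiliary center is cheaper (or after `max_rounds`). All names are hypothetical.

```python
import math
import random


def total_cost(center, members):
    """Sum of Euclidean distances from a center to its cluster members."""
    return sum(math.dist(center, v) for v in members)


def assign(vectors, centers):
    """Group every label vector with its nearest center (simplified S46)."""
    clusters = [[] for _ in centers]
    for v in vectors:
        k = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
        clusters[k].append(v)
    return clusters


def cluster_label_vectors(vectors, m, max_rounds=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, m)        # assumption: random start instead of S41's model
    clusters = assign(vectors, centers)
    for _ in range(max_rounds):
        swapped = False
        for k, members in enumerate(clusters):
            candidates = [v for v in members if v != centers[k]]
            if not candidates:
                continue                    # no distinct auxiliary center available
            aux = rng.choice(candidates)    # S43: random auxiliary center
            # S44/S45: keep whichever center has the lower total cost.
            if total_cost(centers[k], members) > total_cost(aux, members):
                centers[k] = aux
                swapped = True
        if not swapped:                     # S47: no auxiliary center was cheaper
            break
        clusters = assign(vectors, centers) # S46: re-cluster around the new centers
    return centers, clusters


vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
           [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
centers, clusters = cluster_label_vectors(vectors, m=2)
```

Because each round samples only one auxiliary center per cluster, stopping at the first round without a swap does not guarantee that no cheaper center exists; a stricter variant could test every member of each cluster before terminating.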
[0064] In operation S240, the label category to which the Ti data points of a cluster belong is determined based on the cluster center.
[0065] As one feasible approach, as shown in Figure 6, operation S240 of determining which label category the Ti data points of a cluster belong to based on the cluster center includes operations S241 and S242.
[0066] In operation S241, the vector elements of the cluster centers are sorted.
[0067] In operation S242, the label category corresponding to the first or last sorted vector element is taken as the label category of the Ti data points in that cluster. It should be noted that, assuming the cluster center of a certain cluster is Vi = {Pi1, Pi2, Pi3, ..., Pim}, after sorting Pi1, Pi2, Pi3, ..., Pim from largest to smallest, the label category corresponding to the first sorted vector element is taken as the label category of the Ti data points in that cluster; conversely, after sorting Pi1, Pi2, Pi3, ..., Pim from smallest to largest, the label category corresponding to the last sorted vector element is taken as the label category of the Ti data points in that cluster. In both cases the selected category is the one with the largest proportion in the cluster center. Therefore, operations S241 and S242 facilitate determining which label category the Ti data points in a cluster belong to based on the cluster center.
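Operations S241 and S242 amount to picking the category with the largest proportion in the cluster center. A minimal sketch, with the hypothetical function name `category_from_center` and made-up sample values:

```python
def category_from_center(center, categories):
    """Sort the center's elements from largest to smallest and return the
    category paired with the first element."""
    ranked = sorted(zip(center, categories), key=lambda pair: pair[0], reverse=True)
    return ranked[0][1]


categories = ["A", "B", "C"]     # hypothetical m = 3 label categories
center = [0.2, 0.7, 0.1]         # hypothetical cluster center Vi
label = category_from_center(center, categories)
print(label)  # B
```

Every data point in the cluster would then receive this label in operation S250. The disclosure does not say how to handle a center whose largest proportion is shared by two categories.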
[0068] In operation S250, the annotation category to which Ti data belong is used as the annotation label for Ti data.
[0069] According to the data annotation method of this disclosure, by processing multiple manual annotation results, the manual annotation results can be converted into data label vectors. Based on the label vector of each data point, n data points can be clustered to form m clusters and m cluster centers corresponding one-to-one with the m clusters. Based on the cluster centers, it can be determined which annotation category each of the Ti data points belongs to. Thus, the annotation category to which the Ti data points belong is used as the annotation label for the Ti data points. Compared with traditional manual annotation methods, the data annotation method of this disclosure has a high degree of intelligence. It does not use the manual annotation results as direct data labels, but processes and clusters them, avoiding the problem that annotation labels will vary greatly depending on the annotator due to differences in human cognition and subjectivity. This makes the classification standard of annotation labels more uniform. Furthermore, processing multiple manual annotation results provides more basic data, making the annotation labels more accurate and comprehensive.
[0070] Based on the above data annotation method, this disclosure also provides a data annotation device 10. The data annotation device 10 is described in detail below with reference to Figure 7.
[0071] Figure 7 schematically shows a block diagram of a data annotation apparatus 10 according to an embodiment of the present disclosure.
[0072] The data annotation device 10 includes an acquisition module 1, a processing module 2, a clustering module 3, a first determination module 4, and a second determination module 5.
[0073] The acquisition module 1 is used to perform operation S210: obtain the annotation results of s individuals for n data points respectively, where each individual's annotation result is one of m annotation categories, and s, m and n are all integers greater than or equal to 1.
[0074] Processing module 2 is used to perform operation S220: process the annotation results of each data to obtain the label vector of each data.
[0075] Clustering module 3 is used to perform operation S230: cluster the n data according to the label vector of each data to form m clusters and m cluster centers corresponding one-to-one with the m clusters. The cluster center is a label vector that can determine which label category the Ti data of its cluster belong to. Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m.
[0076] The first determining module 4 is used to perform operation S240: determine which label category each of the Ti data points of the cluster belongs to based on the cluster center.
[0077] The second determining module 5 is used to perform operation S250: take the label category to which the Ti data belong as the annotation label of the Ti data.
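The module chain described in paragraphs [0072] to [0077] can be pictured as a set of pluggable callables run in sequence. The following Python sketch is illustrative only: the class name `DataAnnotationDevice`, its field names, the return shape of `cluster`, and the stub functions in the usage example are all assumptions made for this example, not names or interfaces taken from this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class DataAnnotationDevice:
    """Hypothetical wiring of modules 1 to 5; all names are illustrative."""
    acquire: Callable[[], List[List[str]]]            # module 1, operation S210
    process: Callable[[List[str]], Vector]            # module 2, operation S220
    # Module 3, operation S230: returns (centers, members), where members[i]
    # lists the indices of the data points placed in cluster i (assumed shape).
    cluster: Callable[[List[Vector]], Tuple[List[Vector], List[List[int]]]]
    decide_category: Callable[[Vector], str]          # module 4, operation S240

    def run(self) -> List[str]:
        results = self.acquire()
        vectors = [self.process(r) for r in results]
        centers, members = self.cluster(vectors)
        labels = [""] * len(vectors)
        for center, indices in zip(centers, members):
            category = self.decide_category(center)
            for index in indices:
                labels[index] = category              # module 5, operation S250
        return labels


# Usage with stub modules: two data points, three annotators, categories "a" and "b".
device = DataAnnotationDevice(
    acquire=lambda: [["a", "a", "b"], ["b", "b", "a"]],
    process=lambda r: [r.count("a") / len(r), r.count("b") / len(r)],
    cluster=lambda vs: (vs, [[i] for i in range(len(vs))]),  # stub: one cluster per point
    decide_category=lambda c: "a" if c[0] >= c[1] else "b",
)
labels = device.run()
```

Keeping each module behind a plain callable mirrors the statement in the description that modules may be combined or split; a real implementation could equally use separate classes.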
[0078] According to some embodiments of this disclosure, the processing module may include a first calculation unit and a first determination unit.
[0079] The first calculation unit is used to calculate, for each data, the proportion of its annotation results that fall into each of the m label categories.
[0080] The first determining unit is used to construct the label vector of each data by taking the proportion of each of the m label categories as vector elements.
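As a minimal sketch of the two units above, the proportions can be computed by counting how many of the s annotators chose each category. The function name `label_vector` and the sample categories and votes are assumptions made for illustration.

```python
from collections import Counter


def label_vector(annotations, categories):
    """Return the share of annotators who chose each category, in category order."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in categories]


categories = ["positive", "neutral", "negative"]          # hypothetical m = 3 categories
votes = ["positive", "positive", "neutral", "positive"]   # hypothetical s = 4 annotators
vector = label_vector(votes, categories)
```

With these sample votes, three of four annotators chose the first category and one chose the second, so the vector elements are 0.75, 0.25 and 0.0.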
[0081] According to some embodiments of this disclosure, the first determining module may include a first sorting unit and a second determining unit.
[0082] The first sorting unit is used to sort the vector elements of the cluster centers.
[0083] The second determining unit is used to take the label category corresponding to the first or last sorted vector element as the label category of the Ti data of that cluster.
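One way to read the sorting step, assuming a descending sort so that the first element is the largest proportion, is sketched below. The function name `category_from_center` and the sample values are hypothetical.

```python
def category_from_center(center, categories):
    """Pick the category whose proportion in the cluster center is largest.

    Sorting is descending, so the first element wins; Python's sort is stable,
    so ties go to the category listed first.
    """
    ranked = sorted(zip(center, categories), key=lambda pair: pair[0], reverse=True)
    return ranked[0][1]


center = [0.2, 0.7, 0.1]                            # hypothetical cluster center
categories = ["positive", "neutral", "negative"]    # hypothetical categories
chosen = category_from_center(center, categories)
```

With an ascending sort the same category would be the last element, which matches the "first or last" wording of the paragraph above.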
[0084] According to some embodiments of this disclosure, the clustering module may include a selection unit, a second calculation unit, a third calculation unit, a comparison unit, a third determination unit, a repeated execution unit, and a fourth determination unit.
[0085] The selection unit is used to perform operation S41: based on the set number of clusters m, the clustering model is used to cluster n data to form m initial clusters, and the initial cluster center of each initial cluster is randomly selected.
[0086] The second calculation unit is used to perform operation S42: calculate the sum of the Euclidean distances between the cluster center and each label vector other than the cluster center in each cluster, as the total cost of the cluster center.
[0087] The third calculation unit is used to perform operation S43: randomly select auxiliary cluster centers of the cluster group, calculate the sum of the Euclidean distances between the auxiliary cluster centers and each label vector in the cluster group other than the auxiliary cluster centers, and use it as the total cost of the auxiliary cluster centers, where the auxiliary cluster centers and the cluster centers are different label vectors.
[0088] The comparison unit is used to perform operation S44: compare the total cost of the cluster centers of each of the m clusters with the total cost of the auxiliary cluster centers.
[0089] The third determining unit is used to perform operation S45: when the total cost of the cluster centers of k clusters out of the m clusters is greater than the total cost of the auxiliary cluster centers, the auxiliary cluster centers of each of the k clusters are taken as new cluster centers, and the cluster centers of each of the remaining clusters remain unchanged.
[0090] The repeated execution unit is used to perform operation S46: based on the m cluster centers re-determined in operation S45, cluster the n data to form m clusters, and repeatedly execute operations S42 to S44.
[0091] The fourth determining unit is used to perform operation S47: when the total cost of the cluster center of each of the m clusters is less than the total cost of the auxiliary cluster center, the cluster center of each of the m clusters is determined as the cluster center of that cluster.
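Operations S41 to S47 resemble a randomized medoid-swap search, and can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: S41 is reduced to sampling m distinct label vectors as initial centers (no prior clustering model is run), S46 assigns each vector to its nearest center by Euclidean distance, the `max_rounds` cap is added for safety, and all function names are hypothetical.

```python
import math
import random


def total_cost(center, cluster):
    """S42/S43: sum of Euclidean distances from a center to the vectors of its cluster.

    The center itself contributes a distance of 0, so it need not be excluded.
    """
    return sum(math.dist(center, v) for v in cluster)


def assign(vectors, centers):
    """S46 (assumed rule): place each vector in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in vectors:
        nearest = min(range(len(centers)), key=lambda i: math.dist(v, centers[i]))
        clusters[nearest].append(v)
    return clusters


def cluster_label_vectors(vectors, m, rng, max_rounds=100):
    centers = rng.sample(vectors, m)                 # simplified S41
    for _ in range(max_rounds):
        clusters = assign(vectors, centers)
        swapped = False
        for i, cluster in enumerate(clusters):
            candidates = [v for v in cluster if v != centers[i]]
            if not candidates:
                continue                             # no auxiliary center available
            auxiliary = rng.choice(candidates)       # S43: random auxiliary center
            # S44/S45: keep whichever center has the lower total cost.
            if total_cost(auxiliary, cluster) < total_cost(centers[i], cluster):
                centers[i] = auxiliary
                swapped = True
        if not swapped:                              # S47: no center was replaced
            break
    return centers, assign(vectors, centers)


# Hypothetical label vectors for n = 6 data points and m = 2 categories.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
centers, clusters = cluster_label_vectors(vectors, 2, random.Random(0))
```

Because the auxiliary centers are drawn at random, the search can stop at a local optimum; the result depends on the seed passed to `random.Random`.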
[0092] According to some embodiments of this disclosure, the repeated execution unit may include a computing element and a partitioning element.
[0093] The computing element is used to take the m cluster centers as m target label vectors and to calculate the similarity between each of the remaining label vectors of the n data (excluding the m target label vectors) and each target label vector.
[0094] The partitioning element is used to place each of the remaining label vectors in the same cluster as the target label vector ranked highest or lowest in similarity to it, based on the similarity ranking of that label vector with respect to each target label vector.
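The two elements above can be sketched with cosine similarity, one of the similarity measures named in the claims, where the highest-ranked target wins. The function names and sample vectors are assumptions for illustration; a distance-type measure such as Euclidean distance would instead pick the lowest-ranked value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def partition_by_similarity(remaining, targets):
    """Place each remaining label vector in the cluster of its most similar target."""
    clusters = [[t] for t in targets]     # each cluster starts with its target vector
    for v in remaining:
        best = max(range(len(targets)), key=lambda i: cosine_similarity(v, targets[i]))
        clusters[best].append(v)
    return clusters


targets = [(1.0, 0.0), (0.0, 1.0)]        # hypothetical m = 2 cluster centers
remaining = [(0.8, 0.2), (0.3, 0.7)]      # hypothetical remaining label vectors
clusters = partition_by_similarity(remaining, targets)
```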
[0095] Since the data annotation device 10 described above is based on the data annotation method, the beneficial effects of the data annotation device 10 are the same as those of the data annotation method, and will not be repeated here.
[0096] Furthermore, according to embodiments of this disclosure, any two or more of the acquisition module 1, processing module 2, clustering module 3, first determination module 4, and second determination module 5 can be combined into one module, or any one of these modules can be split into multiple modules. Alternatively, at least some of the functions of one or more of these modules can be combined with at least some of the functions of other modules and implemented in one module.
[0097] According to embodiments of this disclosure, at least one of the acquisition module 1, processing module 2, clustering module 3, first determination module 4, and second determination module 5 can be at least partially implemented as a hardware circuit, such as a field-programmable gate array (FPGA), programmable logic array (PLA), system-on-a-chip, system-on-a-substrate, system-on-package, or application-specific integrated circuit (ASIC), or implemented in hardware or firmware by any other reasonable means of integrating or packaging circuitry, or implemented in any one of software, hardware, and firmware, or in a suitable combination of any of them.
[0098] Alternatively, at least one of the acquisition module 1, processing module 2, clustering module 3, first determination module 4, and second determination module 5 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.
[0099] Figure 8 schematically shows a block diagram of an electronic device suitable for implementing the above-described method according to an embodiment of the present disclosure.
[0100] As shown in Figure 8, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage portion 908 into a random access memory (RAM) 903. The processor 901 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and / or an associated chipset and / or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 901 may also include onboard memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present disclosure.
[0101] RAM 903 stores various programs and data required for the operation of electronic device 900. Processor 901, ROM 902, and RAM 903 are interconnected via bus 904. Processor 901 performs various operations of the method flow according to embodiments of the present disclosure by executing programs in ROM 902 and / or RAM 903. It should be noted that the programs may also be stored in one or more memories other than ROM 902 and RAM 903. Processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in said one or more memories.
[0102] According to embodiments of this disclosure, the electronic device 900 may further include an input / output (I / O) interface 905, which is also connected to a bus 904. The electronic device 900 may also include one or more of the following components connected to the I / O interface 905: an input section 906 including a keyboard, mouse, etc.; an output section 907 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card, modem, etc. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the input / output (I / O) interface 905 as needed. A removable medium 911, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 910 as needed so that computer programs read from it can be installed into the storage section 908 as needed.
[0103] This disclosure also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.
[0104] According to embodiments of this disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this disclosure, the computer-readable storage medium may include ROM 902 and / or RAM 903 and / or one or more memories other than ROM 902 and RAM 903 described above.
[0105] Embodiments of this disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods of the embodiments of this disclosure.
[0106] When the computer program is executed by the processor 901, it performs the functions defined in the system / apparatus of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0107] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and downloaded and installed via the communication section 909, and / or installed from a removable medium 911. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0108] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 909, and / or installed from the removable medium 911. When the computer program is executed by the processor 901, it performs the functions defined in the system of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0109] According to embodiments of this disclosure, program code for executing the computer programs provided in embodiments of this disclosure can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code can execute entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0110] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0111] Those skilled in the art will understand that the features described in the various embodiments and / or claims of this disclosure can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this disclosure. In particular, the features described in the various embodiments and / or claims of this disclosure can be combined and / or integrated in various ways without departing from the spirit and teachings of this disclosure. All such combinations and / or integrations fall within the scope of this disclosure.
[0112] The embodiments of this disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of this disclosure. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. The scope of this disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of this disclosure, and all such substitutions and modifications should fall within the scope of this disclosure.
Claims
1. A data annotation method, characterized in that it comprises:
obtaining the annotation results of s individuals for n data points, wherein each individual's annotation result is one of m annotation categories, and s, m, and n are all integers greater than or equal to 1, with n being greater than m;
processing the annotation results of each data point to obtain the label vector of each data point;
clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding to the m clusters, wherein the cluster center is the label vector that can determine which label category the Ti data points of the cluster belong to, Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m;
determining, based on the cluster center, which label category each of the Ti data points of the cluster belongs to; and
using the label category to which the Ti data points belong as the annotation label of the Ti data points;
wherein processing the annotation results of each data point to obtain the label vector of each data point comprises:
calculating, for each data point, the proportion of each of the m label categories among its annotation results; and
constructing the label vector of each data point with the proportions of the m label categories as vector elements.
2. The method according to claim 1, characterized in that determining, based on the cluster center, which label category each of the Ti data points of the cluster belongs to comprises:
sorting the vector elements of the cluster center; and
taking the label category corresponding to the first or last vector element in the sorted order as the label category of the Ti data points of that cluster.
3. The method according to claim 1, characterized in that clustering the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding one-to-one with the m clusters comprises:
Operation S41: based on the set number of clusters m, using a clustering model to cluster the n data points to form m initial clusters, and randomly selecting the initial cluster center of each initial cluster;
Operation S42: calculating the sum of the Euclidean distances between the cluster center and each label vector other than the cluster center in each cluster, as the total cost of the cluster center;
Operation S43: randomly selecting an auxiliary cluster center of the cluster, and calculating the sum of the Euclidean distances between the auxiliary cluster center and each label vector in the cluster other than the auxiliary cluster center, as the total cost of the auxiliary cluster center, wherein the auxiliary cluster center and the cluster center are different label vectors;
Operation S44: comparing the total cost of the cluster center of each of the m clusters with the total cost of the auxiliary cluster center;
Operation S45: when the total cost of the cluster centers of k of the m clusters is greater than the total cost of the auxiliary cluster centers, taking the auxiliary cluster center of each of the k clusters as the new cluster center, the cluster centers of the remaining clusters remaining unchanged;
Operation S46: based on the m cluster centers re-determined in operation S45, clustering the n data points to form m clusters, and repeating operations S42 to S44; and
Operation S47: when the total cost of the cluster center of each of the m clusters is less than the total cost of the auxiliary cluster center, determining the cluster center of each of the m clusters as the cluster center of that cluster.
4. The method according to claim 3, characterized in that the clustering model includes a K-means algorithm model, a hierarchical clustering model, or a Gaussian mixture model.
5. The method according to claim 3, characterized in that clustering the n data points to form m clusters based on the m cluster centers re-determined in operation S45 comprises:
taking the m cluster centers as m target label vectors, and calculating the similarity between each of the remaining label vectors of the n data points (excluding the m target label vectors) and each of the target label vectors; and
based on the similarity ranking of each of the remaining label vectors with respect to each of the target label vectors, placing that label vector in the same cluster as the target label vector ranked highest or lowest in similarity.
6. The method according to claim 5, characterized in that the similarity includes one of the following: Jaccard similarity coefficient, cosine similarity, Euclidean distance, or Pearson correlation coefficient.
7. A data annotation device, characterized in that it comprises:
an acquisition module, used to acquire the annotation results of s individuals for n data points, wherein each individual's annotation result is one of m annotation categories, and s, m, and n are all integers greater than or equal to 1;
a processing module, used to process the annotation results of each data point to obtain the label vector of each data point;
a clustering module, used to cluster the n data points according to the label vector of each data point to form m clusters and m cluster centers corresponding one-to-one with the m clusters, wherein the cluster center is the label vector that can determine which label category the Ti data points of the cluster belong to, Ti is an integer greater than or equal to 0 and less than n, and i is an integer greater than or equal to 1 and less than or equal to m;
a first determining module, used to determine, based on the cluster center, which label category each of the Ti data points of the cluster belongs to; and
a second determining module, used to take the label category to which the Ti data points belong as the annotation label of the Ti data points;
wherein processing the annotation results of each data point to obtain the label vector of each data point comprises:
calculating, for each data point, the proportion of each of the m label categories among its annotation results; and
constructing the label vector of each data point with the proportions of the m label categories as vector elements.
8. An electronic device, characterized in that it comprises:
one or more processors; and
one or more memories for storing executable instructions that, when executed by the processor, implement the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the storage medium stores executable instructions that, when executed by a processor, implement the method according to any one of claims 1 to 6.
10. A computer program product, characterized in that it comprises a computer program including one or more executable instructions that, when executed by a processor, implement the method according to any one of claims 1 to 6.