Data classification method, apparatus, device, and storage medium

By using a closed-loop approach of automatically filtering and supplementing rules on the classification platform, the problems of accuracy and efficiency in data classification are solved, the performance of the model and the automation capabilities of the classification platform are improved, and the complexity of manual annotation is reduced.

CN122241313A · Pending · Publication Date: 2026-06-19 · TENCENT TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECH (BEIJING) CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing data classification methods suffer from poor accuracy and low efficiency. In particular, during model training, individual differences in the understanding of classification rules and standards among annotators make it difficult to guarantee the classification quality of challenging data, resulting in high manual annotation costs and low efficiency.

Method used

A data classification method is provided, which receives standard text of classification rules and data to be classified through a data input interface displayed on a classification platform, automatically filters out confusing data, and inputs supplementary text of classification rules in the rule supplementation interface. Based on the standard text and the supplementary text, the classification result of the confusing data is determined, thereby achieving accurate classification.

Benefits of technology

It improves the accuracy and consistency of data classification, provides high-quality training samples, enhances the robustness and generalization ability of the model, improves the accuracy and efficiency of real-time classification, and lowers the professional threshold.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a data classification method, apparatus, device, and storage medium, applicable to scenarios such as data generation. The method includes: displaying a data input interface and receiving a classification rule standard text and N unclassified data points input by a first object in the data input interface; selecting P confusing data points from the N unclassified data points based on the classification rule standard text, and determining Q confusing classification pairs corresponding to these P confusing data points; displaying a rule supplement interface showing these Q confusing classification pairs; receiving supplementary classification rule text for the Q confusing classification pairs input by the first object in the rule supplement interface; and determining the classification result of the P confusing data points based on the supplementary classification rule text and the classification rule standard text. In other words, this application can improve the accuracy and efficiency of classifying difficult data, and through a visual interface, transform complex classification work into simple interactive operations, lowering the professional threshold.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a data classification method, apparatus, device, and storage medium.

Background Technology

[0002] In the fields of artificial intelligence and big data analytics, data classification is a fundamental and core technical task, widely applied in model training, computer vision, intelligent recommendation, and content security. The goal of data classification is to categorize text, images, videos, or audio data into the correct categories according to established rules or standards. High-quality classification results are a crucial prerequisite for subsequent data analysis, decision support, and machine learning model training.

[0003] However, current data classification methods suffer from poor accuracy and low efficiency. For example, in model training scenarios, in order to obtain high-quality training samples, annotators typically label massive amounts of data line by line according to classification standard documents, which leads to the inability to guarantee the classification quality of some difficult data.

Summary of the Invention

[0004] This application provides a data classification method, apparatus, device, and storage medium that can improve the accuracy and efficiency of data classification.

[0005] In a first aspect, this application provides a data classification method applied to a classification platform, the method comprising: displaying a data input interface and receiving a classification rule standard text and N data items to be classified input by a first object in the data input interface, where the data to be classified includes at least one of text data, image data, video data, or audio data, and N is a positive integer; selecting P confusing data items from the N data items to be classified based on the classification rule standard text, and determining Q confusing classification pairs corresponding to the P confusing data items, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P; displaying a rule supplementation interface that shows the Q confusing classification pairs, and receiving classification rule supplementary text for the Q confusing classification pairs input by the first object in the rule supplementation interface; and determining the classification results of the P confusing data items based on the classification rule supplementary text and the classification rule standard text.

[0006] In a second aspect, this application provides a data classification method applied to a manual classification terminal, the method comprising: receiving a classification request sent by a classification platform, where the classification request includes M disputed data items and a classification rule standard text and is used to request manual classification of the M disputed data items; the M disputed data items are selected from N data items to be classified based on K classification results for each data item to be classified, and the K classification results of a disputed data item are not completely consistent; the K classification results are obtained by classifying the data to be classified with K first classification models based on the classification rule standard text, and the K first classification models are selected by a first object from a classification model list included in a data input interface; the data to be classified includes at least one of text data, image data, video data, or audio data; displaying a classification interface that includes a data display area and a classification area, where the data display area displays the disputed data and the classification rule standard text; receiving the manual classification result of each disputed data item input by a second object in the classification area; and sending the manual classification results of the M disputed data items to the classification platform.

[0007] In some embodiments, the data input interface further includes a classification model list and a parallel classification start option, and the confusing data determination unit is specifically configured to: in response to the first object's selection operation on K first classification models in the classification model list, change the state of the K first classification models from an unselected state to a selected state, where K is a positive integer greater than 1; in response to the first object's triggering operation on the parallel classification start option, classify the N data items to be classified with each of the K first classification models based on the classification rule standard text, obtaining K classification results for each data item to be classified; and select the P confusing data items from the N data items to be classified based on the K classification results for each data item to be classified.

[0008] In some embodiments, the obfuscated data determination unit is specifically configured to: select M disputed data from the N unclassified data based on K classification results for each unclassified data, wherein the K classification results for the disputed data are not completely consistent, and M is a positive integer less than or equal to N; display a consistency analysis interface based on the M disputed data, the consistency analysis interface including allocation options; in response to the first object's triggering operation on the allocation options, send a classification request to a manual classification terminal, the classification request including the M disputed data and the classification rule standard text, the classification request being used to request manual classification of the M disputed data; obtain the manual classification results of the M disputed data from the manual classification terminal; and select the P obfuscated data from the M disputed data based on the manual classification results of the M disputed data.
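The selection of disputed data can be illustrated with a minimal sketch. It assumes that each data item already has K labels, one per first classification model, and treats an item as disputed when those labels are not all identical. The function name `select_disputed` and the sample labels are hypothetical and do not come from the application.

```python
from typing import Dict, List


def select_disputed(results: Dict[str, List[str]]) -> List[str]:
    """Return the IDs of items whose K model labels are not completely consistent."""
    # An item is disputed when its K labels contain more than one distinct value.
    return [item_id for item_id, labels in results.items() if len(set(labels)) > 1]


# Hypothetical output of K = 3 first classification models for N = 3 items.
k_results = {
    "d1": ["Finance", "Finance", "Finance"],   # unanimous, not disputed
    "d2": ["Finance", "Business", "Finance"],  # disputed
    "d3": ["Sports", "Sports", "Health"],      # disputed
}
disputed = select_disputed(k_results)
```

Under this assumption, only the M disputed items would be forwarded to the manual classification terminal, while unanimous items keep the shared label.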

[0009] In some embodiments, the confusing data determination unit is specifically configured to: display a cross-validation interface that includes a cross-validation start option; in response to the first object's triggering operation on the cross-validation start option, divide the M disputed data items into R data subsets, where R is a positive integer greater than 1 and less than M; train a second classification model for R rounds based on the R data subsets, where, for the i-th round of training, R-1 data subsets are selected from the R data subsets as R-1 training sets that are not completely the same as the R-1 training sets selected in any round before the i-th round, and i is a positive integer from 1 to R; train the second classification model using the R-1 training sets to obtain the second classification model after the i-th round of training; use the second classification model after the i-th round of training to predict the classification of each disputed data item in the i-th validation set, obtaining a predicted classification result for each disputed data item in the i-th validation set, where the i-th validation set is the data subset among the R data subsets that did not participate in the i-th round of training and differs from the validation set of every earlier round; and select the P confusing data items from the M disputed data items based on the predicted classification result and the manual classification result of each of the M disputed data items.
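The round structure described here resembles ordinary R-fold cross-validation. The sketch below only shows how the training and validation sets of each round could be formed; it assumes a simple round-robin split, which the application does not prescribe, and it omits the training of the second classification model. The name `r_fold_rounds` is hypothetical.

```python
from typing import List, Sequence, Tuple


def r_fold_rounds(items: Sequence[str], r: int) -> List[Tuple[List[str], List[str]]]:
    """Split M items into R subsets and return (training, validation) per round."""
    if not 1 < r < len(items):
        raise ValueError("R must be greater than 1 and less than M")
    # Round-robin split (an assumption): subset i holds items i, i + R, i + 2R, ...
    subsets = [list(items[i::r]) for i in range(r)]
    rounds = []
    for i in range(r):
        validation = subsets[i]  # the one subset held out in round i
        training = [x for j, s in enumerate(subsets) if j != i for x in s]
        rounds.append((training, validation))
    return rounds


rounds = r_fold_rounds(["a", "b", "c", "d", "e", "f"], r=3)
```

Because each round holds out a different subset, every disputed item receives exactly one prediction from a model that was not trained on it.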

[0010] In some embodiments, the confusing data determination unit is specifically configured to select T discrepancy data points from the M disputed data points where the predicted classification result and the manual classification result are inconsistent, where T is a positive integer less than or equal to M; for each of the T discrepancy data points, determine the confidence level of the predicted classification result of the second classification model for the discrepancy data point; and based on the confidence level corresponding to each of the T discrepancy data points, select P discrepancy data points from the T discrepancy data points whose confidence level is less than a preset threshold as the P confusing data points.
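The two-stage filter can be sketched as follows, assuming each disputed item already carries its manual label, its predicted label, and a confidence value for the prediction. The class `DisputedItem`, the function `select_confusing`, and the threshold of 0.6 are illustrative assumptions, not values from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisputedItem:
    item_id: str
    manual_label: str
    predicted_label: str
    confidence: float  # confidence of the second classification model's prediction


def select_confusing(items: List[DisputedItem], threshold: float) -> List[DisputedItem]:
    """Keep items whose labels disagree and whose confidence is below the threshold."""
    # Stage 1: the T discrepancy items, where prediction and manual label differ.
    discrepancies = [it for it in items if it.predicted_label != it.manual_label]
    # Stage 2: the P confusing items, with confidence below the preset threshold.
    return [it for it in discrepancies if it.confidence < threshold]


items = [
    DisputedItem("d1", "Finance", "Finance", 0.95),   # labels agree, dropped
    DisputedItem("d2", "Finance", "Business", 0.40),  # disagree, low confidence, kept
    DisputedItem("d3", "Sports", "Health", 0.90),     # disagree, high confidence, dropped
]
confusing = select_confusing(items, threshold=0.6)
```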

[0011] In some embodiments, the obfuscated data determination unit is specifically used to obtain the word sequence corresponding to the predicted classification result of the difference data, the word sequence including L words, where L is a positive integer; obtain the log probability corresponding to each word in the process of the second classification model generating the word sequence; sum the log probabilities corresponding to the L words to obtain a cumulative value; and perform an exponential operation on the cumulative value to obtain the confidence score of the second classification model for the predicted classification result.
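Summing the L log probabilities and exponentiating the sum is equivalent to multiplying the L per-word probabilities, that is, confidence = exp(log p_1 + ... + log p_L) = p_1 × ... × p_L. A minimal sketch of the computation follows; the function name `sequence_confidence` and the sample probabilities are hypothetical.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Confidence of a generated label: exp of the summed per-word log probabilities."""
    if not token_logprobs:
        raise ValueError("the word sequence must contain at least one word")
    cumulative = sum(token_logprobs)  # cumulative value over the L words
    return math.exp(cumulative)       # exponential operation gives the confidence


# Hypothetical two-word label whose words were generated with probability 0.9 and 0.8.
confidence = sequence_confidence([math.log(0.9), math.log(0.8)])
```

Working in log space avoids numerical underflow for long word sequences, and the result stays between 0 and 1 because every log probability is at most zero.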

[0012] In some embodiments, the cross-validation interface further includes a validation result display area, and the confusing data determination unit is further configured to display validation result information in the validation result display area after selecting the P confusing data items from the M disputed data items, the validation result information including at least one of the training set and validation set selected in each of the R rounds of training, and cross-validation conclusion information, the cross-validation conclusion information including at least the number of the P confusing data items.

[0013] In some embodiments, the cross-validation interface further includes a confusion analysis option, and the confusing data determination unit is specifically configured to: in response to the first object's triggering operation on the confusion analysis option, determine a confusion matrix corresponding to the P confusing data items based on the manual classification results and predicted classification results of the P confusing data items, wherein each element of the confusion matrix indicates the number of data items for which the manual classification result is one given category and the predicted classification result is another given category; and determine the Q confusing classification pairs based on the confusion matrix.
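One way the step from confusion matrix to confusing classification pairs might look is sketched below. The text here does not state how the Q pairs are read off the matrix, so the sketch assumes that the two directions of a confusion (A labeled as B, and B labeled as A) are merged into one unordered pair and that the Q most frequent pairs are kept. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def confusion_matrix(labels: Iterable[Tuple[str, str]]) -> Counter:
    """Count (manual label, predicted label) combinations as a sparse matrix."""
    return Counter(labels)


def confusing_pairs(matrix: Counter, q: int) -> List[Tuple[str, str]]:
    """Return the Q most frequent unordered category pairs (an assumed ranking rule)."""
    merged: Counter = Counter()
    for (manual, predicted), count in matrix.items():
        if manual != predicted:
            # Merge (A, B) and (B, A) into one unordered pair.
            merged[tuple(sorted((manual, predicted)))] += count
    return [pair for pair, _ in merged.most_common(q)]


# Hypothetical (manual, predicted) labels of P = 4 confusing items.
matrix = confusion_matrix([
    ("Finance", "Business"),
    ("Business", "Finance"),
    ("Finance", "Business"),
    ("Sports", "Health"),
])
top_pair = confusing_pairs(matrix, q=1)
```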

[0014] In some embodiments, the rule supplementation unit is further configured to display an obfuscation analysis result interface in response to a triggering operation of the first object on the obfuscation analysis option, the obfuscation analysis result interface including the Q obfuscation classification pairs and the rule supplementation option; and to display the rule supplementation interface in response to a triggering operation of the first object on the rule supplementation option.

[0015] In some embodiments, the rule supplementation interface includes a directory area and an editing area. The editing area includes an existing standard display area, a rule supplementation box, and an obfuscation example display area. The directory area displays Q entries corresponding one-to-one with the Q obfuscation classification pairs, and each entry has a status identifier indicating the rule supplementation completion status. The rule supplementation unit is specifically used to respond to a trigger operation by the first object on a first directory entry in the directory area that is in an unprocessed state, displaying an existing classification rule summary of the first obfuscation classification pair corresponding to the first directory entry in the existing standard display area, wherein the first obfuscation classification pair includes two easily obfuscated categories; displaying data information of at least one obfuscated data associated with the first obfuscation classification pair in the obfuscation example display area; and receiving the classification rule supplementation text input by the first object in the rule supplementation box for the first obfuscation classification pair.

[0016] In some embodiments, the rule supplementation interface further includes a submit rule option. The rule supplementation unit is also configured to, in response to the first object's triggering operation on the submit rule option, save the supplementary text of the classification rules for the first obfuscated classification pair, and change the status of the first obfuscated classification pair entry in the directory area from an unprocessed state to a processed state.

[0017] In some embodiments, the classification unit is specifically configured to display an adjudication interface, the adjudication interface including an adjudication model list and an adjudication initiation option; receive a selection operation of a target adjudication model from the first object in the adjudication model list; and, in response to a trigger operation of the first object on the adjudication initiation option, determine the classification result of each of the P obfuscated data based on the standard text of the classification rules, the supplementary text of the classification rules, and the Q obfuscated classification pairs through the target adjudication model.

[0018] In some embodiments, the classification unit is further configured to display a classification result interface, which includes summary information of the classification results of the N data to be classified.

[0019] Thirdly, this application provides a data classification apparatus, comprising: A data input unit is used to display a data input interface and receive a classification rule standard text and N data to be classified input by a first object in the data input interface. The data to be classified includes at least one of text data, image data, video data, or audio data, and N is a positive integer. The obfuscated data determination unit is used to select P obfuscated data from the N data to be classified based on the classification rule standard text, and determine Q obfuscated classification pairs corresponding to the P obfuscated data, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P. A rule supplementation unit is used to display a rule supplementation interface and receive the classification rule supplementation text of the Q confusion classification pairs input by the first object in the rule supplementation interface, wherein the rule supplementation interface displays the Q confusion classification pairs. A classification unit is used to determine the classification result of the P confused data based on the supplementary text of the classification rules and the standard text of the classification rules.

[0020] Fourthly, this application provides a data classification apparatus, comprising: A request receiving unit is used to receive a classification request sent by a classification platform. The classification request includes M disputed data and a classification rule standard text. The classification request is used to request manual classification of the M disputed data. The M disputed data are selected from the N unclassified data based on K classification results for each unclassified data. The K classification results of the disputed data are not completely consistent. The K classification results are obtained by classifying the unclassified data using K first classification models based on the classification rule standard text. The K first classification models are selected by the first object from the list of classification models included in the data input interface. The unclassified data includes at least one of text data, image data, video data, or audio data. A display unit is used to display a classification interface, which includes a data display area and a classification area. The data display area displays the disputed data and the standard text of the classification rules. The classification result receiving unit is used to receive the manual classification result of each disputed data entered by the second object in the classification area; The sending unit is used to send the manual classification results of the M disputed data to the classification platform.

[0021] In some embodiments, the classification candidate list further includes a manual input option. The classification result receiving unit is also configured to display a classification input box in response to a triggering operation of the second object on the manual input option; and to receive the manual classification result of the first disputed data entered by the second object in the classification input box.

[0022] In some embodiments, the data display area includes the first disputed data and the classification rule standard corresponding to the first disputed data in the classification rule standard text. The classification interface also includes a next option. The classification result receiving unit is further configured to, in response to the second object's triggering operation on the next option, display the second disputed data and the classification rule standard corresponding to the second disputed data in the classification rule standard text in the data display area; and display a classification candidate list of the second disputed data in the classification area, the classification candidate list including different classification results among the K classification results of the second disputed data.

[0023] Fifthly, this application provides an electronic device including a processor and a memory. The memory is used to store a computer program, and the processor is used to invoke and run the computer program stored in the memory to perform the methods described in the first or second aspect above.

[0024] In a sixth aspect, a chip is provided for implementing the methods of various implementations of the first aspect described above. Specifically, the chip includes a processor for retrieving and running a computer program from a memory, causing a device equipped with the chip to perform the methods of the first or second aspect described above.

[0025] In a seventh aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to perform the methods described in the first or second aspect.

[0026] Eighthly, a computer program product is provided, including computer program instructions that cause a computer to perform the methods described in the first or second aspect.

[0027] Ninthly, a computer program is provided that, when run on a computer, causes the computer to perform the methods of the first or second aspect described above.

[0028] In summary, this application proposes a novel data classification method. A classification platform displays a data input interface and receives classification rule standard text and N unclassified data points input by a first object in the data input interface. The unclassified data in this embodiment includes at least one of text data, image data, video data, or audio data. Next, based on the classification rule standard text, the classification platform selects P obfuscated data points from the N unclassified data points and determines Q obfuscated classification pairs corresponding to these P obfuscated data points. Then, the classification platform displays a rule supplement interface showing these Q obfuscated classification pairs. The classification platform receives supplementary classification rule text for the Q obfuscated classification pairs input by the first object in the rule supplement interface, and then determines the classification result of the P obfuscated data points based on the supplementary classification rule text and the classification rule standard text. Therefore, this application embodiment develops a visual classification platform. A first object (such as a product operator or domain expert) can directly input the standard text of classification rules and the data to be classified on this platform. The platform can then automatically filter out obfuscated data from N data points based on the standard text of classification rules, and display a rule supplement interface based on the selected obfuscated data. The first object can input supplementary text of classification rules for obfuscated classification pairs in this supplementary interface, thus establishing clear distinction criteria for each specific obfuscated classification pair. In this way, the classification platform can accurately classify obfuscated data based on the supplementary text and the standard text of classification rules. 
That is, this application embodiment solves the problem of obfuscated data classification through a closed-loop approach of identifying obfuscation, supplementing rules, and accurate classification, improving the accuracy and consistency of data classification. For training sample annotation scenarios, the method of this application embodiment can provide higher-quality training samples for supervised machine learning models, enabling the finally trained model to have stronger robustness and generalization ability when handling edge cases and complex data, thereby improving model performance. For real-time classification scenarios, the method in this application embodiment can accurately classify text, images, and videos, improving the accuracy and efficiency of content classification and review. Furthermore, the method in this application embodiment allows the classification platform to automatically classify simple data within the data to be classified, with the first object only supplementing rules for a small amount of obfuscated data. This saves the first object a significant amount of time spent manually classifying and reviewing simple data, thereby improving data classification efficiency. Further, the classification platform in this application embodiment guides the first object (e.g., product operators or domain experts) through a data input interface and a rule supplementation interface, enabling them to complete the entire process from standard setting to rule optimization. This exposes hidden quality risks, transforming complex knowledge engineering into simple interactive operations and lowering the professional threshold.

Attached Figure Description

[0029] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0030] Figure 1 is a schematic diagram of the implementation environment of a data classification method provided in an embodiment of this application;
Figure 2 is a flowchart of a data classification method provided in an embodiment of this application;
Figures 3A and 3B are schematic diagrams of the data input interface;
Figures 4 and 5 are further schematic diagrams of the data input interface;
Figure 6 is a schematic diagram of the first object selecting the first classification models;
Figure 7 is a schematic diagram of classifying data with K first classification models;
Figure 8 is a schematic diagram of a consistency analysis interface;
Figures 9A and 9B are schematic diagrams of the connection between the classification platform and the manual classification terminal;
Figures 10, 11, and 12 are schematic diagrams of the classification interface;
Figure 13 is a schematic diagram of a cross-validation interface;
Figure 14 is a schematic diagram of predicting the classification result of disputed data with the second classification model;
Figure 15 is another schematic diagram of the cross-validation interface;
Figure 16 is a schematic diagram of triggering the confusion analysis result interface;
Figure 17 is a schematic diagram of the rule supplementation interface;
Figure 18 is another schematic diagram of the rule supplementation interface;
Figure 19 is a schematic diagram of the adjudication interface;
Figure 20 is another schematic diagram of the adjudication interface;
Figure 21 is a schematic diagram of the classification result interface;
Figure 22 is another schematic diagram of the classification result interface;
Figure 23 is a flowchart of a data classification method provided in an embodiment of this application;
Figure 24 is a flowchart of a data classification method provided in an embodiment of this application;
Figure 25 is a schematic block diagram of a data classification apparatus provided in an embodiment of this application;
Figure 26 is a schematic block diagram of a data classification apparatus provided in an embodiment of this application;
Figure 27 is a schematic block diagram of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0031] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0032] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. In embodiments of the invention, "B corresponding to A" means that B is associated with A. In one implementation, B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices. In the description of this application, unless otherwise stated, "a plurality of" means two or more.

[0033] The data classification method provided in this application can be applied to various fields such as data classification and sample labeling to improve the accuracy and efficiency of data classification.

[0034] To facilitate understanding of the embodiments of this application, the relevant concepts involved are first introduced. LLM (Large Language Model): a deep learning-based generative natural language processing model that learns the statistical patterns and knowledge of language through pre-training on massive amounts of text data, enabling it to understand input instructions and generate answers accordingly.

[0035] Prompt: A large model instruction, described in natural language, outlining the task the large model is expected to perform, along with auxiliary information provided to the large model. The large model then follows the instructions to generate the corresponding answer.

[0036] SFT (Supervised Fine-Tuning): By providing appropriate prompts and corresponding responses, the parameters of a large model are trained and then fine-tuned to obtain a large model that performs better on the specified data.

[0037] Token sequence: a core concept in natural language processing (NLP) and large language models (LLMs), referring to the numerical sequence into which text is converted, where each token corresponds to a unit such as a word, character, or subword. Simply put, it is a standardized data format that the model can "understand" and "process." In this embodiment, when generating a classification label for data to be classified, the model may output multiple tokens one by one. For example, the classification label "Finance / Business Company" may be decomposed into multiple tokens: ["Finance", " / ", "Business", "Company"].

[0038] In the field of machine learning, especially in supervised learning tasks, high-quality training samples are a key factor determining the upper limit of model performance. Data annotation, as a core step in transforming raw data into a standard format usable for model training, directly affects the accuracy and generalization ability of the final model. With the widespread application of artificial intelligence technology in various industries, the demand for data annotation is growing exponentially, covering various task types such as text classification, image recognition, object detection, and speech recognition.

[0039] Current data classification methods suffer from poor accuracy and low efficiency. For example, in model training scenarios, to obtain high-quality training samples, annotators typically label massive amounts of data line by line according to classification standard documents. This process mainly involves: first, domain experts or product operations personnel develop classification rule standard documents; then, annotators are trained to understand and memorize these classification rules; next, annotators perform large-scale manual annotation on the massive dataset; and finally, the classification results are checked and corrected for consistency through sampling quality control or cross-validation. However, current manual annotation methods are prone to subjective bias due to individual differences in the understanding of classification rule standards among different annotators, even with standardized training. For data with ambiguous boundaries or complex semantics, different annotators may provide different classification results, leading to low annotation accuracy. Furthermore, current methods require significant investment of manpower and time. As the amount of data to be labeled increases, the annotation cycle lengthens significantly, resulting in low efficiency and difficulty in meeting the training needs of rapidly iterating machine learning models.

[0040] To address the aforementioned technical problems, this application proposes a novel data classification method. First, a classification platform displays a data input interface and receives the classification rule standard text and N pieces of data to be classified that a first object inputs in this interface. The data to be classified in this embodiment includes at least one of text data, image data, video data, or audio data. Next, based on the classification rule standard text, the classification platform selects P pieces of confusing data from the N pieces of data to be classified and determines Q confusing classification pairs corresponding to these P pieces of confusing data. Then, the classification platform displays a rule supplement interface showing these Q confusing classification pairs. The classification platform receives the classification rule supplementary text that the first object inputs for the Q confusing classification pairs in the rule supplement interface, and then determines the classification results of the P pieces of confusing data based on the classification rule supplementary text and the classification rule standard text. In other words, this application embodiment provides a visual classification platform. A first object (such as a product operator or domain expert) can directly input the classification rule standard text and the data to be classified on this platform. The platform then automatically filters out the confusing data from the N pieces of data based on the classification rule standard text, and displays a rule supplement interface based on the selected confusing data. The first object can input classification rule supplementary text for the confusing classification pairs in this interface, thus establishing clear distinction criteria for each specific confusing classification pair. In this way, the classification platform can accurately classify the confusing data based on the supplementary text and the standard text. That is, this application embodiment solves the problem of classifying confusing data through a closed loop of identifying confusion, supplementing rules, and classifying accurately, which improves the accuracy and consistency of data classification. For training sample annotation scenarios, the method of this application embodiment can provide higher-quality training samples for supervised machine learning models, so that the finally trained model is more robust and generalizes better when handling edge cases and complex data, thereby improving model performance. For real-time classification scenarios, the method of this application embodiment can accurately classify text, images, and videos, improving the accuracy and efficiency of content classification and review. Furthermore, the method allows the classification platform to automatically classify the simple data within the data to be classified, with the first object only supplementing rules for a small amount of confusing data. This saves the first object a significant amount of time otherwise spent manually classifying and reviewing simple data, thereby improving data classification efficiency. Further, the classification platform guides the first object (e.g., a product operator or domain expert) through the data input interface and the rule supplement interface, enabling them to complete the entire process from standard setting to rule optimization. This exposes hidden quality risks, transforms complex knowledge engineering into simple interactive operations, and lowers the professional threshold.
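The closed loop described above (identify confusion, supplement rules, classify) can be outlined in a short sketch. All function names and signatures here are hypothetical, and the toy selection, supplement, and classification functions are keyword-based stand-ins rather than the models the application contemplates.

```python
from typing import Callable

Pair = tuple[str, str]


def run_closed_loop(
    standard_text: str,
    items: list[str],
    select_confusing: Callable[[str, list[str]], dict[str, Pair]],
    ask_supplement: Callable[[set[Pair]], dict[Pair, str]],
    classify: Callable[[str, str], str],
) -> dict[str, str]:
    # Step 1: find confusing items and the category pair each one sits between.
    confusing = select_confusing(standard_text, items)
    # Step 2: the first object supplies one supplementary rule per distinct pair.
    supplements = ask_supplement(set(confusing.values()))
    # Step 3: classify; confusing items additionally see the supplementary text.
    results = {}
    for item in items:
        rules = standard_text
        if item in confusing:
            rules += "\n" + supplements[confusing[item]]
        results[item] = classify(rules, item)
    return results


# --- Toy stand-ins, for illustration only ---
NBA_CBA: Pair = ("Sports-CBA", "Sports-NBA")


def toy_select(standard_text: str, items: list[str]) -> dict[str, Pair]:
    return {i: NBA_CBA for i in items if "NBA" in i and "CBA" in i}


def toy_ask(pairs: set[Pair]) -> dict[Pair, str]:
    return {p: "When both leagues appear, use the destination league." for p in pairs}


def toy_classify(rules: str, item: str) -> str:
    if "destination league" in rules and item.endswith("to NBA"):
        return "Sports-NBA"
    return "Sports-CBA" if "CBA" in item else "Sports-NBA"


STANDARD = "Sports-NBA: news about the NBA. Sports-CBA: news about the CBA."
items = [
    "Lakers win the NBA finals",
    "Guangdong wins the CBA title",
    "Star guard moves from CBA to NBA",
]
results = run_closed_loop(STANDARD, items, toy_select, toy_ask, toy_classify)
print(results)
```

In this toy run the third item would be labeled "Sports-CBA" under the standard text alone; only with the supplementary rule does it resolve to "Sports-NBA", which is the effect the closed loop is meant to achieve.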

[0041] The data classification method provided in this application can improve the accuracy and efficiency of data classification, and can be applied to at least the following scenarios. Scenario 1, the classification scenario: the method of this application embodiment is applied to classification platforms that require expert knowledge intervention, such as content review and classification platforms. In this case, the classification platform in this application embodiment can be understood as a content review and classification platform. For example, social media, short video platforms, and news clients need to classify (e.g., entertainment, technology) and conduct security reviews (e.g., illegal content, advertisements) on massive amounts of text, images, and videos. In Scenario 1, the method of this application embodiment can be used to accurately identify confusing data and thereby determine confusing classification pairs, that is, pairs of categories that are easily confused with each other, such as "Sports-NBA" and "Sports-CBA," "mold" and "technology," or "sarcasm" and "normal comments." After product operators or domain experts supplement the classification rules for these confusing classification pairs, the platform can automatically, efficiently, and accurately complete the review and classification of all content, greatly reducing the pressure of manual review and the risk of missed judgments. In other words, in Scenario 1, when the platform encounters new, highly confusing data, it can trigger expert intervention through the rule supplement interface. The rules supplemented by the experts update the classification logic in real time, continuously improving the platform's ability to classify unknown and complex data and allowing its classification capability to evolve on its own. 
For example, in text classification, texts (such as news, comments, and product descriptions) are divided into predefined categories (such as sports, technology, and negative reviews), including news categorization, sentiment analysis, and spam detection. As another example, in recommendation systems, user-generated content (UGC) or product descriptions are categorized, and key tags (such as "electronic products / mobile phones") are extracted to help the recommendation system accurately match user interests. Thus, in the field of cold start, the method of this application embodiment can achieve rapid categorization of new content, solving the problem of new items / users lacking historical behavior in recommendation systems. In the field of personalized ranking, the method of this application embodiment can combine category tags (such as user preference "technology news") to optimize the ranking strategy of the recommendation list.

[0042] Scenario 2, the training sample annotation scenario: the method of this application embodiment can be used to automatically and accurately classify (annotate) large-scale training samples through human-machine collaboration, providing a high-quality dataset for model training. In one example, training a visual model for autonomous vehicles requires accurate identification of various objects on the road (such as pedestrians, vehicles, and traffic signs), and intelligent security requires judging abnormal behavior in detected videos. Traditional annotation is prone to errors for small targets at a distance, partially occluded objects, and images under special lighting conditions. Using the method of this application embodiment, these "confusing samples" (e.g., whether an object is a "plastic bag" or a "small animal") can be automatically filtered out, and experts can supplement the recognition rules for specific scenarios (such as judgment based on motion trajectory), thereby training a visual model with higher recognition accuracy and reliability. In another example, an AI model is trained on medical images (such as CT, MRI, and X-ray images) to assist doctors in detecting lesions (such as tumors and nodules). Early benign nodules and malignant tumors may show subtle, easily confused features on images. The method described in this application can filter out highly confusing image sample pairs from massive amounts of images, and senior radiology experts can then supplement the differentiation rules (for example, based on edge morphology or calcification distribution). This not only generates high-quality labeled data, but also transforms the experts' diagnostic experience into quantifiable model parameters, improving the accuracy and reliability of AI-assisted diagnosis.

[0043] The implementation environment of the embodiments of this application is described below.

[0044] Figure 1 is a schematic diagram of the implementation environment of a data classification method provided in this application embodiment. As shown in Figure 1, the implementation environment includes a classification platform 101.

[0045] In some embodiments, as shown in Figure 1, the classification platform 101 includes a terminal device 101-a and a server 101-b, wherein the terminal device 101-a and the server 101-b are connected by wired or wireless means.

[0046] For example, the terminal device 101-a can display the operable interface of the classification platform, and the server 101-b can be understood as the server side or backend of the classification platform. The terminal device 101-a is used to interact with the first object, receive data and instructions input by the first object, and display the data processing results to the first object. The server 101-b is used to execute the specific data processing steps.

[0047] In some embodiments, the method of this application embodiment can be performed collaboratively by terminal device 101-a and server 101-b. For example, a first object can start the classification platform through terminal device 101-a, which displays a data input interface. The first object can input the classification rule standard text and N pieces of data to be classified in the data input interface. For example, the data input interface includes a rule input box and a data input box; the first object inputs the classification rule standard text in the rule input box and the N pieces of data to be classified in the data input box. These N pieces of data to be classified can be at least one of text data, image data, video data, or audio data. Then, terminal device 101-a sends the classification rule standard text and the N pieces of data to be classified to server 101-b. Based on the classification rule standard text, server 101-b selects P pieces of confusing data from the N pieces of data to be classified and determines Q confusing classification pairs corresponding to these P pieces of confusing data, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P. Server 101-b sends the Q confusing classification pairs to terminal device 101-a. Terminal device 101-a displays a rule supplement interface, which shows the Q confusing classification pairs. The first object enters the classification rule supplementary text for these Q confusing classification pairs in the rule supplement interface. Terminal device 101-a sends this supplementary text to server 101-b. Server 101-b determines the classification results of the P pieces of confusing data based on the classification rule supplementary text and the classification rule standard text. Optionally, server 101-b sends the determined classification results of the P pieces of confusing data to terminal device 101-a, and terminal device 101-a displays them.
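The two round trips between terminal device 101-a and server 101-b can be pictured as four messages. The sketch below is only one possible encoding: the message names, field names, and the JSON envelope are all assumptions, since the application does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClassifyRequest:        # terminal -> server: standard text and N items
    standard_text: str
    items: list[str]


@dataclass
class ConfusingPairsReply:    # server -> terminal: P confusing items, Q pairs
    confusing_items: list[str]
    pairs: list[list[str]]


@dataclass
class SupplementRequest:      # terminal -> server: supplementary text per pair
    supplements: dict[str, str]   # hypothetical key format: "CategoryA|CategoryB"


@dataclass
class ResultReply:            # server -> terminal: label for each confusing item
    results: dict[str, str]


_MESSAGE_TYPES = {
    cls.__name__: cls
    for cls in (ClassifyRequest, ConfusingPairsReply, SupplementRequest, ResultReply)
}


def to_wire(msg) -> str:
    """Serialize a message into a JSON envelope tagged with its type name."""
    return json.dumps({"type": type(msg).__name__, "body": asdict(msg)})


def from_wire(raw: str):
    """Rebuild the message object from its JSON envelope."""
    envelope = json.loads(raw)
    return _MESSAGE_TYPES[envelope["type"]](**envelope["body"])


request = ClassifyRequest("Sports-NBA: ...", ["text 1", "text 2"])
reply = ConfusingPairsReply(["text 2"], [["Sports-CBA", "Sports-NBA"]])
assert from_wire(to_wire(request)) == request
assert from_wire(to_wire(reply)) == reply
```

Pairs are carried as lists rather than tuples because JSON has no tuple type; this keeps a decoded message equal to the one that was sent.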

[0048] In some embodiments, the method of this application embodiment can be performed independently by terminal device 101-a. For example, a first object can start the classification platform through terminal device 101-a, which displays a data input interface. The first object can input the classification rule standard text and N pieces of data to be classified in the data input interface. For example, the data input interface includes a rule input box and a data input box; the first object inputs the classification rule standard text in the rule input box and the N pieces of data to be classified in the data input box. These N pieces of data to be classified can be at least one of text data, image data, video data, or audio data. Next, based on the classification rule standard text, terminal device 101-a selects P pieces of confusing data from the N pieces of data to be classified and determines Q confusing classification pairs corresponding to these P pieces of confusing data, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P. Then, terminal device 101-a displays a rule supplement interface, which shows the Q confusing classification pairs. The first object inputs the classification rule supplementary text for these Q confusing classification pairs in the rule supplement interface. Terminal device 101-a determines the classification results of the P pieces of confusing data based on the classification rule supplementary text and the classification rule standard text. Optionally, terminal device 101-a displays the classification results of these P pieces of confusing data.

[0049] Therefore, this application embodiment provides a visual classification platform 101. A first object (such as a product operator or domain expert) can directly input the classification rule standard text and the data to be classified on this platform. The platform then automatically filters out the confusing data from the N pieces of data based on the classification rule standard text, and displays a rule supplement interface based on the selected confusing data. The first object can input classification rule supplementary text for the confusing classification pairs in this interface, thus establishing clear distinction criteria for each specific confusing classification pair. In this way, the classification platform can accurately classify the confusing data based on the supplementary text and the standard text. That is, this application embodiment solves the problem of classifying confusing data through a closed loop of identifying confusion, supplementing rules, and classifying accurately, which improves the accuracy and consistency of data classification. For training sample annotation scenarios, the method of this application embodiment can provide higher-quality training samples for supervised machine learning models, so that the finally trained model is more robust and generalizes better when handling edge cases and complex data, thereby improving model performance. For real-time classification scenarios, the method of this application embodiment can accurately classify text, images, and videos, improving the accuracy and efficiency of content classification and review. Furthermore, the method allows the classification platform to automatically classify the simple data within the data to be classified, with the first object only supplementing rules for a small amount of confusing data. This saves the first object a significant amount of time otherwise spent manually classifying and reviewing simple data, thereby improving data classification efficiency. Further, the classification platform guides the first object (e.g., a product operator or domain expert) through the data input interface and the rule supplement interface, enabling them to complete the entire process from standard setting to rule optimization. This exposes hidden quality risks, transforms complex knowledge engineering into simple interactive operations, and lowers the professional threshold.

[0050] In some embodiments, the terminal device 101-a described above includes, but is not limited to, desktop computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices may be devices equipped with cameras and display devices, such as smart speakers, smart TVs, smart air conditioners, and smart in-vehicle systems. Portable wearable devices may be devices equipped with cameras and display devices, such as smartwatches, smart bracelets, and head-mounted devices. Terminal devices are typically equipped with a display device, which may be a monitor, a display screen, a touchscreen, a touch panel, or the like.

[0051] In some embodiments, the server 101-b described above can be one or more servers. When there are multiple servers, at least two servers are used to provide different services, and / or at least two servers are used to provide the same service, such as providing the same service in a load-balanced manner. This application embodiment does not limit this. The server described above can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The server can also become a node in a blockchain.

[0052] It should be noted that the implementation environment of this application embodiment includes, but is not limited to, that shown in Figure 1.

[0053] The technical solutions of the embodiments of this application will be described in detail below through some examples. The following embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0054] Figure 2 is a schematic flowchart of a data classification method provided in an embodiment of this application.

[0055] As shown in Figure 2, the data classification method of this application embodiment includes the following steps: S101. Display the data input interface, and receive the classification rule standard text and the N pieces of data to be classified that the first object inputs in the data input interface.

[0056] The data to be classified includes at least one of text data, image data, video data, or audio data, where N is a positive integer greater than 1.

[0057] It should be noted that the data involved in the execution of the embodiments of this application, as well as the methods of obtaining such data, comply with relevant laws and regulations.

[0058] As described above, the data classification method provided in this application can be applied to classification scenarios to perform content classification (such as entertainment and technology) and security review (such as illegal content and advertisements) on massive amounts of text, images, and videos. It can also be applied to training sample annotation scenarios to achieve accurate annotation of large-scale training samples, providing a high-quality dataset for model training.

[0059] In this embodiment of the application, a single item to be classified is referred to as one piece of data to be classified. For example, one piece of text to be classified, one image to be classified, one video segment to be classified, or one audio segment to be classified is each referred to as one piece of data to be classified.

[0060] The data to be classified in this application embodiment can also be referred to as a sample to be classified, or data to be processed. This application embodiment does not limit the specific data type of the data to be classified. For example, the data to be classified includes at least one of text data, image data, video data, or audio data.

[0061] For example, in a classification scenario, the data to be classified can be understood as any data waiting to be classified, such as text data, image data, video data, audio data, etc.

[0062] For example, in the training sample annotation scenario, the data to be classified can be understood as any training sample in the training set that needs to be labeled, such as text data, image data, video data, or audio data, etc.

[0063] It should be noted that the data types of the N data to be classified in this application embodiment can be the same or different, and this application embodiment does not impose any restrictions on this.

[0064] In one example, the N data to be classified are either all text data, all image data, all video data, or all audio data.

[0065] In another example, the N data to be classified may include multiple different types of data, such as at least two types of data including text data, video data, image data, audio data, etc.

[0066] For example, the N data to be classified in this application embodiment can be 100,000 text or video descriptions.

[0067] The classification rule standard text in this application embodiment can be understood as an initial set of rules formulated by the first object based on business needs to guide data classification. For example, it includes category names (e.g., "news," "social," "advertising") and category descriptions; for instance, "financial category" is defined as "text involving funds, exchange rates, company financial reports, etc."
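As one way to picture the standard text, the sketch below models it as a list of category names with descriptions and renders it into a prompt for a classification model. The `Category` type, the rendering layout, and the prompt wording are assumptions for illustration; the description given for "advertising" is invented, while the "financial category" description is taken from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str
    description: str


def render_standard_text(categories: list[Category]) -> str:
    """Render one 'name: description' line per category."""
    return "\n".join(f"- {c.name}: {c.description}" for c in categories)


def build_prompt(standard_text: str, item: str) -> str:
    """Combine the rule text and one piece of data to be classified into a prompt."""
    return (
        "Classify the item into exactly one category.\n"
        f"Rules:\n{standard_text}\n"
        f"Item: {item}\n"
        "Category:"
    )


categories = [
    Category(
        "financial category",
        "text involving funds, exchange rates, company financial reports, etc.",
    ),
    # Hypothetical description, not from the source.
    Category("advertising", "promotional text for a product or service"),
]
standard_text = render_standard_text(categories)
prompt = build_prompt(standard_text, "The central bank adjusted the exchange rate today.")
print(prompt)
```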

[0068] The classification platform in this application embodiment includes a terminal device, with which a first object can interact directly. The first object can be understood as a person who is familiar with the classification rules, such as the creator of the classification rule standard text, and who can supplement the relevant classification rules. For example, the first object can be a product operator or a domain expert.

[0069] In this embodiment, the first object can access the classification platform through a terminal device. This embodiment does not limit the specific type of the classification platform. In one example, in a classification scenario, the classification platform in this embodiment can be a content moderation and classification platform, such as a social media platform, video platform, or news information platform. In one example, in a training sample annotation scenario, the classification platform can be a sample annotation platform or a model training platform.

[0070] This application embodiment does not impose restrictions on the triggering conditions for the terminal device to display the data input interface.

[0071] In one possible implementation, a client for the classification platform is installed on the terminal device, and the first object triggers the client. In response to the first object's triggering operation on the client, the terminal device displays a data input interface.

[0072] In one possible implementation, the classification platform exists as a website service, eliminating the need for dedicated client software on the terminal device. The first object accesses the platform's Uniform Resource Locator (URL) through a mainstream browser on the terminal device (such as a personal computer, laptop, or tablet) and, after authentication, enters the user interface. For example, when the first object enters the classification platform's URL in the browser, the terminal device responds by displaying a login interface that includes a username input field and a password input field. The first object enters a username in the username field and a password in the password field. After the username and password are verified as correct, the terminal device displays the data input interface.

[0073] In this embodiment of the application, the first object can input the standard text of the classification rules and N data to be classified in the data input interface.

[0074] This application embodiment does not limit the specific layout of the data input interface, as long as the first object can input the standard text of the classification rules and N data to be classified. For example, the data input interface includes an input box, in which the first object can input the standard text of the classification rules and N data to be classified.

[0075] In some embodiments, as shown in Figure 3A, the data input interface in this embodiment includes a classification rule input box and a data input box.

[0076] This application embodiment does not limit the specific method by which the first object inputs the classification rule standard text in the classification rule input box. In one method, the first object can drag and drop the pre-defined classification rule standard text into the classification rule input box to complete the input. In another method, the first object triggers the classification rule input box, and the terminal device responds to this triggering operation by displaying a file list. The first object selects the classification rule standard text in the file list and clicks the upload option. The terminal device responds to the first object's triggering operation on the upload option by uploading the selected classification rule standard text. After a successful upload, as shown in Figure 3B, the text "Uploaded" or "Uploaded successfully" is displayed in the classification rule input box.

[0077] This application embodiment does not limit the specific method by which the first object inputs the data to be classified into the data input box. In one method, the first object can drag and drop the dataset to be classified (which includes the N pieces of data to be classified) into the data input box to complete the input. In another method, the first object triggers the data input box, and the terminal device responds to this triggering operation by displaying a file list. The first object selects the dataset to be classified in the file list and clicks the upload option. The terminal device responds to the first object's triggering operation on the upload option by uploading the selected dataset. After a successful upload, as shown in Figure 3B, the data input box displays the message "Uploaded" or "Uploaded successfully".

[0078] In some embodiments, the first object can also input the classification rule standard text and N data to be classified in the data input interface via voice control.

[0079] It should be noted that the classification rule input box and data input box shown in Figure 3A above are merely examples. This application does not limit the specific location of the classification rule input box and data input box within the data input interface.

[0080] In one example, the classification rule input box can be located above the data input box.

[0081] In one example, the classification rule input box can be located below the data input box.

[0082] In one example, the classification rule input box can be located to the left of the data input box.

[0083] In one example, the classification rule input box can be located to the right of the data input box.

[0084] In this embodiment of the application, after the classification platform receives the classification rule standard text and N data to be classified from the input of the first object, it executes the following step S102.

[0085] S102. Based on the standard text of classification rules, select P confusing data from N data to be classified, and determine Q confusing classification pairs corresponding to the P confusing data.

[0086] Where P is a positive integer less than or equal to N, and Q is a positive integer less than or equal to P.
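The relationships among N, P, and Q can be written as a small sanity check. The function name is hypothetical; the bounds are the ones stated in the text (N greater than 1, P at most N, Q at most P, all positive integers).

```python
def check_counts(n: int, p: int, q: int) -> None:
    """Raise ValueError unless N > 1, 1 <= P <= N, and 1 <= Q <= P."""
    if n <= 1:
        raise ValueError(f"N must be greater than 1, got {n}")
    if not 1 <= p <= n:
        raise ValueError(f"P must satisfy 1 <= P <= N, got P={p}, N={n}")
    if not 1 <= q <= p:
        raise ValueError(f"Q must satisfy 1 <= Q <= P, got Q={q}, P={p}")


check_counts(n=100_000, p=1_200, q=15)  # passes silently
```

Q cannot exceed P because each piece of confusing data contributes one confusing classification pair, and several pieces of data may share the same pair.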

[0087] This application embodiment does not impose restrictions on the triggering conditions for the classification platform to perform the above S102 step.

[0088] In one possible implementation, when the classification platform detects that the first object has successfully entered the classification rule standard text and N data to be classified in the data input interface, it automatically executes the steps of S102 above.

[0089] In one possible implementation, as shown in Figure 4, the data input interface in this embodiment of the application also includes a startup option. After successfully uploading the classification rule standard text and the N pieces of data to be classified, the first object triggers this startup option. In response to the first object's triggering of this startup option, the classification platform executes step S102 described above.

[0090] In some embodiments, the steps of S102 described above can be performed by a terminal device.

[0091] In some embodiments, the classification platform of this application further includes a server, in which case the steps of S102 described above can be performed by the server. In this case, the terminal device needs to send the classification rule standard text entered by the first object in the input interface and N pieces of data to be classified to the server.

[0092] The following section uses the classification platform as the executing entity as an example to introduce the specific implementation process of the above S102.

[0093] In this embodiment of the application, after the classification platform obtains the classification rule standard text and the N pieces of data to be classified input by the first object, it selects confusing data from the N pieces of data to be classified based on the classification rule standard text, for example, selecting P pieces of confusing data.

[0094] This application does not limit the specific method by which the classification platform selects P confusing data from N unclassified data in the embodiments.

[0095] In some embodiments, the classification platform can classify the N unclassified data using a pre-trained classification model, and then filter out P confusing data based on the confidence level of the classification results. For example, for each of the N unclassified data, the classification platform inputs the unclassified data and the standard text of the classification rules into the classification model. The classification model classifies the unclassified data based on the standard text of the classification rules, outputs the classification result of the unclassified data, and outputs the confidence level of the classification model for the classification result. In this way, the classification platform can select unclassified data with a confidence level lower than a preset value from the N unclassified data based on the confidence level of the classification result of each unclassified data in the classification model output, and record them as confusing data, thus obtaining P confusing data.
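The confidence-based filter can be sketched as follows. The classifier is passed in as a callable returning a label and a confidence; that signature, the threshold of 0.7, and the lookup-table stand-in classifier are all assumptions for illustration, since the application only speaks of a "preset value".

```python
from typing import Callable

# Hypothetical signature: (standard_text, item) -> (label, confidence in [0, 1])
Classifier = Callable[[str, str], tuple[str, float]]


def select_by_confidence(
    standard_text: str,
    items: list[str],
    classify: Classifier,
    threshold: float = 0.7,
) -> list[str]:
    """Return the items whose classification confidence is below the threshold."""
    confusing = []
    for item in items:
        _label, confidence = classify(standard_text, item)
        if confidence < threshold:
            confusing.append(item)
    return confusing


# Stand-in for a pre-trained classification model: a fixed lookup table.
FAKE_OUTPUTS = {
    "Lakers win the NBA finals": ("Sports-NBA", 0.95),
    "Guangdong wins the CBA title": ("Sports-CBA", 0.91),
    "Star guard moves from CBA to NBA": ("Sports-NBA", 0.55),
}


def fake_classifier(standard_text: str, item: str) -> tuple[str, float]:
    return FAKE_OUTPUTS[item]


confusing_items = select_by_confidence("(standard text)", list(FAKE_OUTPUTS), fake_classifier)
print(confusing_items)  # ['Star guard moves from CBA to NBA']
```

Here N = 3 and P = 1: only the item whose confidence falls below the preset value is recorded as confusing data.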

[0096] In some embodiments, the classification platform may select P confusing data from the N data to be classified through the following steps S102-A to S102-C:
S102-A: In response to a selection operation by the first object on K first classification models in the classification model list, change the state of the K first classification models from the unselected state to the selected state;
S102-B: In response to a triggering operation by the first object on the parallel classification start option, classify the N data to be classified based on the classification rule standard text through each of the K first classification models, and obtain K classification results for each data to be classified, where K is a positive integer greater than 1;
S102-C: Based on the K classification results of each data to be classified, select P confusing data from the N data to be classified.

[0097] In this implementation, as shown in Figure 5, the data input interface further includes a classification model list and a parallel classification start option. The classification model list includes multiple different first classification models, each of which is a pre-trained classification model capable of classifying data. This embodiment does not limit the specific types of the first classification models included in the classification model list; they can be any models with data classification capabilities.

[0098] The first object can select K first classification models from the classification model list. Here, K is a positive integer greater than 1; that is, the first object selects at least 2 first classification models.

[0099] In one example, a constraint condition for K is also displayed near the classification model list position in the data input interface. For example, as shown in Figure 5, "Select classification model (≥3)" is displayed above the classification model list position, that is, K is a positive integer greater than or equal to 3.

[0100] In this embodiment of the application, the first object selects K first classification models from the classification model list. For example, as shown in Figure 6, after the first object selects 3 first classification models (for example, 3 out of 4 first classification models), the terminal device responds to the first object's selection operation on the K first classification models by changing the state of these K first classification models from unselected to selected. As shown in Figure 6, the state of the three first classification models selected by the first object changes from the unselected state (i.e., the circular option box does not display a dot) to the selected state (i.e., the circular option box displays a dot).

[0101] It should be noted that the embodiments of this application do not limit the specific display style of the classification model list in the data input interface. In addition to the horizontal list format shown in Figures 5 and 6, the list can also be displayed as a drop-down list or in other possible ways.

[0102] The embodiments of this application do not restrict the specific layout of the classification model list, rule input box, and data input box in the data input interface. For example, as shown in Figure 5 and Figure 6, the classification model list can be located above the rule input box and the data input box. Optionally, the classification model list can also be located below the rule input box and the data input box, or in other locations on the data input interface.

[0103] In the embodiments of this application, as shown in Figure 6, the first object enters the standard text of the classification rules in the rule input box, enters N data points to be classified in the data input box, selects K first classification models in the classification model list, and then triggers the parallel classification start option. In response to the first object's triggering of the parallel classification start option, the classification platform classifies the N data points based on the standard text of the classification rules using each of the K first classification models, obtaining K classification results for each data point, where K is a positive integer greater than 1.

[0104] In some embodiments, if the terminal device has the above-mentioned K first classification models deployed locally, the terminal device responds to the triggering operation of the first object to the parallel classification start option, calls each of the K first classification models deployed locally, and performs classification processing on each of the N data to be classified based on the classification rule standard text, to obtain K classification results for each data to be classified.

[0105] In some embodiments, if the terminal device does not have the aforementioned K first classification models deployed locally, the terminal device, in response to the first object's triggering operation for the parallel classification initiation option, sends a request message to the server. This request message includes the classification rule standard text uploaded by the first object, N data points to be classified, and the identification information of the selected K first classification models. The server then parses the request message to obtain the identification information of the K first classification models, and based on this identification information, invokes the K first classification models to classify the N data points according to the classification rule standard text, obtaining K classification results for each of the N data points.
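As an illustration of the request message described above, the following sketch serializes the three pieces of information the terminal device sends and shows how the server side could recover the model identifiers. The class name, field names, and the use of JSON are assumptions for illustration; the source does not specify a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ParallelClassificationRequest:
    # All field names are hypothetical; the source only lists the contents.
    rule_standard_text: str       # classification rule standard text
    data_to_classify: List[str]   # the N data to be classified
    model_ids: List[str]          # identification information of the K models

    def to_json(self) -> str:
        """Serialize the request for sending to the server."""
        return json.dumps(asdict(self), ensure_ascii=False)


def parse_model_ids(message: str) -> List[str]:
    """Server side: parse the request and return the K model identifiers."""
    return json.loads(message)["model_ids"]


request = ParallelClassificationRequest(
    rule_standard_text="technology: content involving flight technology",
    data_to_classify=["d1", "d2"],
    model_ids=["model_a", "model_b", "model_c"],
)
model_ids = parse_model_ids(request.to_json())
```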

[0106] In the embodiments of this application, as shown in Figure 7, for each of the N unclassified data points, the classification platform can input the unclassified data and the standard text of the classification rules into the K first classification models (e.g., first classification model A, first classification model B, and first classification model C). Each of these K first classification models classifies the unclassified data based on the standard text of the classification rules and outputs a classification result. Thus, K classification results (e.g., classification result 1, classification result 2, and classification result 3) are obtained for the unclassified data.
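The per-item fan-out to K models can be sketched as below. The `FirstModel` interface and the three toy models are hypothetical stand-ins for pre-trained classifiers, and the loop runs the models one after another for simplicity; a real platform might dispatch them concurrently.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: (data, rule_text) -> predicted label.
FirstModel = Callable[[str, str], str]


def classify_with_k_models(
    items: List[str], rule_text: str, models: List[FirstModel]
) -> Dict[str, List[str]]:
    """Map each item to its K classification results, in model order."""
    return {item: [model(item, rule_text) for model in models] for item in items}


# Toy stand-ins for first classification models A, B, and C.
def model_a(item: str, rule_text: str) -> str:
    return "technology" if "flight" in item else "model"


def model_b(item: str, rule_text: str) -> str:
    return "technology"


def model_c(item: str, rule_text: str) -> str:
    return "model"


results = classify_with_k_models(
    ["flight demo"], "rules", [model_a, model_b, model_c]
)
```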

[0107] Next, based on the K classification results of each of the N unclassified data, the classification platform selects P confusing data from these N unclassified data.

[0108] This application embodiment does not limit the specific method by which the classification platform selects P confusing data from the N unclassified data based on the K classification results of each of the N unclassified data.

[0109] In some embodiments, for each of the N unclassified data points, if all K classification results for that unclassified data point are different, then that unclassified data point is determined to be confusing data. In this way, P confusing data points can be selected from the N unclassified data points.

[0110] In some embodiments, the classification platform selects P confusing data from the N unclassified data based on the K classification results for each unclassified data through the following steps S102-C1 to S102-C5:
S102-C1: Based on the K classification results of each unclassified data, select M disputed data from the N unclassified data, where the K classification results of each disputed data are not completely consistent and M is a positive integer less than or equal to N;
S102-C2: Based on the M disputed data, display the consistency analysis interface, which includes an allocation option;
S102-C3: In response to a triggering operation by the first object on the allocation option, send a classification request to the manual classification terminal, where the classification request includes the M disputed data and the classification rule standard text and is used to request manual classification of the M disputed data;
S102-C4: Obtain the manual classification results of the M disputed data from the manual classification terminal;
S102-C5: Based on the manual classification results of the M disputed data, select P confusing data from the M disputed data.

[0111] In this implementation, the classification platform first performs a consistency analysis on the N unclassified data points based on the K classification results for each of the N unclassified data points, selecting M disputed data points from these N unclassified data points. The K classification results for disputed data points are not entirely consistent. For example, for each of the N unclassified data points, if the K classification results for that data point are not entirely consistent, then that data point is identified as disputed data. For instance, if K equals 3, the first classification model predicts the classification result of data point 1 as "model," the second classification model predicts the classification result as "technology," and the third classification model predicts the classification result as "model." The first and third classification models predict the same result, but the result is inconsistent with the second classification model; therefore, data point 1 is identified as disputed data. Following this method, M disputed data points can be selected from N unclassified data points.
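The consistency analysis above amounts to checking whether the K results for a data point are all identical. A minimal sketch, assuming the K results are available as a list of labels per data identifier (the function and variable names are illustrative):

```python
from typing import Dict, List, Tuple


def split_by_consistency(
    results: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    """Split data identifiers into (consistent, disputed).

    `results` maps a data identifier to its K predicted labels.
    """
    consistent, disputed = [], []
    for data_id, labels in results.items():
        if len(set(labels)) == 1:  # all K results agree
            consistent.append(data_id)
        else:                      # not entirely consistent
            disputed.append(data_id)
    return consistent, disputed


results = {
    "data 1": ["model", "technology", "model"],
    "data 2": ["technology", "technology", "technology"],
}
consistent, disputed = split_by_consistency(results)
```

With the example from the text, data 1 is disputed because the second model disagrees with the other two, while data 2 is consistent.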

[0112] Next, as shown in Figure 8, the terminal device in the classification platform displays a consistency analysis interface based on the M disputed data points. This consistency analysis interface includes an allocation option.

[0113] In some embodiments, as shown in Figure 8, the consistency analysis interface also includes the aforementioned consistency analysis results. Specifically, the classification platform performs consistency analysis on the N unclassified data based on the K classification results for each of the N unclassified data. If all K classification results for a given data point are consistent, that data point is identified as having consistent classification results. If the K classification results for that data point are not entirely consistent, that data point is identified as having inconsistent classification results. This process identifies N-M unclassified data points with consistent classification results and M disputed data points with inconsistent classification results. For example, as shown in Figure 8, unclassified data whose K classification results are consistent can be marked with the status "consistent", and unclassified data whose K classification results are inconsistent can be marked with the status "disputed". In this way, the first object can use the consistency analysis interface to view the consistency of the classification results of the K first classification models for the N unclassified data.

[0114] In some embodiments, as shown in Figure 8, the consistency analysis interface in this embodiment also includes an export consistent data option. When the N-M unclassified data with consistent classification results need to be exported, the first object can trigger the export consistent data option. In response to the first object's triggering operation on the export consistent data option, the terminal device exports the classification result of each of the N-M unclassified data locally to the terminal device. In this way, the classification results of N-M of the N unclassified data can be obtained.

[0115] In the embodiments of this application, as shown in Figure 8, the first object triggers the allocation option. In response to the first object's triggering operation on the allocation option, the classification platform sends a classification request to the manual classification terminal. The classification request includes the M disputed data and the classification rule standard text, and is used to request manual classification of the M disputed data.

[0116] In some scenarios, as shown in Figure 9A, the classification platform in this embodiment of the application is directly connected to the manual classification terminal, so that the classification platform can send the classification request directly to the manual classification terminal.

[0117] In some scenarios, as shown in Figure 9B, the classification platform in this embodiment of the application is communicatively connected to a manual platform, which in turn is communicatively connected to at least one manual classification terminal. Thus, the classification platform can first send the classification request to the manual platform, which then forwards the classification request to a manual classification terminal. In one example, the manual platform sends the classification request to a manual classification terminal that is currently idle (or has a low workload). In another example, to improve the efficiency of manual classification, as shown in Figure 9B, the manual platform can parse the classification request to obtain the M disputed data and the classification rule standard text. It then divides the M disputed data into multiple disputed data groups and assigns each disputed data group, together with the classification rule standard text, to one manual classification terminal, so that the M disputed data are classified in parallel by multiple manual classification terminals.
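One way the manual platform could divide the M disputed data into groups is a simple round-robin split, sketched below. The source does not say how the groups are formed, so the round-robin policy and the function name are assumptions.

```python
from typing import List, Sequence


def split_into_groups(disputed: Sequence[str], num_terminals: int) -> List[List[str]]:
    """Distribute disputed data round-robin over the available terminals.

    Returns one group per terminal that receives at least one item.
    """
    if num_terminals < 1:
        raise ValueError("num_terminals must be at least 1")
    groups: List[List[str]] = [[] for _ in range(num_terminals)]
    for index, item in enumerate(disputed):
        groups[index % num_terminals].append(item)
    return [group for group in groups if group]


groups = split_into_groups(["d1", "d2", "d3", "d4", "d5"], num_terminals=2)
```

Each resulting group would then be sent to one manual classification terminal together with the classification rule standard text.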

[0118] For each manual classification terminal that has been assigned disputed data, the second object corresponding to that manual classification terminal (such as the classification personnel using that terminal) starts the manual classification task. The manual classification terminal then displays the classification interface, in which the second object can manually classify the disputed data, that is, manually mark the type of the disputed data.

[0119] As shown in Figure 10, the classification interface includes a data display area and a classification area. The data display area shows the disputed data and the classification rule standard text. The classification area is an area operable by the second object, used to input manual classification results. It should be noted that this embodiment does not limit the specific location of the data display area and classification area within the classification interface.

[0120] In one example, the data display area can show multiple disputed data entries at a time, along with the corresponding classification rule standards in the classification rule standard text. This allows the second object to mark the classification results of these multiple disputed data entries within the classification area.

[0121] In one example, the data display area shows one disputed data point at a time, along with the corresponding classification rule standard in the classification rule standard text. This allows the second object to mark the classification result of this disputed data point in the classification area. If it is necessary to mark the next disputed data point, the second object can trigger the next option. The manual classification terminal, in response to the second object's triggering of the next option, displays the next disputed data point and its corresponding classification rule standard in the classification rule standard text in the data display area. Alternatively, the manual classification terminal automatically displays the next disputed data point and its corresponding classification rule standard in the classification rule standard text in the data display area after detecting that the second object has finished marking the current disputed data point.

[0122] In some embodiments, to reduce the complexity of manual classification and improve its efficiency and consistency, the classification request in this application embodiment further includes the K classification results for each of the M disputed data. Thus, for each of the M disputed data, such as the first disputed data, the manual classification terminal displays the first disputed data in the data display area and a classification candidate list for the first disputed data in the classification area. This classification candidate list includes the distinct classification results among the K classification results for the first disputed data.

[0123] For example, as shown in Figure 11, the data display area shows the first disputed data information, including its identifier (e.g., #30042 / 29988), its content, and its original link. The content includes a video title (e.g., Model A aircraft makes its first public appearance at Airshow B), a video description (e.g., Model A aircraft performs its first public flight demonstration at Airshow B, showcasing excellent maneuverability. The aircraft employs a design…), and ASR text (e.g., Hello everyone, today we'll look at the design concept of Model A aircraft). Additionally, as shown in Figure 11, the data display area also shows the classification rule standards corresponding to the first disputed data in the classification rule standard text (e.g., classification standard: technology: content involving flight technology; model: oil / electric models that need to be assembled).

[0124] Continuing to refer to Figure 11, the manual classification terminal displays a classification candidate list for the first disputed data in the classification area. This candidate list includes the distinct classification results among the K classification results for the first disputed data, for example, technology and model. Assuming K equals 3, two of the three first classification models predict that the first disputed data belongs to technology, and one first classification model predicts that it belongs to model. Thus, technology receives 2 votes and model receives 1 vote, and the candidate list displayed in the classification area includes both technology and model as candidate items. Optionally, 2 votes are displayed for the technology candidate item and 1 vote for the model candidate item, so that the second object can refer to the voting results to determine the manual classification result for the first disputed data.
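The candidate list with vote counts can be derived directly from the K classification results. The sketch below uses Python's `collections.Counter`; the function name and the choice to order candidates by descending vote count are assumptions, since the source does not state a display order.

```python
from collections import Counter
from typing import List, Tuple


def build_candidate_list(labels: List[str]) -> List[Tuple[str, int]]:
    """Return (candidate label, votes) pairs, most-voted candidate first."""
    return Counter(labels).most_common()


# K = 3: two models predict "technology", one predicts "model".
candidates = build_candidate_list(["technology", "model", "technology"])
```

For the example in the text this yields technology with 2 votes followed by model with 1 vote.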

[0125] In some embodiments, as shown in Figure 11, the classification area in this embodiment can also display the predicted classification result of each of the K first classification models for the first disputed data. For example, in the classification area of Figure 11, the predicted classification results of first classification model 1 and first classification model 2 are displayed as "technology", and the predicted classification result of first classification model 3 is displayed as "model".

[0126] As shown in Figure 11, the second object can select a candidate classification result from the classification candidate list as the manual classification result. In response to the second object's selection of a target classification result from the classification candidate list, the manual classification terminal determines that target classification result as the manual classification result for the first disputed data. For example, as shown in Figure 11, the second object triggers the "technology" candidate item in the classification candidate list, and the manual classification terminal determines "technology" as the manual classification result of the first disputed data. Optionally, as shown in Figure 11, in response to the second object's selection of "technology" in the classification candidate list, the manual classification terminal changes the "technology" candidate item from the unselected state to the selected state.

[0127] In some embodiments, as shown in Figure 12, the classification candidate list also includes a manual input option, which allows the second object to manually input the classification result. Specifically, when the second object triggers the manual input option, the manual classification terminal responds by displaying a classification input box. The second object can then input the manual classification result of the first disputed data into this box.

[0128] In some embodiments, as shown in Figure 11, the data display area in this embodiment includes the first disputed data and the corresponding classification rule standard in the classification rule standard text. The classification interface also includes a "next" option. When the second object needs to mark the next piece of disputed data, the second object triggers the "next" option. In response to the second object's triggering operation on the "next" option, the manual classification terminal displays the second disputed data and the corresponding classification rule standard in the classification rule standard text in the data display area. Simultaneously, a classification candidate list for the second disputed data is displayed in the classification area, which includes the distinct classification results among the K classification results of the second disputed data.

[0129] In some embodiments, as shown in Figure 11, the classification interface also includes a skip option. When the second object triggers this skip option, the manual classification terminal responds by displaying the next disputed data and its corresponding classification rule standard in the classification rule standard text in the data display area. Simultaneously, a classification candidate list for the next disputed data is displayed in the classification area, including the distinct classification results among the K classification results for the next disputed data.

[0130] In some embodiments, when the manual classification terminal detects that the second object has completed processing all assigned disputed data, it displays a submit option. The second object triggers this submit option, and the manual classification terminal, in response, sends the manual classification results of the disputed data to the classification platform. In one possible implementation, the manual classification terminal may first send the manual classification results of the disputed data to the manual platform, and then the manual platform sends the manual classification results of the disputed data to the classification platform. For example, when the manual platform assigns M disputed data to multiple manual classification terminals, after receiving the manual classification results of the disputed data from each of these multiple manual classification terminals, the manual platform sends the manual classification results of all M disputed data to the classification platform.

[0131] In this embodiment of the application, after the classification platform obtains the manual classification results of the M disputed data from the manual classification terminal, it executes the steps S102-C5 above, and selects P confusing data from the M disputed data based on the manual classification results of the M disputed data.

[0132] The embodiments of this application do not limit the specific method by which the classification platform selects P confusing data points from the M disputed data points based on the manual classification results of the M disputed data points.

[0133] In some embodiments, for each of the M disputed data points, the manual classification result of the disputed data point is compared with the K classification results predicted by the K first classification models. If the manual classification result of the disputed data point is inconsistent with the classification result with the most votes among the K classification results, then the disputed data point is identified as confusing data. In this way, P confusing data points can be selected from the M disputed data points based on the manual classification results of the M disputed data points.
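The comparison between the manual result and the most-voted model result can be sketched as follows. The function name is illustrative, and the tie-breaking behavior (the label encountered first among equally voted labels is treated as the top result) is an assumption, because the source does not say how ties are resolved.

```python
from collections import Counter
from typing import List


def is_confusing(manual_label: str, model_labels: List[str]) -> bool:
    """True if the manual result differs from the most-voted model result.

    Assumption: on a tie, the label seen first in `model_labels` wins.
    """
    top_label, _votes = Counter(model_labels).most_common(1)[0]
    return manual_label != top_label


model_labels = ["technology", "technology", "model"]
flag_when_manual_is_model = is_confusing("model", model_labels)
flag_when_manual_is_technology = is_confusing("technology", model_labels)
```

In this example the data point is confusing only when the annotator chooses "model", because the majority of the models voted "technology".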

[0134] In some embodiments, S102-C5 above includes the following steps S102-C51 to S102-C55:
S102-C51: Display the cross-validation interface, which includes a cross-validation start option; in response to the first object's triggering operation on the cross-validation start option, divide the M disputed data into R data subsets, where R is a positive integer greater than 1 and less than M;
S102-C52: Train the second classification model for R rounds based on the R data subsets. For the i-th round of training, select R-1 data subsets from the R data subsets as R-1 training sets, where these R-1 training sets are not exactly the same as the R-1 training sets selected in any round before the i-th round, and i is a positive integer from 1 to R;
S102-C53: Use the R-1 training sets to train the second classification model to obtain the second classification model after the i-th round of training;
S102-C54: Using the second classification model after the i-th round of training, perform classification prediction on each disputed data included in the i-th validation set to obtain the predicted classification result of each disputed data in the i-th validation set, where the i-th validation set is the data subset among the R data subsets that did not participate in the i-th round of training and differs from the validation set of every round before the i-th round;
S102-C55: Based on the predicted classification results and manual classification results of each of the M disputed data, select P confusing data from the M disputed data.

[0135] In this implementation, after the classification platform obtains the manual classification results of the M disputed data from the manual classification terminal, it displays the cross-validation interface, as shown in Figure 13, which includes a cross-validation start option. When the first object triggers this start option, the classification platform, in response, divides the M disputed data points into R data subsets, each containing one or more disputed data points. The second classification model is then trained for R rounds based on the R data subsets.

[0136] In some embodiments, the R rounds of training of the second classification model described above are independent of one another and do not interfere with each other.

[0137] In some embodiments, each round of training of the second classification model builds on the previous round; that is, the R rounds of training of the second classification model are iterative.

[0138] In this embodiment, the classification platform's training process is essentially the same in each round of training of the second classification model. For ease of description, the i-th round of training is used as an example. For the i-th round of training, the classification platform selects R-1 data subsets from the R data subsets as R-1 training sets, and uses the remaining 1 data subset as a validation set. These R-1 training sets are not completely the same as the R-1 training sets selected in each round of training before the i-th round, and this validation set is also different from the validation sets in each round of training before the i-th round.

[0139] For example, suppose R = 3, with the data subsets being subset 1, subset 2, and subset 3. In the first training round, subset 1 and subset 2 are selected as the two training sets, and subset 3 is used as the validation set. In the second training round, subset 1 and subset 3 are selected as the two training sets, and subset 2 is used as the validation set. In the third training round, subset 2 and subset 3 are selected as the two training sets, and subset 1 is used as the validation set. The R-1 (e.g., 2) training sets selected in each training round are not entirely identical across rounds, and the validation set selected in each training round is different.

[0140] For the i-th training round, the classification platform uses the R-1 training sets selected above for the i-th training round to train the second classification model, resulting in the second classification model after the i-th training round. For example, continuing with the R data subsets divided into 3 data subsets, in the first training round, data subset 1 and data subset 2 are selected as the 2 training sets for the first round to train the second classification model, resulting in the second classification model after the first round. In the second training round, data subset 1 and data subset 3 are selected as the 2 training sets for the second round to train the second classification model, resulting in the second classification model after the second round. In the third training round, data subset 2 and data subset 3 are selected as the 2 training sets for the third round to train the second classification model, resulting in the second classification model after the third round.

[0141] In one example, for each training round, such as the i-th round, the classification platform uses R-1 training sets from the i-th round to train the second classification model. The specific method for obtaining the second classification model after the i-th round of training can be as follows: Select one training set from the R-1 training sets to train the second classification model once. Then, select another training set from the R-1 training sets to train the second classification model again. This process is repeated R-1 times to obtain the second classification model after the i-th round of training.

[0142] In the embodiments of this application, as shown in Figure 14, after obtaining the second classification model after the i-th round of training based on the above steps, the classification platform inputs the disputed data included in the i-th validation set into the second classification model after the i-th round of training for classification prediction, and obtains the predicted classification result for each disputed data in the i-th validation set.

[0143] For example, the classification platform divides M disputed data points into three subsets. In the first round of training, subset 1 and subset 2 are selected as the two training sets to train the second classification model, resulting in the trained second classification model. This trained second classification model is then used to classify and predict the disputed data included in subset 3, yielding a predicted classification result for each disputed data point in subset 3. In the second round of training, subset 1 and subset 3 are selected as the two training sets to train the second classification model, resulting in the trained second classification model. This trained second classification model is then used to classify and predict the disputed data included in subset 2, yielding a predicted classification result for each disputed data point in subset 2. In the third round of training, data subsets 2 and 3 are selected as the two training sets to train the second classification model, resulting in the trained second classification model. This trained model is then used to classify and predict the disputed data included in data subset 1, yielding the predicted classification result for each disputed data point in data subset 1. In this way, the classification platform can obtain the predicted classification result for each of the M disputed data points.
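The R-round procedure above is a form of R-fold cross-validation in which every disputed data point receives exactly one prediction from a model that did not see it during training. The following sketch covers the variant in which the rounds are independent. The `train` callable and the toy majority-label trainer are hypothetical stand-ins for training the second classification model, and the rounds here hold out the subsets in index order, which differs from the order in the example but yields the same set of predictions when rounds are independent.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]          # (disputed data, manual classification result)
Predictor = Callable[[str], str]  # trained model: data -> predicted label


def out_of_fold_predictions(
    subsets: List[List[Sample]],
    train: Callable[[List[Sample]], Predictor],
) -> Dict[str, str]:
    """Run R independent rounds; round i holds out subsets[i] for validation."""
    predictions: Dict[str, str] = {}
    for i, validation_set in enumerate(subsets):
        # The other R-1 subsets form the training data for this round.
        training_data = [
            sample
            for j, subset in enumerate(subsets)
            if j != i
            for sample in subset
        ]
        predictor = train(training_data)
        for text, _manual_label in validation_set:
            predictions[text] = predictor(text)
    return predictions


def train_majority(samples: List[Sample]) -> Predictor:
    # Toy trainer: always predicts the most common manual label it was given.
    top_label = Counter(label for _text, label in samples).most_common(1)[0][0]
    return lambda text: top_label


subsets = [
    [("a", "technology"), ("b", "technology")],
    [("c", "technology"), ("d", "technology")],
    [("e", "model")],
]
predictions = out_of_fold_predictions(subsets, train_majority)
```

In this toy run, data point "e" is predicted as "technology" although its manual result is "model", which is the kind of disagreement the subsequent steps look for.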

[0144] In this embodiment of the application, after the classification platform obtains the predicted classification result of each of the M disputed data based on the above steps, it executes the above steps S102-C55 to select P confusing data from the M disputed data based on the predicted classification result of each of the M disputed data and the manual classification result.

[0145] This application embodiment does not limit the specific method by which the classification platform selects P confusing data from the M disputed data based on the predicted classification results and manual classification results of each disputed data in the M disputed data.

[0146] In some embodiments, for each of the M disputed data points, if the predicted classification result and the manual classification result of the disputed data point are inconsistent, then the disputed data point is identified as confusing data. In this way, P confusing data points can be selected from the M disputed data points.

[0147] In some embodiments, the classification platform may select the P confusing data points from the M disputed data points through the following steps: S102-C551. From the M disputed data points, select T discrepancy data points whose predicted classification result and manual classification result are inconsistent, where T is a positive integer less than or equal to M. S102-C552. For each of the T discrepancy data points, determine the confidence level of the second classification model's predicted classification result for that discrepancy data point. S102-C553. Based on the confidence level of each of the T discrepancy data points, select from the T discrepancy data points the P discrepancy data points whose confidence level is less than a preset threshold, and use them as the P confusing data points.

[0148] In this implementation, after determining the predicted classification result for each of the M disputed data points, the classification platform first selects, from these M disputed data points, the T discrepancy data points whose predicted and manual classification results are inconsistent. For example, if the predicted and manual classification results of a disputed data point are inconsistent, that disputed data point is identified as a discrepancy data point. In this way, T discrepancy data points are selected from the M disputed data points.

[0149] In some embodiments, the final classification result of the M−T disputed data points that remain after excluding the T discrepancy data points can be the manual classification result (or, equivalently, the predicted classification result, since the two are consistent for these data points).

[0150] Next, for each of the T discrepancy data points, the classification platform determines the confidence level of the second classification model's predicted classification result for that discrepancy data point.

[0151] In one implementation of determining the confidence level, the second classification model outputs both the predicted classification result for the discrepancy data and the confidence level of that predicted classification result. In this way, the classification platform can obtain the confidence level of the predicted classification result for the discrepancy data from the second classification model.

[0152] In another approach to determining the confidence level, the classification platform obtains the word sequence corresponding to the predicted classification result of the discrepancy data point. This word sequence includes L words, where L is a positive integer. The platform then obtains the log probability of each word produced while the second classification model generated this word sequence, sums these log probabilities to obtain a cumulative value, and finally exponentiates this cumulative value to obtain the confidence level of the second classification model's predicted classification result.

[0153] Specifically, when the second classification model in this application embodiment outputs the predicted classification result, it generates the result token by token. For each generated token, the second classification model outputs not only the token but also the corresponding log probability, which represents the second classification model's confidence in that token. Thus, assuming that the token sequence corresponding to the predicted classification result of the discrepancy data point includes L tokens, the classification platform sums the log probabilities corresponding to these L tokens to obtain a cumulative value.

[0154] For example, the cumulative value can be determined using the following formula (1):

$$S = \sum_{i=1}^{L} \log p_i = \log p_1 + \log p_2 + \cdots + \log p_L \tag{1}$$

where $S$ is the cumulative value, $\log p_1$ is the log probability of the first word among the L words, $\log p_2$ is the log probability of the second word among the L words, and $\log p_L$ is the log probability of the L-th word among the L words.

[0155] Next, the cumulative value is exponentiated to obtain the confidence level of the second classification model's predicted classification result for the discrepancy data point.
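As a minimal sketch of this computation, assuming the per-token log probabilities are natural logarithms and are already available as a list (the function name `sequence_confidence` is hypothetical):

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Sum the per-token log probabilities and exponentiate the sum.

    The result equals the product of the per-token probabilities,
    so it always lies in the interval (0, 1].
    """
    if not token_logprobs:
        raise ValueError("at least one token log probability is required")
    cumulative = math.fsum(token_logprobs)  # cumulative value of the log probabilities
    return math.exp(cumulative)


# Two tokens generated with probabilities 0.9 and 0.5 (illustrative values only).
confidence = sequence_confidence([math.log(0.9), math.log(0.5)])
```

Summing in log space and exponentiating once is numerically more stable than multiplying many small probabilities directly.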

[0156] Following the method described above, the classification platform can obtain the confidence level of the second classification model's predicted classification result for each of the T discrepancy data points. Then, based on these confidence levels, the classification platform selects from the T discrepancy data points the P discrepancy data points whose confidence level is less than a preset threshold, and uses them as the P confusing data points.

[0157] As described above, the classification platform in this application first selects, from the M disputed data points, the T discrepancy data points whose predicted and manual classification results are inconsistent. Typically, 20%-25% of the disputed data (e.g., 6,000-7,500 entries) show such discrepancies. These discrepancy data points are then filtered by confidence level: those with a confidence level greater than or equal to a preset threshold (e.g., 0.99) are removed, because such extremely high-confidence discrepancies are usually occasional manual annotation errors rather than confusing data. The P discrepancy data points whose confidence level is less than the preset threshold are identified as the P confusing data points. These confusing data points reveal systematic, easily confused problems in manual annotation.
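The two-stage filter can be sketched as follows. The `DisputedItem` record and `select_confusing` function are hypothetical names; the 0.99 default threshold is the example value from the text, and the provisional labels follow the embodiments in which agreeing data keep the manual result and high-confidence discrepancies take the predicted result.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DisputedItem:
    data_id: int
    manual: str        # manual classification result
    predicted: str     # predicted classification result of the second model
    confidence: float  # confidence level of the predicted result


def select_confusing(
    items: Sequence[DisputedItem], threshold: float = 0.99
) -> Tuple[Dict[int, str], List[DisputedItem]]:
    """Split disputed data into resolved labels and confusing data."""
    resolved: Dict[int, str] = {}
    confusing: List[DisputedItem] = []
    for item in items:
        if item.predicted == item.manual:
            # Manual and predicted results agree: keep the shared label.
            resolved[item.data_id] = item.manual
        elif item.confidence >= threshold:
            # High-confidence discrepancy: treated as an occasional manual error.
            resolved[item.data_id] = item.predicted
        else:
            # Low-confidence discrepancy: confusing data, left for adjudication.
            confusing.append(item)
    return resolved, confusing


items = [
    DisputedItem(1, "model", "model", 0.97),
    DisputedItem(2, "model", "technology", 0.995),
    DisputedItem(3, "technology", "model", 0.62),
]
resolved, confusing = select_confusing(items)
```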

[0158] In some embodiments, the final classification result of the T−P discrepancy data points that remain among the T discrepancy data points after excluding the P confusing data points can be the predicted classification result output by the second classification model.

[0159] In some embodiments, as shown in Figure 15, the cross-validation interface in this embodiment of the application also includes a validation result display area. After the classification platform selects the P confusing data points from the M disputed data points, it displays validation result information in this validation result display area. The validation result information includes at least one of the following: the training sets and validation set selected in each of the R rounds of training, and the cross-validation conclusions. For example, the cross-validation conclusions may include the number of discrepancy data points (e.g., 6,720, accounting for 22.4%), the number of high-confidence discrepancy data points (confidence ≥ 0.99) (e.g., 512), and the number of confusing data points (e.g., 6,208).

[0160] In this embodiment of the application, the classification platform, based on the above steps, determines P confusing data from N data to be classified, and then determines Q confusing classification pairs corresponding to these P confusing data.

[0161] Specifically, for each of the P confusing data points, the predicted classification result and the manual classification result of that data point are taken as a confusing classification pair. This yields the confusing classification pair corresponding to each of the P confusing data points. Duplicate pairs are then removed from these P confusing classification pairs, resulting in Q confusing classification pairs. For example, for confusing data point 1, assuming the predicted classification result is "technology" and the manual classification result is "model," (technology, model) is identified as a confusing classification pair.
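A minimal sketch of this deduplication follows. It assumes that a pair is unordered, so that (technology, model) and (model, technology) count as the same pair; the text does not state this explicitly. The function name `confusing_pairs` is hypothetical.

```python
from typing import Iterable, List, Tuple


def confusing_pairs(results: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse (predicted, manual) results into unique category pairs.

    Assumption: the order within a pair is ignored, so each pair is stored
    with its two categories sorted alphabetically.
    """
    seen = set()
    pairs: List[Tuple[str, str]] = []
    for predicted, manual in results:
        first, second = sorted((predicted, manual))
        key = (first, second)
        if key not in seen:
            seen.add(key)
            pairs.append(key)  # keep first-seen order for display
    return pairs


# (predicted, manual) for each confusing data point; the values are illustrative.
pairs = confusing_pairs([
    ("technology", "model"),
    ("model", "technology"),
    ("sports/NBA", "sports/CBA"),
])
```

Here three confusing data points (P = 3) reduce to two confusing classification pairs (Q = 2).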

[0162] In some embodiments, as shown in Figure 15, the cross-validation interface also includes a confusion analysis option. When the first object triggers this option, the classification platform responds by determining the confusion matrix corresponding to the P confusing data points based on the manual classification results and the predicted classification results of the P confusing data points. Each element of this confusion matrix indicates the number of data points whose manual classification result is one given category and whose predicted classification result is another given category. Based on this confusion matrix, the Q confusing classification pairs are then determined.

[0163] For example, based on the manual classification results and predicted classification results of P confused data, the confusion matrix corresponding to the P confused data is constructed as shown in Table 1, with the different classifications of the manual classification results (e.g., model, technology, sports / NBA, sports / CBA) as the column elements of the confusion matrix and the different classifications of the predicted classification results (e.g., model, technology, sports / NBA, sports / CBA) as the row elements of the confusion matrix: Table 1

[0164] It should be noted that the confusion matrix shown in Table 1 above is just an example.

[0165] As shown in Table 1, the confusion matrix in this embodiment is an n×n matrix, and each element of the confusion matrix indicates the number of data points, among the P confusing data points, whose manual classification result is one given category and whose predicted classification result is another given category.

[0166] In this way, the classification platform can determine Q confusing classification pairs using the confusion matrix. For example, "technology" and "model" are a pair of categories that are easily confused, and "sports-NBA" and "sports-CBA" are another pair of categories that are easily confused. Therefore, "technology" and "model" are determined as one confusing classification pair, and "sports-NBA" and "sports-CBA" are determined as another confusing classification pair.

[0167] In some embodiments, when determining the Q confusing classification pairs based on the confusion matrix, the classification platform sorts the confusing classification pairs by frequency, filters out random confusing classification pairs with extremely low frequency (e.g., single digits), and focuses on high-frequency confusing classification pairs that account for a large proportion (e.g., 15%-20%) of the total manually labeled data. These high-frequency confusing classification pairs typically comprise only a small number of classification pairs, but the discrepancy data they involve constitute the majority of the total confusing data.
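The matrix construction and frequency filter can be sketched as below. Rows are predicted results and columns are manual results, following the layout described for Table 1. The `min_count` cutoff and both function names are hypothetical, since the text only gives "single digits" as an example of a low frequency.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(
    categories: Sequence[str], results: Sequence[Tuple[str, str]]
) -> List[List[int]]:
    """Count (predicted, manual) results; rows = predicted, columns = manual."""
    index = {category: i for i, category in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for predicted, manual in results:
        matrix[index[predicted]][index[manual]] += 1
    return matrix


def frequent_pairs(
    categories: Sequence[str], matrix: Sequence[Sequence[int]], min_count: int
) -> List[Tuple[str, str, int]]:
    """Return category pairs confused at least min_count times, most frequent first."""
    pairs = []
    n = len(categories)
    for i in range(n):
        for j in range(i + 1, n):
            # Add both directions of confusion between categories i and j.
            count = matrix[i][j] + matrix[j][i]
            if count >= min_count:
                pairs.append((categories[i], categories[j], count))
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs


categories = ["model", "technology", "sports/NBA", "sports/CBA"]
results = (
    [("technology", "model")] * 30
    + [("model", "technology")] * 20
    + [("sports/NBA", "sports/CBA")] * 15
    + [("model", "sports/NBA")] * 2  # rare, random confusion
)
matrix = confusion_matrix(categories, results)
top_pairs = frequent_pairs(categories, matrix, min_count=10)
```

With these illustrative counts, the rare model/sports confusion is filtered out and two high-frequency pairs remain.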

[0168] In some embodiments, as shown in Figure 16, in response to the first object's triggering of the confusion analysis option, the classification platform displays a confusion analysis results interface. This confusion analysis results interface includes the Q confusing classification pairs (e.g., 2 confusing classification pairs). In one example, as shown in Figure 16, the confusion analysis results interface also includes the confusion matrix determined above.

[0169] The above describes the specific process by which a classification platform selects P confusing data points from N data points to be classified based on the standard text of classification rules, and determines the Q confusing classification pairs corresponding to these P confusing data points. Next, the classification platform executes step S103 as follows.

[0170] S103. Display the rule supplementation interface and receive the classification rule supplementation text for the Q confused classification pairs entered by the first object in the rule supplementation interface.

[0171] The rule supplementation interface displays the aforementioned Q confusing classification pairs.

[0172] This application embodiment does not limit the specific conditions that trigger the classification platform to display the rule supplementation interface.

[0173] In some embodiments, after the classification platform detects and identifies P confused data points and Q confused classification pairs, it automatically displays the rule supplementation interface.

[0174] In some embodiments, as shown in Figure 16, the above-mentioned confusion analysis results interface also includes a rule supplementation option. The first object can trigger this rule supplementation option. In response to the first object's triggering of this rule supplementation option, the classification platform displays the rule supplementation interface.

[0175] As shown in Figure 17, the rule supplementation interface of this application embodiment displays the Q confusing classification pairs. Thus, based on the displayed confusing classification pairs, the first object can enter classification rule supplementary text for these Q confusing classification pairs in the rule supplementation interface.

[0176] In some embodiments, as shown in Figure 17, the rule supplementation interface of this application embodiment includes a directory area and an editing area. The editing area includes an existing standard display area, a rule supplement box, and a confusion sample display area. The directory area displays Q directory entries (e.g., two entries corresponding to two confusing classification pairs) that correspond one-to-one with the Q confusing classification pairs. Each directory entry has a status indicator showing whether rule supplementation has been completed. In this case, receiving in S103 the classification rule supplementary text for the Q confusing classification pairs entered by the first object in the rule supplementation interface includes the following steps S103-A to S103-C: S103-A. In response to the first object's triggering operation on a first directory entry in the directory area that is in an unprocessed state, display, in the existing standard display area, a summary of the existing classification rules for the first confusing classification pair corresponding to the first directory entry, the first confusing classification pair including two easily confused categories; S103-B. Display data information of at least one confusing data point associated with the first confusing classification pair in the confusion sample display area; S103-C. Receive the classification rule supplementary text for the first confusing classification pair entered by the first object in the rule supplement box.

[0177] As shown in Figure 17, in this embodiment of the application, the directory area displays Q directory entries that correspond one-to-one with the Q confusing classification pairs. When the first object needs to supplement the classification rules for one of the Q confusing classification pairs (e.g., the first confusing classification pair), it triggers the directory entry corresponding to that pair (e.g., the first directory entry). In response to the first object's triggering operation on the first directory entry in the directory area that is in an unprocessed state, the classification platform displays a summary of the existing classification rules for the first confusing classification pair in the existing standard display area. For example, as shown in Figure 17, when the first object triggers the directory entry for the confusing classification pair "model" and "technology," the existing standard display area displays a summary of the existing classification rules for the confusing classification pair "model" and "technology." Optionally, this summary of existing classification rules may be a summary of the classification rules that correspond to the categories "model" and "technology" in the classification rule standard text.

[0178] At the same time, as shown in Figure 17, the classification platform displays data information of at least one confusing data point associated with the first confusing classification pair in the confusion sample display area. Optionally, as shown in Figure 17, the data information of two confusing data points and one non-confusing data point associated with the first confusing classification pair can be displayed.

[0179] Thus, as shown in Figure 17, the first object can refer to the summary of existing classification rules and the confusing data information for the first confusing classification pair, and enter classification rule supplementary text for the first confusing classification pair in the rule supplement box. For example, the first object enters supplementary text that is actionable, readable, and clearly distinguishes between the two categories in the first confusing classification pair. It should be noted that the supplementary rules shown in Figure 17 are merely examples, and the embodiments of this application do not impose any limitations on them.

[0180] In some embodiments, as shown in Figure 17, the rule supplementation interface also includes a "Submit Rule" option. After the first object enters the classification rule supplementary text for the first confusing classification pair in the rule supplement box, it can trigger the "Submit Rule" option. In response to the first object's triggering of this option, the classification platform saves the classification rule supplementary text for the first confusing classification pair and, as shown in Figure 18, changes the status of the directory entry for the first confusing classification pair in the directory area from unprocessed to processed.

[0181] In some embodiments, the first object may trigger another unprocessed directory entry (e.g., a second directory entry) in the directory area. In response to the first object's triggering operation on the unprocessed second directory entry in the directory area, the classification platform displays, in the existing standard display area, a summary of the existing classification rules for the second confusing classification pair corresponding to the second directory entry. At the same time, data information of at least one confusing data point associated with the second confusing classification pair is displayed in the confusion sample display area. Thus, the first object can refer to the summary of existing classification rules and the data information of the at least one confusing data point associated with the second confusing classification pair, and enter classification rule supplementary text for the second confusing classification pair in the rule supplement box.

[0182] Following the steps above, the first object can add supplementary text for the classification rules of each of the Q confusion classification pairs in the rule supplementation interface.
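The per-pair status tracking behind the directory area can be sketched as below. This is only an illustrative data structure: the class `RuleSupplementSession`, its method names, and the sample rule text are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


@dataclass
class RuleSupplementSession:
    """Tracks which confusing classification pairs already have supplementary text."""

    pairs: List[Pair]
    rules: Dict[Pair, str] = field(default_factory=dict)

    def submit(self, pair: Pair, text: str) -> None:
        """Save the supplementary text for one pair (the "Submit Rule" action)."""
        if pair not in self.pairs:
            raise KeyError(f"unknown confusing classification pair: {pair}")
        if not text.strip():
            raise ValueError("supplementary text must not be empty")
        self.rules[pair] = text.strip()

    def status(self, pair: Pair) -> str:
        """Status shown next to the pair's directory entry."""
        return "processed" if pair in self.rules else "unprocessed"

    def all_processed(self) -> bool:
        """True once every pair has supplementary text."""
        return all(pair in self.rules for pair in self.pairs)


session = RuleSupplementSession(
    pairs=[("model", "technology"), ("sports/NBA", "sports/CBA")]
)
# The rule text below is made up purely for illustration.
session.submit(("model", "technology"), "Content about a specific AI model is 'model'.")
```

A check such as `all_processed()` is one possible way to decide when every pair has been handled and the next interface can be shown.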

[0183] The above embodiment describes the specific process by which the first object enters the classification rule supplementary text for the Q confusing classification pairs in the rule supplementation interface. Next, the classification platform executes the following step S104.

[0184] S104. Based on the supplementary text of the classification rules and the standard text of the classification rules, determine the classification results of P confused data.

[0185] In this embodiment of the application, the classification platform determines P obfuscated data based on the above steps, and after obtaining the standard text of the classification rules input by the first object, it determines the classification result of each obfuscated data in the P obfuscated data based on the supplementary text of the classification rules and the standard text of the classification rules.

[0186] This application does not limit the specific implementation method of the classification platform determining the classification results of P confused data based on the supplementary text of classification rules and the standard text of classification rules.

[0187] In some embodiments, the classification platform inputs supplementary text of classification rules, standard text of classification rules, and P confused data points into a trained classification model (e.g., any one of the K first classification models mentioned above) for classification processing. Based on the supplementary text and standard text of classification rules, the classification model can accurately predict the classification of the P confused data points and output the classification result for each of the P confused data points.

[0188] In some embodiments, S104 above includes the following steps S104-A to S104-C: S104-A. Display the adjudication interface, which includes a list of adjudication models and an adjudication initiation option; S104-B. Receive the first object's selection operation on a target adjudication model in the list of adjudication models; S104-C. In response to the first object's triggering operation on the adjudication initiation option, determine, through the target adjudication model, the classification result of each of the P confusing data points based on the classification rule standard text, the classification rule supplementary text, and the Q confusing classification pairs.

[0189] In this implementation, as shown in Figure 19, the classification platform displays an adjudication interface, which includes a list of adjudication models and an adjudication initiation option. The list of adjudication models includes several different types of adjudication models. These different types of adjudication models can be understood as different types of classification models used to predict the classification results of confusing data.

[0190] This application embodiment does not limit the conditions that trigger the classification platform to display the adjudication interface. For example, after detecting that the first object has entered classification rule supplementary text for each of the Q confusing classification pairs, the classification platform automatically switches to the adjudication interface. As another example, after detecting that the first object has entered classification rule supplementary text for each of the Q confusing classification pairs, the classification platform displays a completion option in the rule supplementation interface. In response to the first object's triggering operation on this completion option, the classification platform displays the adjudication interface.

[0191] As shown in Figure 19, the adjudication interface in this embodiment includes a list of adjudication models, which includes multiple adjudication models, such as adjudication model 1 and adjudication model 2. The first object can select one of these adjudication models as the target adjudication model; for example, the first object selects adjudication model 1 as the target adjudication model. In response to the first object's selection operation on the target adjudication model, the classification platform changes the state of the target adjudication model from an unselected state to a selected state.

[0192] In some embodiments, as shown in Figure 19, the adjudication interface also includes adjudication input information, which includes the classification rule standard text, the classification rule supplementary text, the P confusing data points, and the Q confusing classification pairs.

[0193] Next, as shown in Figure 19, the first object triggers the adjudication initiation option. In response to this triggering operation, the classification platform inputs the aforementioned adjudication input information into the target adjudication model. Then, through this target adjudication model, based on the classification rule standard text, the classification rule supplementary text, and the Q confusing classification pairs, it classifies each of the P confusing data points and outputs the classification result for each confusing data point.
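One possible way to assemble the adjudication input for a single data point is sketched below. It assumes, purely for illustration, that the target adjudication model accepts one plain-text input; the function name `build_adjudication_input`, the section titles, and all sample texts are hypothetical and are not specified by the embodiment.

```python
from typing import Sequence, Tuple


def build_adjudication_input(
    standard_text: str,
    supplementary_text: str,
    confusing_pairs: Sequence[Tuple[str, str]],
    item: str,
) -> str:
    """Combine the four adjudication inputs into one text (assumed input format)."""
    pair_lines = "\n".join(f"- {a} vs. {b}" for a, b in confusing_pairs)
    sections = [
        ("Classification rule standard text", standard_text),
        ("Classification rule supplementary text", supplementary_text),
        ("Confusing classification pairs", pair_lines),
        ("Data to classify", item),
    ]
    # Each section is a title line followed by its body; sections are
    # separated by one blank line.
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# All texts below are placeholders made up for this sketch.
prompt = build_adjudication_input(
    standard_text="Placeholder standard rules for 'model' and 'technology'.",
    supplementary_text="Placeholder supplementary rule for model vs. technology.",
    confusing_pairs=[("model", "technology"), ("sports/NBA", "sports/CBA")],
    item="Placeholder article text.",
)
```

The platform would build one such input per confusing data point and pass it to whichever adjudication model the first object selected.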

[0194] In some embodiments, as shown in Figure 20, after the classification platform detects that the first object has triggered the adjudication initiation option, it displays a classification progress bar in the adjudication interface. This progress bar indicates the classification progress of the target adjudication model on the P confusing data points.

[0195] In some embodiments, as shown in Figure 21, after determining the classification results of the P confusing data points, the classification platform displays a classification result interface, which includes summary information on the classification results of the N (e.g., 100,000) data points to be classified. For example, as described above, the classification results of these N data points are determined in three steps in this embodiment. First (step 1), the classification platform performs classification prediction on the N data points to be classified using the K first classification models. For M of these data points the K classification results differ, while for the other N−M (e.g., 70,012) data points the K classification results are consistent. Therefore, the classification results of these N−M data points are determined by the agreement among the K first classification models. Next (step 2), the classification platform sends the M disputed data points to a manual classification terminal for manual classification and, at the same time, uses a second classification model to predict the classification of these M disputed data points, obtaining predicted classification results. Based on the manual classification results and the predicted classification results, P confusing data points are selected from the M disputed data points. For the remaining M−P (e.g., 24,488) disputed data points, the manual classification results and the predicted classification results are consistent, so the classification results of these M−P disputed data points are determined by the agreement between manual classification and model classification. Finally (step 3), the classification results of the P (e.g., 5,500) confusing data points are determined through the target adjudication model.
Therefore, as shown in Figure 22, the summary information on the classification results of the N (e.g., 100,000) data points to be classified includes the number of data points, the proportion, and the classification quality of the classification results determined in each of the above three steps. Optionally, it may also include the overall classification consistency result for these N data points (e.g., 97.2%).
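As a worked check of the example figures, the three step counts (70,012, 24,488, and 5,500) add up to N = 100,000, and the proportion for each step follows directly. The sketch below computes them; the step labels and the function name `summarize_steps` are hypothetical.

```python
from typing import Dict, Tuple


def summarize_steps(step_counts: Dict[str, int]) -> Dict[str, Tuple[int, float]]:
    """Map each step to (number of data points, share of all data points)."""
    total = sum(step_counts.values())
    if total == 0:
        raise ValueError("no data points to summarize")
    return {step: (count, count / total) for step, count in step_counts.items()}


# Counts taken from the example values in the text.
step_counts = {
    "step 1: agreement among the K first classification models": 70_012,
    "step 2: agreement between manual and model classification": 24_488,
    "step 3: target adjudication model": 5_500,
}
summary = summarize_steps(step_counts)
total = sum(count for count, _share in summary.values())
```

With these counts, step 1 covers about 70.0% of the data, step 2 about 24.5%, and step 3 the remaining 5.5%.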

[0196] In some embodiments, as shown in Figure 21, the classification result interface also includes an export analysis report option and an export final dataset option. When the first object triggers the export analysis report option, the classification platform exports the summary information on the classification results. When the first object triggers the export final dataset option, the classification platform exports the classification results for the N data points to be classified.

[0197] In some embodiments, in model training scenarios, as shown in Figure 22, the classification result interface in this embodiment may also include a one-click training and automatic deployment option and identification information of the model to be trained (e.g., model 1 to be trained). In response to the first object's triggering operation on the one-click training and automatic deployment option, the classification platform uses the above-mentioned N data points to be classified and their classification results to train the selected model to be trained, and deploys the trained model in the corresponding service.

[0198] In the data classification method provided in this application embodiment, the classification platform first displays a data input interface and receives the classification rule standard text and N data points to be classified that are input by a first object in the data input interface. The data to be classified in this application embodiment include at least one of text data, image data, video data, or audio data. Next, based on the classification rule standard text, the classification platform selects P confusing data points from the N data points to be classified and determines the Q confusing classification pairs corresponding to these P confusing data points. Then, the classification platform displays a rule supplementation interface, which shows these Q confusing classification pairs. The classification platform receives the classification rule supplementary text for the Q confusing classification pairs that is input by the first object in the rule supplementation interface, and then determines the classification results of the P confusing data points based on the classification rule supplementary text and the classification rule standard text. In other words, this application embodiment provides a visual classification platform. A first object (such as a product operator or domain expert) can directly input the classification rule standard text and the data to be classified on this platform. The platform then automatically filters out the confusing data from the N data points based on the classification rule standard text, and displays a rule supplementation interface based on the selected confusing data. The first object can enter classification rule supplementary text for the confusing classification pairs in this interface, thus establishing clear distinguishing criteria for each specific confusing classification pair.
In this way, the classification platform can accurately classify the confusing data based on the classification rule supplementary text and the classification rule standard text. That is, this application embodiment solves the problem of classifying confusing data through a closed loop of identifying confusion, supplementing rules, and classifying accurately, which improves the accuracy and consistency of data classification. For training sample annotation scenarios, the method of this application embodiment can provide higher-quality training samples for supervised machine learning models, so that the finally trained model is more robust and generalizes better when handling edge cases and complex data, thereby improving model performance. For real-time classification scenarios, the method of this application embodiment can accurately classify text, images, and videos, improving the accuracy and efficiency of content classification and review. Furthermore, the method of this application embodiment allows the classification platform to automatically classify the simple data within the data to be classified, while the first object only supplements rules for a small amount of confusing data. This saves the first object a significant amount of time that would otherwise be spent manually classifying and reviewing simple data, thereby improving data classification efficiency. In addition, the classification platform in this application embodiment guides the first object (e.g., a product operator or domain expert) through the data input interface and the rule supplementation interface, enabling them to complete the entire process from standard setting to rule optimization. This exposes hidden quality risks, turns complex knowledge engineering into simple interactive operations, and lowers the professional threshold.

[0199] The above provides an overall description of the data classification method proposed in the embodiments of this application. The following further describes the data classification method of the embodiments of this application in conjunction with Figure 23, taking the interaction between the classification platform and the manual classification terminal as an example.

[0200] Figure 23 is a schematic flowchart of a data classification method provided in an embodiment of this application.

[0201] As shown in Figure 23, the data classification method of this application embodiment includes the following steps: S201. The classification platform displays the data input interface and receives the classification rule standard text and N data points to be classified that are input by the first object in the data input interface.

[0202] The data to be classified includes at least one of text data, image data, video data, or audio data, where N is a positive integer.

[0203] The specific implementation process of S201 can be referred to the relevant description of S101 above, and will not be repeated here.

[0204] S202. In response to the first object's selection operation on K first classification models in the classification model list, the classification platform changes the state of the K first classification models from the unselected state to the selected state.

[0205] The data input interface also includes a list of classification models and a parallel classification startup option, where K is a positive integer greater than 1.

[0206] The specific implementation process of S202 can be referred to the relevant description of S102-A above, and will not be repeated here.

[0207] S203. In response to the first object's trigger operation of the parallel classification start option, the classification platform classifies the N data to be classified based on the classification rule standard text through each of the K first classification models, and obtains K classification results for each data to be classified.

[0208] Where K is a positive integer greater than 1.

[0209] The specific implementation process of S203 can be referred to the relevant description of S102-B above, and will not be repeated here.

[0210] S204. Based on the K classification results of each unclassified data, the classification platform selects M disputed data from N unclassified data.

[0211] Among them, the K classification results of the disputed data are not completely consistent, and M is a positive integer less than or equal to N.

[0212] The specific implementation process of S204 can be referred to the relevant description of S102-C1 above, and will not be repeated here.
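
The selection in S204 can be illustrated with a minimal Python sketch. It assumes each classification result is a plain category label; the function name `select_disputed` and the sample labels are hypothetical and do not come from this application.

```python
from typing import Sequence


def select_disputed(results_per_item: Sequence[Sequence[str]]) -> list[int]:
    """Return the indices of items whose K classification results are not all identical."""
    return [
        index
        for index, results in enumerate(results_per_item)
        if len(set(results)) > 1  # more than one distinct label means the models disagree
    ]


# Hypothetical example: N = 4 data to be classified, K = 3 first classification models.
k_results = [
    ["sports", "sports", "sports"],
    ["news", "politics", "news"],
    ["music", "music", "music"],
    ["ad", "spam", "scam"],
]
disputed_indices = select_disputed(k_results)
print(disputed_indices)  # [1, 3], so M = 2
```

Items on which all K models agree are treated as simple data and need no manual classification in this sketch.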

[0213] S205. The classification platform displays a consistency analysis interface based on M disputed data points.

[0214] The consistency analysis interface includes allocation options.

[0215] S206. In response to the first object's triggering operation on the allocation option, the classification platform sends a classification request to the manual classification terminal.

[0216] The classification request includes M disputed data items and a standard text of classification rules. The classification request is used to request manual classification of the M disputed data items.

[0217] The specific implementation process of S206 can be referred to the relevant descriptions of S102-C3 above, and will not be repeated here.

[0218] The steps for manual classification are described in S207 to S208 below.

[0219] S207. The manual classification terminal displays the classification interface and receives the manual classification results of each disputed data entered by the second object in the classification area.

[0220] The classification interface includes a data display area and a classification area. The data display area shows disputed data and the standard text of classification rules.

[0221] In some embodiments, the classification request further includes K classification results for each of the M disputed data. In this case, the first disputed data is displayed in the data display area, and a classification candidate list for the first disputed data is displayed in the classification area. The classification candidate list includes different classification results among the K classification results of the first disputed data, and the first disputed data is one of the disputed data in the M disputed data. In response to the second object's selection operation on the target classification result in the classification candidate list, the target classification result is determined as the manual classification result of the first disputed data.
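
One way to build the classification candidate list described above is to keep each distinct label once, in the order the models produced it. This is a minimal sketch; the function name `candidate_list` and the ordering choice are assumptions, not details stated in this application.

```python
def candidate_list(k_results: list[str]) -> list[str]:
    """Return the distinct classification results among the K results, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated labels.
    return list(dict.fromkeys(k_results))


# Hypothetical K = 3 results for the first disputed data.
candidates = candidate_list(["news", "politics", "news"])
print(candidates)  # ['news', 'politics']
```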

[0222] In some embodiments, the classification candidate list further includes a manual input option. The method in this application embodiment further includes: the manual classification terminal responding to the second object's triggering operation on the manual input option, displaying a classification input box; and receiving the manual classification result of the first disputed data entered by the second object in the classification input box.

[0223] In some embodiments, in response to the second object's triggering operation for the next option, the manual classification terminal displays the second disputed data and the classification rule standard corresponding to the second disputed data in the classification rule standard text in the data display area; and displays a classification candidate list of the second disputed data in the classification area, the classification candidate list including different classification results among the K classification results of the second disputed data.

[0224] S208. The manual classification terminal sends the manual classification results of M disputed data to the classification platform.

[0225] S209. The classification platform displays the cross-validation interface.

[0226] The cross-validation interface includes cross-validation startup options.

[0227] S210. In response to the first object's triggering operation on the cross-validation startup option, the classification platform divides the M disputed data into R data subsets.

[0228] Where R is a positive integer greater than 1 and less than M.

[0229] The specific implementation process of S210 can be referred to the relevant description of S102-C51 above, and will not be repeated here.

[0230] S211. The classification platform trains the second classification model for R rounds based on R data subsets. For the i-th round of training, R-1 data subsets are selected from the R data subsets as R-1 training sets. The second classification model is trained using the R-1 training sets to obtain the second classification model after the i-th round of training. The second classification model after the i-th round of training is then used to classify and predict each disputed data included in the i-th validation set to obtain the predicted classification result for each disputed data in the i-th validation set.

[0231] The R-1 training sets are all different from the R-1 training sets selected in each round of training before the i-th round, where i is a positive integer from 1 to R.

[0232] The i-th validation set is the subset of data in the R data subsets that did not participate in the i-th round of training, and the i-th validation set is different from the validation set of each round of training before the i-th round.

[0233] The specific implementation process of S211 can be referred to the relevant descriptions of S102-C52 to S102-C54 above, and will not be repeated here.
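
The R rounds of training and validation in S210 to S211 follow the usual R-fold cross-validation pattern, sketched below in Python. The round-robin split, the function names, and the majority-label stand-in for the second classification model are all assumptions made for illustration; this application does not specify them here.

```python
from collections import Counter
from typing import Any, Callable


def split_into_subsets(m: int, r: int) -> list[list[int]]:
    """Divide M item indices into R subsets (round-robin assignment is an assumption)."""
    if not 1 < r < m:
        raise ValueError("R must be greater than 1 and less than M")
    return [list(range(start, m, r)) for start in range(r)]


def cross_validate(
    m: int,
    r: int,
    train: Callable[[list[int]], Any],
    predict: Callable[[Any, int], str],
) -> dict[int, str]:
    """Run R rounds; round i trains on R-1 subsets and predicts on the held-out subset."""
    subsets = split_into_subsets(m, r)
    predictions: dict[int, str] = {}
    for i, validation_set in enumerate(subsets):
        training_indices = [
            idx for j, subset in enumerate(subsets) if j != i for idx in subset
        ]
        model = train(training_indices)
        for idx in validation_set:
            predictions[idx] = predict(model, idx)
    return predictions


# Hypothetical manual classification results for M = 6 disputed data.
manual_labels = ["a", "a", "a", "b", "a", "a"]


def train_majority(training_indices: list[int]) -> str:
    """Placeholder 'model': the most frequent manual label in the training subsets."""
    return Counter(manual_labels[i] for i in training_indices).most_common(1)[0][0]


def predict_majority(model: str, index: int) -> str:
    """Placeholder prediction: always returns the majority label learned in training."""
    return model


predicted_labels = cross_validate(len(manual_labels), 3, train_majority, predict_majority)
```

Each disputed data item is held out exactly once, so every item receives exactly one predicted classification result from a model that did not see it during training.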

[0234] S212. Based on the predicted classification results and manual classification results of each of the M disputed data, the classification platform selects P confusing data from the M disputed data.

[0235] In some embodiments, the classification platform selects, from the M disputed data, T discrepant data whose predicted classification result is inconsistent with the manual classification result, where T is a positive integer less than or equal to M; for each of the T discrepant data, determines the confidence level of the second classification model in the predicted classification result of that discrepant data; and, based on the confidence level corresponding to each of the T discrepant data, selects from the T discrepant data the P discrepant data whose confidence levels are less than a preset threshold as the P confusing data.
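
The two-stage filter in this paragraph can be sketched as follows. The `DisputedItem` record, its field names, and the threshold value of 0.6 are hypothetical; only the filtering logic (label mismatch, then confidence below a preset threshold) follows the text.

```python
from dataclasses import dataclass


@dataclass
class DisputedItem:
    item_id: str
    manual_label: str      # manual classification result
    predicted_label: str   # predicted classification result from cross-validation
    confidence: float      # confidence of the second classification model


def select_confusing(items: list[DisputedItem], threshold: float) -> list[DisputedItem]:
    """Keep items whose labels disagree and whose confidence is below the threshold."""
    discrepant = [it for it in items if it.manual_label != it.predicted_label]  # T items
    return [it for it in discrepant if it.confidence < threshold]              # P items


items = [
    DisputedItem("d1", "ad", "ad", 0.95),     # labels agree: not discrepant
    DisputedItem("d2", "ad", "spam", 0.40),   # discrepant and low confidence: confusing
    DisputedItem("d3", "news", "ad", 0.90),   # discrepant but high confidence
]
confusing = select_confusing(items, threshold=0.6)
print([it.item_id for it in confusing])  # ['d2']
```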

[0236] In some embodiments, the classification platform determining, for each of the T discrepant data, the confidence level of the second classification model in the predicted classification result of that discrepant data includes: obtaining the word sequence corresponding to the predicted classification result of the discrepant data, the word sequence including L words, where L is a positive integer; obtaining the log probability corresponding to each word in the process of the second classification model generating the word sequence; summing the log probabilities corresponding to the L words to obtain a cumulative value; and performing an exponential operation on the cumulative value to obtain the confidence level of the second classification model in the predicted classification result.
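
The confidence computation described above amounts to exponentiating the sum of the per-word log probabilities, which equals the product of the per-word probabilities. A minimal Python sketch follows; the function name and the sample probabilities are illustrative assumptions.

```python
import math


def sequence_confidence(word_logprobs: list[float]) -> float:
    """Confidence = exp(sum of the log probabilities of the L generated words)."""
    if not word_logprobs:
        raise ValueError("the word sequence must contain at least one word (L >= 1)")
    cumulative = sum(word_logprobs)  # cumulative value of the L log probabilities
    return math.exp(cumulative)      # exponential operation on the cumulative value


# Hypothetical two-word result generated with probabilities 0.9 and 0.8.
confidence = sequence_confidence([math.log(0.9), math.log(0.8)])
print(round(confidence, 2))  # 0.72
```

Because every log probability is at most zero, the resulting score lies in (0, 1], and longer word sequences tend to score lower than shorter ones at the same per-word probability.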

[0237] In some embodiments, the cross-validation interface further includes a validation result display area, and the method further includes: after the classification platform selects P confusing data from M disputed data, it displays validation result information in the validation result display area. The validation result information includes at least one of the training set and validation set selected in each of the R rounds of training and the cross-validation conclusion information. The cross-validation conclusion information includes at least the number of P confusing data.

[0238] The specific implementation process of S212 can be referred to the relevant description of S102-C55 above, and will not be repeated here.

[0239] S213. In response to the first object's triggering operation on the confusion analysis option, the classification platform determines the confusion matrix corresponding to the P confusing data based on the manual classification results and predicted classification results of the P confusing data, and determines Q confusion classification pairs based on the confusion matrix.

[0240] Here, each element of the confusion matrix indicates the number of data items with a given manual classification result and a given predicted classification result.

[0241] The cross-validation interface also includes a confusion analysis option.

[0242] The specific implementation process of S213 above can be referred to the relevant description of determining the confusion classification pair in the above embodiments, and will not be repeated here.
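
A minimal sketch of S213 is given below. The confusion matrix is stored sparsely as counts keyed by (manual result, predicted result). How the Q confusion classification pairs are derived from the matrix is defined in the earlier embodiments; the rule used here (merge the two off-diagonal cells of each category pair and rank by total count) is an assumption for illustration, as are the function names and labels.

```python
from collections import Counter


def confusion_counts(manual: list[str], predicted: list[str]) -> Counter:
    """Sparse confusion matrix: count of items per (manual result, predicted result)."""
    return Counter(zip(manual, predicted))


def confusion_pairs(counts: Counter, min_count: int = 1) -> list[tuple[str, ...]]:
    """Unordered category pairs that are confused at least min_count times (assumed rule)."""
    pair_totals: Counter = Counter()
    for (manual_label, predicted_label), n in counts.items():
        if manual_label != predicted_label:
            pair_totals[frozenset((manual_label, predicted_label))] += n
    return [
        tuple(sorted(pair))
        for pair, total in pair_totals.most_common()
        if total >= min_count
    ]


# Hypothetical P = 4 confusing data.
manual = ["ad", "ad", "spam", "news"]
predicted = ["spam", "spam", "ad", "politics"]
counts = confusion_counts(manual, predicted)
pairs = confusion_pairs(counts)
print(pairs)  # [('ad', 'spam'), ('news', 'politics')], so Q = 2
```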

[0243] S214. In response to the first object's triggering operation on the confusion analysis option, the classification platform displays the confusion analysis results interface.

[0244] The confusion analysis results interface includes Q confusion classification pairs and rule supplementation options.

[0245] S215. In response to the first object's trigger operation on the rule supplementation option, the classification platform displays the rule supplementation interface.

[0246] The rule supplementation interface includes a directory area and an editing area. The editing area includes an existing standard display area, a rule supplementation box, and a confusion sample display area. The directory area displays Q directory entries, each corresponding to one of the Q confusion classification pairs. Each directory entry has a status indicator showing whether rule supplementation for that pair has been completed.

[0247] The specific implementation process of S215 can be referred to the relevant description of S103 above, and will not be repeated here.

[0248] S216. In response to the first object's triggering operation on a first directory entry that is in an unprocessed state in the directory area, the classification platform displays, in the existing standard display area, the existing classification rule summary of the first confusion classification pair corresponding to the first directory entry, and displays, in the confusion sample display area, the data information of at least one confusing data associated with the first confusion classification pair.

[0249] The specific implementation process of S216 can be referred to the relevant descriptions of S103-A and S103-B above, and will not be repeated here.

[0250] S217. The classification platform receives the classification rule supplementary text for the first confusion classification pair that the first object enters in the rule supplementation box.

[0251] In some embodiments, the rule supplementation interface also includes a rule submission option, and the method further includes: in response to the first object's triggering operation on the rule submission option, the classification platform saves the classification rule supplementary text of the first confusion classification pair, and changes the status of the directory entry of the first confusion classification pair in the directory area from an unprocessed state to a processed state.

[0252] The specific implementation process of S217 can be referred to the relevant description of S103-C above, and will not be repeated here.
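
The status change on rule submission can be pictured with the small Python sketch below. The `DirectoryEntry` record, the `Status` enum, and the rejection of empty text are hypothetical design choices, not details given in this application.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSED = "processed"


@dataclass
class DirectoryEntry:
    pair: tuple[str, str]               # one confusion classification pair
    status: Status = Status.UNPROCESSED
    supplementary_text: str = ""


def submit_rule(entry: DirectoryEntry, text: str) -> None:
    """Save the supplementary text and mark the directory entry as processed."""
    if not text.strip():
        # Assumption: an empty submission is rejected rather than saved.
        raise ValueError("supplementary text must not be empty")
    entry.supplementary_text = text
    entry.status = Status.PROCESSED


entry = DirectoryEntry(pair=("ad", "spam"))
submit_rule(entry, "Unsolicited bulk messages are spam even when they promote a product.")
print(entry.status.value)  # processed
```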

[0253] S218. The classification platform displays the adjudication interface and receives the first object's selection operation of the target adjudication model in the adjudication model list.

[0254] The adjudication interface includes a list of adjudication models and adjudication initiation options.

[0255] S219. In response to the first object's triggering operation on the adjudication initiation option, the classification platform uses the target adjudication model to determine the classification result of each of the P confusing data based on the classification rule standard text, the classification rule supplementary text, and the Q confusion classification pairs.

[0256] The specific implementation process of S218 to S219 can be referred to the relevant descriptions of S104-A to S104-C above, and will not be repeated here.
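
If the target adjudication model is a text-prompted large model, the inputs named in S219 could be assembled into a single prompt as sketched below. The prompt wording, the function name, and the idea of passing the inputs as one prompt are assumptions for illustration; this application does not prescribe a prompt format here.

```python
def build_adjudication_prompt(
    standard_text: str,
    supplements: dict[tuple[str, str], str],
    item_text: str,
) -> str:
    """Assemble the rule standard, per-pair supplementary rules, and one data item."""
    lines = [
        "Classification rule standard:",
        standard_text,
        "",
        "Supplementary rules for confusion classification pairs:",
    ]
    for (category_a, category_b), text in supplements.items():
        lines.append(f"- {category_a} vs {category_b}: {text}")
    lines += ["", "Data to classify:", item_text, "", "Answer with one category name."]
    return "\n".join(lines)


prompt = build_adjudication_prompt(
    standard_text="ad: paid promotion. spam: unsolicited bulk content.",
    supplements={("ad", "spam"): "Bulk-sent promotions count as spam."},
    item_text="Buy now!!! Sent to 10,000 recipients.",
)
print(prompt)
```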

[0257] S220. The classification platform displays the classification results interface.

[0258] The classification results interface includes summary information on the classification results of N data to be classified.

[0259] The data classification method provided in this application combines model prediction with manual classification to accurately classify easily confused data, thereby improving the accuracy of data classification. Furthermore, in this method, the majority (e.g., approximately 70%) of the classification work is completed automatically and accurately by a large model, while manual processing is required only for a small portion (e.g., 30%) of disputed data. This reduces the need for a large number of classification personnel, lowers labor costs, and significantly shortens the overall classification cycle. Further, the classification platform guides a first object (e.g., product operations personnel or domain experts) through a data input interface and a rule supplementation interface to complete the entire process from standard setting to rule supplementation. Meanwhile, a manual classification terminal guides a second object through a classification interface to manually classify the small amount of disputed data. Finally, the classification platform guides the first object through an adjudication interface to select a target adjudication model and classify the confusing data. The entire classification process transforms complex knowledge engineering into simple interactive operations, lowering the professional threshold.

[0260] The following further describes the data classification method of this application embodiment in conjunction with Figure 24, taking a classification platform that includes a terminal device and a server as an example.

[0261] Figure 24 is a schematic flowchart of a data classification method provided in an embodiment of this application.

[0262] As shown in Figure 24, the data classification method of this application embodiment includes the following steps:

S301. The terminal device displays a data input interface and receives the classification rule standard text and the N data to be classified that the first object enters in the data input interface.

[0263] The data to be classified includes at least one of text data, image data, video data, or audio data, where N is a positive integer.

[0264] The specific implementation process of S301 can be referred to the relevant description of S101 above, and will not be repeated here.

[0265] S302. In response to the first object's selection operation on K first classification models in the classification model list, the terminal device changes the state of the K first classification models from the unselected state to the selected state.

[0266] The data input interface also includes a list of classification models and a parallel classification startup option, where K is a positive integer greater than 1.

[0267] The specific implementation process of S302 can be referred to the relevant description of S102-A above, and will not be repeated here.

[0268] S303. In response to the first object's triggering operation on the parallel classification start option, the terminal device sends the identification information of the K first classification models, the N data to be classified, and the classification rule standard text to the server.
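
The message sent in S303 can be pictured as a small serializable record. The field names, the JSON encoding, and the model identifiers below are hypothetical; this application only states which three pieces of information are sent to the server.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ParallelClassificationRequest:
    model_ids: list[str]      # identification information of the K first classification models
    items: list[str]          # the N data to be classified (text items in this sketch)
    rule_standard_text: str   # classification rule standard text

    def to_json(self) -> str:
        """Serialize the request; K must be greater than 1 and N at least 1."""
        if len(self.model_ids) < 2:
            raise ValueError("K must be greater than 1")
        if not self.items:
            raise ValueError("N must be a positive integer")
        return json.dumps(asdict(self), ensure_ascii=False)


request = ParallelClassificationRequest(
    model_ids=["model-a", "model-b"],
    items=["Buy now!!!", "Election results announced"],
    rule_standard_text="ad: paid promotion. news: reporting on current events.",
)
payload = request.to_json()
print(payload)
```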

[0269] S304. The server uses each of the K first classification models to classify the N data to be classified based on the classification rule standard text, and obtains K classification results for each data to be classified.

[0270] Where K is a positive integer greater than 1.

[0271] The specific implementation process of S304 can be referred to the relevant description of S102-B above, and will not be repeated here.

[0272] S305. Based on the K classification results of each unclassified data, the server selects M disputed data from N unclassified data.

[0273] Among them, the K classification results of the disputed data are not completely consistent, and M is a positive integer less than or equal to N.

[0274] The specific implementation process of S305 can be referred to the relevant description of S102-C1 above, and will not be repeated here.

[0275] S306. The server sends M disputed data items to the terminal device.

[0276] S307. The terminal device displays a consistency analysis interface based on M disputed data points.

[0277] The consistency analysis interface includes allocation options.

[0278] S308. In response to the first object's triggering operation on the allocation option, the terminal device sends a classification request to the manual classification terminal.

[0279] The classification request includes M disputed data items and a standard text of classification rules. The classification request is used to request manual classification of the M disputed data items.

[0280] The specific implementation process of S308 can be referred to the relevant descriptions of S102-C3 above, and will not be repeated here.

[0281] S309. The manual classification terminal displays the classification interface based on the classification request and receives the manual classification results of each disputed data entered by the second object in the classification area.

[0282] The classification interface includes a data display area and a classification area. The data display area shows disputed data and the standard text of classification rules.

[0283] In some embodiments, the classification request further includes K classification results for each of the M disputed data. In this case, the first disputed data is displayed in the data display area, and a classification candidate list for the first disputed data is displayed in the classification area. The classification candidate list includes different classification results among the K classification results of the first disputed data, and the first disputed data is one of the disputed data in the M disputed data. In response to the second object's selection operation on the target classification result in the classification candidate list, the target classification result is determined as the manual classification result of the first disputed data.

[0284] In some embodiments, the classification candidate list further includes a manual input option. The method in this application embodiment further includes: the manual classification terminal responding to the second object's triggering operation on the manual input option, displaying a classification input box; and receiving the manual classification result of the first disputed data entered by the second object in the classification input box.

[0285] In some embodiments, in response to the second object's triggering operation for the next option, the manual classification terminal displays the second disputed data and the classification rule standard corresponding to the second disputed data in the classification rule standard text in the data display area; and displays a classification candidate list of the second disputed data in the classification area, the classification candidate list including different classification results among the K classification results of the second disputed data.

[0286] S310. The manual classification terminal sends the manual classification results of M disputed data to the terminal device.

[0287] S311. The terminal device displays a cross-validation interface based on the manual classification results of M disputed data.

[0288] The cross-validation interface includes cross-validation startup options.

[0289] S312. In response to the first object's triggering operation on the cross-validation startup option, the terminal device sends the manual classification results of M disputed data to the server.

[0290] S313. The server divides the M disputed data into R data subsets.

[0291] Where R is a positive integer greater than 1 and less than M.

[0292] The specific implementation process of S313 can be referred to the relevant description of S102-C51 above, and will not be repeated here.

[0293] S314. The server trains the second classification model for R rounds based on R data subsets. For the i-th round of training, R-1 data subsets are selected from the R data subsets as R-1 training sets, and the second classification model is trained using the R-1 training sets to obtain the second classification model after the i-th round of training. The second classification model after the i-th round of training is used to classify and predict each disputed data included in the i-th validation set to obtain the predicted classification result of each disputed data in the i-th validation set.

[0294] The R-1 training sets are all different from the R-1 training sets selected in each round of training before the i-th round, where i is a positive integer from 1 to R.

[0295] The i-th validation set is the subset of data in the R data subsets that did not participate in the i-th round of training, and the i-th validation set is different from the validation set of each round of training before the i-th round.

[0296] The specific implementation process of S314 can be referred to the relevant descriptions of S102-C52 to S102-C54 above, and will not be repeated here.

[0297] S315. Based on the predicted classification result and the manual classification result of each of the M disputed data, the server selects P confusing data from the M disputed data.

[0298] In some embodiments, the classification platform selects, from the M disputed data, T discrepant data whose predicted classification result is inconsistent with the manual classification result, where T is a positive integer less than or equal to M; for each of the T discrepant data, determines the confidence level of the second classification model in the predicted classification result of that discrepant data; and, based on the confidence level corresponding to each of the T discrepant data, selects from the T discrepant data the P discrepant data whose confidence levels are less than a preset threshold as the P confusing data.

[0299] In some embodiments, the classification platform determining, for each of the T discrepant data, the confidence level of the second classification model in the predicted classification result of that discrepant data includes: obtaining the word sequence corresponding to the predicted classification result of the discrepant data, the word sequence including L words, where L is a positive integer; obtaining the log probability corresponding to each word in the process of the second classification model generating the word sequence; summing the log probabilities corresponding to the L words to obtain a cumulative value; and performing an exponential operation on the cumulative value to obtain the confidence level of the second classification model in the predicted classification result.

[0300] In some embodiments, the cross-validation interface further includes a validation result display area, and the method further includes: after the classification platform selects P confusing data from M disputed data, it displays validation result information in the validation result display area. The validation result information includes at least one of the training set and validation set selected in each of the R rounds of training and the cross-validation conclusion information. The cross-validation conclusion information includes at least the number of P confusing data.

[0301] The specific implementation process of S315 can be referred to the relevant description of S102-C55 above, and will not be repeated here.

[0302] S316. The server sends the P confusing data to the terminal device.

[0303] S317. The terminal device displays a confusion analysis option based on the P confusing data and, in response to the first object's triggering operation on the confusion analysis option, determines the confusion matrix corresponding to the P confusing data based on the manual classification results and predicted classification results of the P confusing data, and determines Q confusion classification pairs based on the confusion matrix.

[0304] Here, each element of the confusion matrix indicates the number of data items with a given manual classification result and a given predicted classification result.

[0305] The cross-validation interface also includes a confusion analysis option.

[0306] S318. In response to the first object's triggering operation on the confusion analysis option, the terminal device displays the confusion analysis results interface.

[0307] The confusion analysis results interface includes Q confusion classification pairs and rule supplementation options.

[0308] S319. In response to the first object's triggering operation on the rule supplementation option, the terminal device displays the rule supplementation interface.

[0309] The rule supplementation interface includes a directory area and an editing area. The editing area includes an existing standard display area, a rule supplementation box, and a confusion sample display area. The directory area displays Q directory entries, each corresponding to one of the Q confusion classification pairs. Each directory entry has a status indicator showing whether rule supplementation for that pair has been completed.

[0310] The specific implementation process of S319 can be referred to the relevant description of S103 above, and will not be repeated here.

[0311] S320. In response to the first object's triggering operation on a first directory entry that is in an unprocessed state in the directory area, the terminal device displays, in the existing standard display area, a summary of the existing classification rules for the first confusion classification pair corresponding to the first directory entry, and displays, in the confusion sample display area, the data information of at least one confusing data associated with the first confusion classification pair.

[0312] The first confusion classification pair includes two easily confused categories.

[0313] The specific implementation process of S320 can be referred to the relevant descriptions of S103-A and S103-B above, and will not be repeated here.

[0314] S321. The terminal device receives the classification rule supplementary text for the first confusion classification pair that the first object enters in the rule supplementation box.

[0315] In some embodiments, the rule supplementation interface also includes a rule submission option, and the method further includes: in response to the first object's triggering operation on the rule submission option, the classification platform saves the classification rule supplementary text of the first confusion classification pair, and changes the status of the directory entry of the first confusion classification pair in the directory area from an unprocessed state to a processed state.

[0316] The specific implementation process of S321 can be referred to the relevant description of S103-C above, and will not be repeated here.

[0317] S322. The terminal device displays the adjudication interface and receives the first object's selection operation of the target adjudication model in the adjudication model list.

[0318] The adjudication interface includes a list of adjudication models and adjudication initiation options.

S323. In response to the first object's triggering operation on the adjudication initiation option, the terminal device sends the identification information of the target adjudication model, the classification rule supplementary text, the Q confusion classification pairs, and the P confusing data to the server.

[0319] S324. The server uses the target adjudication model to determine the classification result of each of the P confused data based on the standard text of the classification rules, the supplementary text of the classification rules, and the Q confused classification pairs.

[0320] S325. The server sends a summary of the classification results for N unclassified data to the terminal device.

[0321] The classification results interface includes summary information on the classification results of N data to be classified.

[0322] Optionally, the summary information includes the classification result for each of the P confused data points.

[0323] S326. The terminal device displays the classification results interface.

[0324] The classification results interface includes summary information on the classification results of N data to be classified.

[0325] In the data classification method provided in this application, data is exchanged with the manual classification terminal through a collaborative platform. The method combines multiple large language models, cross-validation, and human expert knowledge to form a closed-loop, continuously optimized classification pipeline. This offers significant advantages in efficiency and cost, and also substantially improves the quality of the final output data and the model performance, providing a solid data foundation for developing high-quality AI models.

[0326] The method embodiments of this application are described in detail above in conjunction with Figures 2 to 24. The device embodiments of this application are described in detail below in conjunction with Figure 25.

[0327] Figure 25 is a schematic block diagram of a data classification device provided in an embodiment of this application.

[0328] As shown in Figure 25, the data classification device 10, applied to a classification platform, includes:

The data input unit 11, used to display a data input interface and receive the classification rule standard text and the N data to be classified that the first object enters in the data input interface, where the data to be classified includes at least one of text data, image data, video data, or audio data, and N is a positive integer.

The confusing data determination unit 12, used to select P confusing data from the N data to be classified based on the classification rule standard text, and determine Q confusion classification pairs corresponding to the P confusing data, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P.

The rule supplementation unit 13, used to display a rule supplementation interface and receive the classification rule supplementary text of the Q confusion classification pairs that the first object enters in the rule supplementation interface, where the rule supplementation interface displays the Q confusion classification pairs.

The classification unit 14, used to determine the classification results of the P confusing data based on the classification rule supplementary text and the classification rule standard text.

[0329] In some embodiments, the data input interface further includes a classification model list and a parallel classification start option, and the confusing data determination unit 12 is specifically configured to: in response to the first object's selection operation on K first classification models in the classification model list, adjust the state of the K first classification models from unselected to selected, where K is a positive integer greater than 1; in response to the first object's triggering operation on the parallel classification start option, classify the N data to be classified based on the classification rule standard text using each of the K first classification models, obtaining K classification results for each data to be classified; and, based on the K classification results for each data to be classified, select the P confusing data from the N data to be classified.

[0330] In some embodiments, the obfuscated data determination unit 12 is specifically configured to: select M disputed data from the N unclassified data based on K classification results for each unclassified data, wherein the K classification results for the disputed data are not completely consistent, and M is a positive integer less than or equal to N; display a consistency analysis interface based on the M disputed data, the consistency analysis interface including allocation options; in response to the first object's triggering operation on the allocation options, send a classification request to a manual classification terminal, the classification request including the M disputed data and the classification rule standard text, the classification request being used to request manual classification of the M disputed data; obtain the manual classification results of the M disputed data from the manual classification terminal; and select P obfuscated data from the M disputed data based on the manual classification results of the M disputed data.
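The selection of disputed data described above can be illustrated with a minimal sketch. It assumes that each classification result is a plain category string and that "not completely consistent" means at least two of the K results differ; the function name `select_disputed` and the sample data are hypothetical and do not come from this application.

```python
from typing import Dict, List


def select_disputed(results: Dict[str, List[str]]) -> List[str]:
    """Return the ids of data whose K classification results are not all identical."""
    return [data_id for data_id, labels in results.items() if len(set(labels)) > 1]


# Hypothetical results of K = 3 first classification models for N = 4 data.
k_results = {
    "d1": ["sports", "sports", "sports"],
    "d2": ["sports", "finance", "sports"],
    "d3": ["tech", "tech", "tech"],
    "d4": ["finance", "tech", "health"],
}

# Only the M = 2 data with inconsistent results would be sent for manual classification.
disputed = select_disputed(k_results)
```

In this sketch, data on which all K models agree are treated as settled and never reach the manual classification terminal, which is what keeps the manual workload limited to the M disputed data.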

[0331] In some embodiments, the confusing data determination unit 12 is specifically configured to: display a cross-validation interface, the cross-validation interface including a cross-validation start option; in response to the first object's triggering operation on the cross-validation start option, divide the M disputed data into R data subsets, where R is a positive integer greater than 1 and less than M; and train the second classification model for R rounds based on the R data subsets. For the i-th round of training, where i is a positive integer from 1 to R, R-1 data subsets are selected from the R data subsets as R-1 training sets, and these R-1 training sets are not completely the same as the R-1 training sets selected in any round before the i-th round. The second classification model is trained on the R-1 training sets to obtain the second classification model after the i-th round of training. Using the second classification model after the i-th round of training, each disputed data in the i-th validation set is classified to obtain its predicted classification result. The i-th validation set is the data subset among the R data subsets that did not participate in the i-th round of training, and it differs from the validation set of every round before the i-th round. Based on the predicted classification result and the manual classification result of each of the M disputed data, the P confusing data are selected from the M disputed data.
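The R-round procedure above follows the shape of R-fold cross-validation and can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this application: the second classification model is replaced by a trivial majority-label stand-in, a fresh model is trained in each round, the subsets are formed by striding, and all function names are hypothetical. Rounds are numbered from 0 in the code, whereas the text numbers them from 1.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Sample = Tuple[str, str]  # (data id, manual classification result)
Model = TypeVar("Model")


def split_into_subsets(samples: Sequence[Sample], r: int) -> List[List[Sample]]:
    """Divide the M disputed data into R subsets of near-equal size."""
    if not 1 < r < len(samples):
        raise ValueError("R must be greater than 1 and less than M")
    return [list(samples[i::r]) for i in range(r)]


def cross_validate(
    samples: Sequence[Sample],
    r: int,
    train: Callable[[List[Sample]], Model],
    predict: Callable[[Model, str], str],
) -> Dict[str, str]:
    """Run R rounds; in each round one subset is held out as the validation set."""
    subsets = split_into_subsets(samples, r)
    predictions: Dict[str, str] = {}
    for i in range(r):
        validation = subsets[i]
        # The other R-1 subsets form the training sets of this round.
        training = [s for j, subset in enumerate(subsets) if j != i for s in subset]
        model = train(training)
        for data_id, _manual in validation:
            predictions[data_id] = predict(model, data_id)
    return predictions


def train_majority(training: List[Sample]) -> str:
    """Stand-in model (assumption): remember the most frequent manual label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model: str, data_id: str) -> str:
    """Stand-in prediction (assumption): always answer with the remembered label."""
    return model


samples = [("d1", "A"), ("d2", "A"), ("d3", "A"), ("d4", "B"), ("d5", "A"), ("d6", "A")]
predictions = cross_validate(samples, 3, train_majority, predict_majority)
mismatches = [data_id for data_id, manual in samples if predictions[data_id] != manual]
```

Because each subset serves as the validation set exactly once, every one of the M disputed data receives a prediction from a model that did not see it during training, which is what makes the comparison with the manual classification result meaningful.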

[0332] In some embodiments, the confusion data determination unit 12 is specifically configured to select T discrepancy data points from the M disputed data points where the predicted classification result and the manual classification result are inconsistent, where T is a positive integer less than or equal to M; for each of the T discrepancy data points, determine the confidence level of the predicted classification result of the second classification model for the discrepancy data point; and based on the confidence level corresponding to each of the T discrepancy data points, select P discrepancy data points from the T discrepancy data points whose confidence level is less than a preset threshold as the P confusion data points.
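The two-stage filter described above (first the T discrepancy data, then the P data whose confidence is below the preset threshold) can be sketched as follows. The record type `Disputed`, the threshold value, and the sample data are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disputed:
    data_id: str
    manual_label: str      # manual classification result
    predicted_label: str   # predicted classification result of the second model
    confidence: float      # confidence of the predicted result, in [0, 1]


def select_confusing(items: List[Disputed], threshold: float) -> List[Disputed]:
    """Keep discrepancy data whose prediction confidence is below the threshold."""
    # Stage 1: T discrepancy data, where prediction and manual result disagree.
    discrepancies = [d for d in items if d.manual_label != d.predicted_label]
    # Stage 2: P confusing data, where the model itself was not confident.
    return [d for d in discrepancies if d.confidence < threshold]


items = [
    Disputed("d1", "A", "A", 0.95),  # consistent, not a discrepancy
    Disputed("d2", "A", "B", 0.40),  # discrepancy with low confidence
    Disputed("d3", "B", "A", 0.90),  # discrepancy with high confidence
    Disputed("d4", "C", "B", 0.55),  # discrepancy with low confidence
]
confusing_ids = [d.data_id for d in select_confusing(items, threshold=0.6)]
```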

[0333] In some embodiments, the confusion data determination unit 12 is specifically used to obtain the word sequence corresponding to the predicted classification result of the difference data, the word sequence including L words, where L is a positive integer; obtain the log probability corresponding to each word in the process of the second classification model generating the word sequence; sum the log probabilities corresponding to the L words to obtain a cumulative value; and perform an exponential operation on the cumulative value to obtain the confidence score of the second classification model for the predicted classification result.
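The confidence computation above amounts to exponentiating the sum of the per-word log probabilities, which equals the product of the per-word probabilities. A minimal sketch follows; the function name `sequence_confidence` and the example probabilities are hypothetical.

```python
import math
from typing import Sequence


def sequence_confidence(word_logprobs: Sequence[float]) -> float:
    """Sum the L log probabilities and exponentiate the cumulative value."""
    if not word_logprobs:
        raise ValueError("the word sequence must contain at least one word")
    cumulative = sum(word_logprobs)
    return math.exp(cumulative)


# Hypothetical word sequence of L = 2 words generated with probabilities 0.9 and 0.8.
logprobs = [math.log(0.9), math.log(0.8)]
confidence = sequence_confidence(logprobs)  # approximately 0.9 * 0.8 = 0.72
```

Working in log space and exponentiating once at the end avoids the numerical underflow that multiplying many small probabilities directly can cause for long word sequences.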

[0334] In some embodiments, the cross-validation interface further includes a validation result display area, and the confusing data determination unit 12 is further configured to display validation result information in the validation result display area after selecting the P confusing data from the M disputed data. The validation result information includes at least one of: the training sets and validation set selected in each of the R rounds of training, and cross-validation conclusion information. The cross-validation conclusion information includes at least the number of the P confusing data.

[0335] In some embodiments, the cross-validation interface further includes a confusion analysis option. The confusing data determination unit 12 is specifically configured to: in response to the first object's triggering operation on the confusion analysis option, determine a confusion matrix corresponding to the P confusing data based on the manual classification results and predicted classification results of the P confusing data, where each element of the confusion matrix indicates the number of data having a given manual classification result and a given predicted classification result; and, based on the confusion matrix, determine the Q confusion classification pairs.
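How the confusion classification pairs are read off the confusion matrix is not spelled out in this paragraph, so the following is only one plausible sketch. It assumes that the matrix is stored as counts keyed by (manual result, predicted result), that a pair is unordered (so the counts of the two directions are merged), and that the Q pairs with the highest counts are kept. All names and data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Cell = Tuple[str, str]  # (manual classification result, predicted classification result)


def confusion_matrix(labelled: Sequence[Cell]) -> Counter:
    """Count how many data fall into each (manual, predicted) cell."""
    return Counter(labelled)


def top_confusion_pairs(matrix: Counter, q: int) -> List[Tuple[str, ...]]:
    """Merge both directions of each off-diagonal cell and keep the Q largest."""
    merged: Counter = Counter()
    for (manual, predicted), count in matrix.items():
        if manual != predicted:
            merged[tuple(sorted((manual, predicted)))] += count
    return [pair for pair, _ in merged.most_common(q)]


# Hypothetical (manual, predicted) results for P = 4 confusing data.
labelled = [("A", "B"), ("B", "A"), ("A", "B"), ("C", "A")]
matrix = confusion_matrix(labelled)
pairs = top_confusion_pairs(matrix, q=2)
```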

[0336] In some embodiments, the rule supplementation unit 13 is further configured to display an obfuscation analysis result interface in response to the first object's triggering operation on the obfuscation analysis option, the obfuscation analysis result interface including the Q obfuscation classification pairs and the rule supplementation option; and to display the rule supplementation interface in response to the first object's triggering operation on the rule supplementation option.

[0337] In some embodiments, the rule supplementation interface includes a directory area and an editing area. The editing area includes an existing standard display area, a rule supplementation box, and an obfuscation example display area. The directory area displays Q entries corresponding one-to-one with the Q obfuscation classification pairs, and each entry has a status identifier indicating the rule supplementation completion status. The rule supplementation unit 13 is specifically used to respond to the first object's trigger operation on a first directory entry in the directory area that is in an unprocessed state, displaying an existing classification rule summary of the first obfuscation classification pair corresponding to the first directory entry in the existing standard display area, wherein the first obfuscation classification pair includes two easily obfuscated categories; displaying data information of at least one obfuscated data associated with the first obfuscation classification pair in the obfuscation example display area; and receiving the classification rule supplementation text input by the first object in the rule supplementation box for the first obfuscation classification pair.

[0338] In some embodiments, the rule supplementation interface further includes a rule submission option. The rule supplementation unit 13 is also configured to, in response to the first object's triggering operation on the rule submission option, save the supplementary text of the classification rules for the first obfuscated classification pair, and change the status of the first obfuscated classification pair entry in the directory area from an unprocessed state to a processed state.
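The directory area and its status identifiers can be modelled as a small state holder, sketched below. The class and method names are hypothetical, and rejecting an empty supplementary text is an assumption of this sketch rather than something stated in this application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # two easily confused categories


class Status(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSED = "processed"


@dataclass
class DirectoryEntry:
    pair: Pair
    status: Status = Status.UNPROCESSED
    supplementary_text: str = ""


class RuleSupplementDirectory:
    """Holds one entry per confusion classification pair (Q entries in total)."""

    def __init__(self, pairs: Iterable[Pair]) -> None:
        self.entries: Dict[Pair, DirectoryEntry] = {p: DirectoryEntry(p) for p in pairs}

    def submit(self, pair: Pair, text: str) -> None:
        """Save the supplementary text and mark the entry as processed."""
        if not text.strip():
            raise ValueError("supplementary text must not be empty")  # assumption
        entry = self.entries[pair]
        entry.supplementary_text = text
        entry.status = Status.PROCESSED

    def pending(self) -> List[Pair]:
        """Return the pairs whose rule supplementation is not yet complete."""
        return [p for p, e in self.entries.items() if e.status is Status.UNPROCESSED]


directory = RuleSupplementDirectory([("A", "B"), ("A", "C")])
directory.submit(("A", "B"), "Classify as A only when the text names a team.")
remaining = directory.pending()
```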

[0339] In some embodiments, the classification unit 14 is specifically used to display an adjudication interface, the adjudication interface including an adjudication model list and an adjudication initiation option; receive the first object's selection operation of a target adjudication model in the adjudication model list; and, in response to the first object's triggering operation of the adjudication initiation option, determine the classification result of each of the P obfuscated data based on the target adjudication model, the standard text of the classification rule, the supplementary text of the classification rule, and the Q obfuscated classification pairs.
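One way the target adjudication model could consume the standard text, the supplementary text, and the confusion classification pairs is through a composed prompt, sketched below. This paragraph does not state that a prompt is used, so the prompt layout, the function names, and the `fake_model` stand-in (which takes the place of a real adjudication model) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def build_prompt(standard_text: str, supplements: Dict[Pair, str], item: str) -> str:
    """Compose one adjudication prompt from the rules and the data to classify."""
    lines = [
        "Classification rules:",
        standard_text,
        "",
        "Supplementary rules for easily confused categories:",
    ]
    for (first, second), text in supplements.items():
        lines.append(f"- {first} vs {second}: {text}")
    lines += ["", "Data to classify:", item, "", "Answer with one category name."]
    return "\n".join(lines)


def adjudicate(
    items: List[str],
    standard_text: str,
    supplements: Dict[Pair, str],
    model: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the selected adjudication model for a result for each confusing data."""
    return {
        item: model(build_prompt(standard_text, supplements, item)).strip()
        for item in items
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real adjudication model (assumption): keyword lookup."""
    item = prompt.split("Data to classify:\n", 1)[1]
    return " finance\n" if "stock" in item else "sports"


supplements = {("finance", "sports"): "Club share prices count as finance."}
results = adjudicate(
    ["The club's stock rose 5%.", "The club won 2-0."],
    "finance: markets and money. sports: matches and athletes.",
    supplements,
    fake_model,
)
```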

[0340] In some embodiments, the classification unit 14 is further configured to display a classification result interface, which includes summary information of the classification results of the N data to be classified.

[0341] It should be understood that the device embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments; to avoid repetition, they are not repeated here. Specifically, the apparatus shown in Figure 25 can execute the method embodiments on the classification platform side described above, and the foregoing and other operations and/or functions of each module in the apparatus respectively implement the corresponding method embodiments, which are not described in detail here for brevity.

[0342] Figure 26 is a schematic block diagram of a data classification device provided in an embodiment of this application.

[0343] As shown in Figure 26, the data classification device 20, applied to a manual classification terminal, includes a request receiving unit 21, a display unit 22, a classification result receiving unit 23, and a sending unit 24. The request receiving unit 21 is used to receive a classification request sent by the classification platform. The classification request includes M disputed data and a classification rule standard text, and is used to request manual classification of the M disputed data. The M disputed data are selected from the N data to be classified based on K classification results of each data to be classified, and the K classification results of each disputed data are not completely consistent. The K classification results are obtained by classifying the data to be classified using K first classification models based on the classification rule standard text, and the K first classification models are selected by the first object from the classification model list included in the data input interface. The data to be classified includes at least one of text data, image data, video data, or audio data. The display unit 22 is used to display a classification interface, the classification interface including a data display area and a classification area, the data display area displaying the disputed data and the classification rule standard text. The classification result receiving unit 23 is used to receive the manual classification result of each disputed data entered by the second object in the classification area. The sending unit 24 is used to send the manual classification results of the M disputed data to the classification platform.

[0344] In some embodiments, the classification request further includes K classification results for each of the M disputed data. The classification result receiving unit 23 is specifically configured to display the first disputed data in the data display area and display a classification candidate list for the first disputed data in the classification area. The classification candidate list includes different classification results among the K classification results of the first disputed data, and the first disputed data is one of the disputed data in the M disputed data. In response to the second object's selection operation on the target classification result in the classification candidate list, the target classification result is determined as the manual classification result of the first disputed data.

[0345] In some embodiments, the classification candidate list further includes a manual input option, and the classification result receiving unit 23 is further configured to display a classification input box in response to the second object's triggering operation on the manual input option; and receive the manual classification result of the first disputed data entered by the second object in the classification input box.
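The classification candidate list and the manual input option described above can be sketched as follows. The sketch assumes that the candidate list keeps the distinct results in the order in which the K models produced them, and that manually typed input takes precedence over a selection; both choices and all names are hypothetical.

```python
from typing import List, Optional, Sequence


def candidate_list(k_results: Sequence[str]) -> List[str]:
    """Return the distinct results among the K results, in first-seen order."""
    return list(dict.fromkeys(k_results))


def manual_result(
    candidates: Sequence[str],
    selected: Optional[int] = None,
    typed: Optional[str] = None,
) -> str:
    """Resolve the manual classification result from a selection or typed input."""
    if typed is not None:
        label = typed.strip()
        if not label:
            raise ValueError("typed classification result must not be empty")
        return label
    if selected is None:
        raise ValueError("either a selection or typed input is required")
    return candidates[selected]


candidates = candidate_list(["finance", "tech", "finance"])
picked = manual_result(candidates, selected=1)
typed_in = manual_result(candidates, typed=" health ")
```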

[0346] In some embodiments, the data display area includes the first disputed data and the classification rule standard corresponding to the first disputed data in the classification rule standard text. The classification interface also includes a next option. The classification result receiving unit 23 is further configured to, in response to the second object's triggering operation on the next option, display the second disputed data and the classification rule standard corresponding to the second disputed data in the classification rule standard text in the data display area; and display a classification candidate list of the second disputed data in the classification area, the classification candidate list including different classification results among the K classification results of the second disputed data.

[0347] It should be understood that the device embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments; to avoid repetition, they are not repeated here. Specifically, the apparatus shown in Figure 26 can execute the method embodiments on the manual classification terminal side described above, and the foregoing and other operations and/or functions of each module in the apparatus respectively implement the corresponding method embodiments, which are not described in detail here for brevity.

[0348] The apparatus of this application embodiment has been described above from the perspective of functional modules in conjunction with the accompanying drawings. It should be understood that this functional module can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and / or by software instructions. The steps of the method disclosed in this application embodiment can be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software module can reside in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, etc. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps in the above method embodiments.

[0349] Figure 27 is a schematic block diagram of an electronic device provided in an embodiment of this application. The electronic device can be the classification platform or the manual classification terminal described above.

[0350] As shown in Figure 27, the electronic device 40 may include a memory 41 and a processor 42. The memory 41 stores a computer program 43 and transfers the program code to the processor 42. In other words, the processor 42 can retrieve and run the computer program 43 from the memory 41 to implement the methods described in the embodiments of this application.

[0351] For example, the processor 42 can be used to execute the steps in the method 200 described above according to the instructions in the computer program 43.

[0352] In some embodiments of this application, the processor 42 may include, but is not limited to: General-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0353] In some embodiments of this application, the memory 41 includes, but is not limited to: Volatile memory and / or non-volatile memory. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

[0354] In some embodiments of this application, the computer program 43 may be divided into one or more modules, which are stored in the memory 41 and executed by the processor 42 to carry out the data classification method provided in this application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and these instruction segments describe the execution process of the computer program 43 in the electronic device.

[0355] As shown in Figure 27, the electronic device 40 may further include a transceiver 44, which can be connected to the processor 42 or the memory 41.

[0356] The processor 42 can control the transceiver 44 to communicate with other devices; specifically, it can send information or data to other devices or receive information or data sent by other devices. The transceiver 44 may include a transmitter and a receiver. The transceiver 44 may further include antennas, and the number of antennas may be one or more.

[0357] It should be understood that the various components in the electronic device 40 are connected through a bus system, which includes a data bus, a power bus, a control bus, and a status signal bus.

[0358] According to one aspect of this application, a computer storage medium is provided that stores a computer program thereon, which, when executed by a computer, enables the computer to perform the methods of the above-described method embodiments. Alternatively, embodiments of this application also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the methods of the above-described method embodiments.

[0359] According to another aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method described in the above-described method embodiments.

[0360] In other words, when implemented using software, it can be implemented wholly or partially in the form of a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state disk (SSD)).

[0361] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0362] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or other forms.

[0363] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. For example, the functional modules in the various embodiments of this application may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.

[0364] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A data classification method, characterized in that it is applied to a classification platform and includes: displaying a data input interface and receiving the classification rule standard text and N data to be classified input by the first object in the data input interface, where N is a positive integer; based on the classification rule standard text, selecting P confusing data from the N data to be classified, and determining Q confusion classification pairs corresponding to the P confusing data, where P is a positive integer less than or equal to N, and Q is a positive integer less than or equal to P; displaying a rule supplementation interface and receiving the classification rule supplementary text for the Q confusion classification pairs input by the first object in the rule supplementation interface, where the rule supplementation interface displays the Q confusion classification pairs; and, based on the classification rule supplementary text and the classification rule standard text, determining the classification results of the P confusing data.

2. The method according to claim 1, characterized in that the data input interface further includes a classification model list and a parallel classification start option, and selecting the P confusing data from the N data to be classified based on the classification rule standard text includes: in response to the first object's selection operation on K first classification models in the classification model list, adjusting the state of the K first classification models from unselected to selected, where K is a positive integer greater than 1; in response to the first object's triggering operation on the parallel classification start option, classifying the N data to be classified based on the classification rule standard text using each of the K first classification models, to obtain K classification results for each data to be classified; and, based on the K classification results of each data to be classified, selecting the P confusing data from the N data to be classified.

3. The method according to claim 2, characterized in that, The step of selecting P confusing data points from the N unclassified data points based on the K classification results for each unclassified data point includes: Based on the K classification results of each unclassified data, M disputed data are selected from the N unclassified data. The K classification results of the disputed data are not completely consistent, and M is a positive integer less than or equal to N. Based on the M disputed data, a consistency analysis interface is displayed, which includes allocation options. In response to the first object's triggering operation on the allocation option, a classification request is sent to the manual classification terminal. The classification request includes the M disputed data and the classification rule standard text. The classification request is used to request manual classification of the M disputed data. Obtain the manual classification results of the M disputed data from the manual classification terminal; Based on the manual classification results of the M disputed data, P confusing data are selected from the M disputed data.

4. The method according to claim 3, characterized in that, The manual classification result based on the M disputed data points, selecting the P confusing data points from the M disputed data points, includes: Display a cross-validation interface, which includes a cross-validation start option; in response to the first object's trigger operation on the cross-validation start option, divide the M disputed data into R data subsets, where R is a positive integer greater than 1 and less than M; The second classification model is trained for R rounds based on the R data subsets. For the i-th round of training, R-1 data subsets are selected from the R data subsets as R-1 training sets. The R-1 training sets are not exactly the same as the R-1 training sets selected in each round of training before the i-th round. i is a positive integer from 1 to R. Using the R-1 training sets, the second classification model is trained to obtain the second classification model after the i-th round of training; Using the second classification model after the i-th round of training, each disputed data included in the i-th validation set is classified and predicted to obtain the predicted classification result of each disputed data in the i-th validation set. The i-th validation set is the subset of data in the R data subsets that did not participate in the i-th round of training, and the i-th validation set is different from the validation set of each round of training before the i-th round of training. Based on the predicted classification result and the manual classification result of each of the M disputed data, P confusing data are selected from the M disputed data.

5. The method according to claim 4, characterized in that selecting the P confusing data from the M disputed data based on the predicted classification result and the manual classification result of each of the M disputed data includes: selecting, from the M disputed data, T discrepancy data whose predicted classification result and manual classification result are inconsistent, where T is a positive integer less than or equal to M; for each of the T discrepancy data, determining the confidence level of the second classification model's predicted classification result for that discrepancy data; and, based on the confidence level of each of the T discrepancy data, selecting from the T discrepancy data the P discrepancy data whose confidence level is less than a preset threshold as the P confusing data.

6. The method according to claim 5, characterized in that determining, for each of the T discrepancy data, the confidence level of the second classification model's predicted classification result for that discrepancy data includes: obtaining the word sequence corresponding to the predicted classification result of the discrepancy data, where the word sequence includes L words and L is a positive integer; obtaining the log probability corresponding to each word in the process of the second classification model generating the word sequence; summing the log probabilities corresponding to the L words to obtain a cumulative value; and performing an exponential operation on the cumulative value to obtain the confidence level of the second classification model for the predicted classification result.

7. The method according to claim 4, characterized in that, The cross-validation interface also includes a validation result display area, and the method further includes: After selecting the P confused data from the M disputed data, the verification result information is displayed in the verification result display area. The verification result information includes at least one of the training set and validation set selected in each of the R rounds of training, and the cross-validation conclusion information. The cross-validation conclusion information includes at least the number of the P confused data.

8. The method according to claim 5, characterized in that the cross-validation interface further includes a confusion analysis option, and determining the Q confusion classification pairs corresponding to the P confusing data includes: in response to the first object's triggering operation on the confusion analysis option, determining a confusion matrix corresponding to the P confusing data based on the manual classification results and predicted classification results of the P confusing data, where each element of the confusion matrix indicates the number of data having a given manual classification result and a given predicted classification result; and, based on the confusion matrix, determining the Q confusion classification pairs.

9. The method according to claim 8, characterized in that the method further includes: in response to the first object's triggering operation on the confusion analysis option, displaying a confusion analysis result interface, which includes the Q confusion classification pairs and a rule supplementation option; and displaying the rule supplementation interface includes: in response to the first object's triggering operation on the rule supplementation option, displaying the rule supplementation interface.

10. The method according to any one of claims 1-9, characterized in that, the rule supplementation interface includes a directory area and an editing area; the editing area includes an existing standard display area, a rule supplementation box, and a confusion example display area; the directory area displays Q entries in one-to-one correspondence with the Q confusion classification pairs, each entry having a status indicator that indicates whether rule supplementation is complete; and receiving the classification rule supplementary text for the Q confusion classification pairs input by the first object in the rule supplementation interface includes: in response to a trigger operation by the first object on a first directory entry in the directory area that is in an unprocessed state, displaying, in the existing standard display area, a summary of the existing classification rules for the first confusion classification pair corresponding to the first directory entry, the first confusion classification pair including two easily confused categories, and displaying, in the confusion example display area, data information of at least one confused data associated with the first confusion classification pair; and receiving the classification rule supplementary text for the first confusion classification pair input by the first object in the rule supplementation box.

11. The method according to claim 10, characterized in that, the rule supplementation interface further includes a rule submission option, and the method further includes: in response to a trigger operation by the first object on the rule submission option, saving the classification rule supplementary text for the first confusion classification pair, and changing the status of the entry for the first confusion classification pair in the directory area from unprocessed to processed.

12. The method according to any one of claims 1-9 and 11, characterized in that, determining the classification results of the P confused data based on the classification rule supplementary text and the classification rule standard text includes: displaying an adjudication interface, which includes an adjudication model list and an adjudication start option; receiving a selection operation by the first object on a target adjudication model in the adjudication model list; and in response to a trigger operation by the first object on the adjudication start option, determining, by the target adjudication model, the classification result of each of the P confused data based on the classification rule standard text, the classification rule supplementary text, and the Q confusion classification pairs.
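
One way the adjudication step of claim 12 might combine its three inputs is sketched below. The prompt layout, the function names, and the `stub_model` callable are all hypothetical; the claim does not specify how the target adjudication model consumes the standard text, the supplementary text, or the confusion classification pairs. The model is passed in as a plain callable so that no particular model API is assumed.

```python
def build_adjudication_prompt(standard_text, supplements, item_text):
    """Assemble one prompt from the standard text, per-pair supplements, and one datum."""
    lines = ["Classification rule standard:", standard_text, "",
             "Supplementary rules for confusion pairs:"]
    for (cat_a, cat_b), rule in supplements.items():
        lines.append(f"- {cat_a} vs {cat_b}: {rule}")
    lines += ["", "Data to classify:", item_text, "",
              "Answer with exactly one category name."]
    return "\n".join(lines)

def adjudicate(items, standard_text, supplements, model):
    """Classify each confused datum with the selected adjudication model."""
    return [model(build_adjudication_prompt(standard_text, supplements, text)).strip()
            for text in items]

def stub_model(prompt):
    """Stand-in for a real model: inspects only the datum section of the prompt."""
    item_part = prompt.split("Data to classify:\n")[1]
    return " esports " if "video game" in item_part else "sports"

supplements = {("esports", "sports"):
               "Video game competitions are esports, even when held in a stadium."}
items = ["A video game tournament filled the arena.", "The marathon starts at dawn."]
results = adjudicate(items, "Assign each item to exactly one category.",
                     supplements, stub_model)
```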

13. A data classification method, characterized in that, the method is applied to a manual classification terminal and includes: receiving a classification request sent by a classification platform, wherein the classification request includes M disputed data and a classification rule standard text, and is used to request manual classification of the M disputed data; the M disputed data are selected from N data to be classified based on K classification results for each of the N data to be classified, the K classification results of each disputed data being not completely consistent; the K classification results are obtained by classifying the data to be classified using K first classification models based on the classification rule standard text; and the K first classification models are selected by a first object from a classification model list included in a data input interface; displaying a classification interface, which includes a data display area and a classification area, wherein the data display area displays the disputed data and the classification rule standard text; receiving the manual classification result for each disputed data input by a second object in the classification area; and sending the manual classification results of the M disputed data to the classification platform.
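
The selection criterion that claim 13 attributes to the classification platform (K classification results that are "not completely consistent") can be sketched in a few lines. The function name and the sample results are hypothetical, and the sketch assumes the K results for each datum are available as a list of category labels.

```python
def select_disputed(results_per_item):
    """Return the indices of data whose K classification results are not all identical."""
    return [index for index, results in enumerate(results_per_item)
            if len(set(results)) > 1]

# N = 3 data to be classified, K = 3 first classification models.
results_per_item = [
    ["finance", "finance", "finance"],   # unanimous, not disputed
    ["finance", "tech", "finance"],      # disputed
    ["sports", "sports", "esports"],     # disputed
]
disputed_indices = select_disputed(results_per_item)
```

In this example M is `len(disputed_indices)`, and only those data would be sent to the manual classification terminal.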

14. The method according to claim 13, characterized in that, the classification request further includes the K classification results for each of the M disputed data, and receiving the manual classification result for each disputed data input by the second object in the classification area includes: displaying first disputed data in the data display area, and displaying a classification candidate list for the first disputed data in the classification area, wherein the classification candidate list includes the distinct classification results among the K classification results of the first disputed data, and the first disputed data is any one of the M disputed data; and in response to a selection operation by the second object on a target classification result in the classification candidate list, determining the target classification result as the manual classification result of the first disputed data.
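
The classification candidate list of claim 14 is the set of distinct results among the K results. A minimal sketch, assuming the list should keep the order in which categories first appear (the claim does not specify an ordering); the function name is hypothetical.

```python
def candidate_list(k_results):
    """Distinct classification results, in order of first appearance."""
    return list(dict.fromkeys(k_results))

candidates = candidate_list(["tech", "finance", "tech"])
```

Showing only the categories the models actually disagreed on keeps the annotator's choice small, while the manual input option of claim 15 covers the case where none of them is correct.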

15. The method according to claim 14, characterized in that, the classification candidate list further includes a manual input option, and the method further includes: in response to a trigger operation by the second object on the manual input option, displaying a classification input box; and receiving the manual classification result of the first disputed data input by the second object in the classification input box.

16. The method according to claim 14, characterized in that, the data display area includes the first disputed data and the classification rule standard corresponding to the first disputed data in the classification rule standard text; the classification interface further includes a next option; and the method further includes: in response to a trigger operation by the second object on the next option, displaying, in the data display area, second disputed data and the corresponding classification rule standard in the classification rule standard text; and displaying, in the classification area, a classification candidate list for the second disputed data, which includes the distinct classification results among the K classification results of the second disputed data.

17. A data classification device, characterized in that, the device includes: a data input unit, used to display a data input interface and receive the classification rule standard text and N data to be classified input by a first object in the data input interface, where N is a positive integer; a confused data determination unit, used to select P confused data from the N data to be classified based on the classification rule standard text, and determine Q confusion classification pairs corresponding to the P confused data, where P is a positive integer less than or equal to N and Q is a positive integer less than or equal to P; a rule supplementation unit, used to display a rule supplementation interface and receive the classification rule supplementary text for the Q confusion classification pairs input by the first object in the rule supplementation interface, wherein the rule supplementation interface displays the Q confusion classification pairs; and a classification unit, used to determine the classification results of the P confused data based on the classification rule supplementary text and the classification rule standard text.

18. A data classification device, characterized in that, the device includes: a request receiving unit, used to receive a classification request sent by a classification platform, wherein the classification request includes M disputed data and a classification rule standard text, and is used to request manual classification of the M disputed data; the M disputed data are selected from N data to be classified based on K classification results for each of the N data to be classified, the K classification results of each disputed data being not completely consistent; the K classification results are obtained by classifying the data to be classified using K first classification models based on the classification rule standard text; and the K first classification models are selected by a first object from a classification model list included in a data input interface; a display unit, used to display a classification interface, which includes a data display area and a classification area, wherein the data display area displays the disputed data and the classification rule standard text; a classification result receiving unit, used to receive the manual classification result for each disputed data input by a second object in the classification area; and a sending unit, used to send the manual classification results of the M disputed data to the classification platform.

19. An electronic device, characterized in that, the device includes a processor and a memory; the memory is used to store a computer program; and the processor is configured to execute the computer program to implement the method according to any one of claims 1 to 12 or 13 to 16.

20. A computer-readable storage medium, characterized in that, the storage medium is used to store a computer program, and the computer program causes a computer to perform the method according to any one of claims 1 to 12 or 13 to 16.