Data classification method for screening outlier character data
The data classification method uses character embeddings and a generative pre-trained transformer model to automate outlier detection and labeling. It addresses the inefficiency of manually labeling large datasets, which improves model robustness and reduces cost.
Patent Information
- Authority / Receiving Office
- JP
- Patent Type
- Patents
- Current Assignee / Owner
- HTC CORP
- Filing Date
- 2024-08-29
- Publication Date
- 2026-06-12
AI Technical Summary
Manually identifying and removing outliers from large datasets is costly and time-consuming. Outliers left in the training data can cause model deviation, increased complexity, overfitting, and decreased robustness, particularly in large-scale language models.
A data classification method converts character samples into character embeddings in a semantic space, applies an outlier detection algorithm to select partial samples for manual labeling, and employs a generative pre-trained transformer model to generate normal-outlier prediction labels for the unlabeled samples, reducing the need for extensive manual labor.
This approach significantly reduces the time and cost of labeling large datasets by automating the identification and removal of outliers, improving model quality and efficiency in large-scale language models.
Abstract
Description
[Technical Field]
【0001】 The present invention relates to a classification method, and more particularly to a classification method for identifying normal values or outliers in unlabeled data.
[Background Art]
【0002】 In machine learning, outlier detection is a common operation used to identify data instances in a dataset that deviate significantly from the normal distribution. Outlier detection is important in a variety of application areas, including medical prevention, fraud detection, network security, quality control, and anomaly detection in healthcare or industrial processes.
[Summary of the Invention]
【0003】 This disclosure provides a data classification method comprising the steps of: obtaining a plurality of character samples from a dataset; converting the plurality of character samples into a plurality of character embeddings in a semantic space; generating, by an outlier detection algorithm, an outlier-normal order of the plurality of character samples based on the distances between the character embeddings in the semantic space; selecting a plurality of subsamples from the character samples based on the outlier-normal order; receiving a manual input command that specifies a plurality of manual input labels for the subsamples; generating a prompt message that includes a task instruction, anchor data, and unlabeled data, based on the subsamples with the manual input labels and a plurality of unlabeled samples among the character samples; and providing the prompt message to a generative pre-trained transformer model to generate a plurality of normal-outlier prediction labels associated with the plurality of unlabeled samples.
【0004】 This disclosure provides another data classification method comprising the steps of: obtaining a plurality of character samples from a dataset; converting the plurality of character samples into a plurality of character embeddings in a semantic space; generating, by an outlier detection algorithm, an outlier-normal order of the character samples based on the distances between the character embeddings in the semantic space; selecting a plurality of subsamples from the character samples based on the outlier-normal order; receiving a manual input command that specifies a plurality of manual input labels for the subsamples; generating a first prompt message including the subsamples with the manual input labels and a feature engineering task instruction; providing the first prompt message to a generative pre-trained transformer model to generate a plurality of distinguishable features; generating a second prompt message including the distinguishable features, the character samples, and a feature scoring task instruction; providing the second prompt message to the generative pre-trained transformer model to generate a plurality of feature predictions for the distinguishable features of the character samples; and running a classification algorithm based on the feature predictions of the subsamples and of the unlabeled samples among the character samples to generate a plurality of normal-outlier prediction labels for the unlabeled samples.
【0005】 It should be understood that both the foregoing general description and the following detailed description are given by way of example and are intended to provide further explanation of the invention as claimed.
[Brief Description of the Drawings]
【0006】
[Figure 1] This is a schematic block diagram of an electronic device according to one embodiment of the present disclosure.
[Figure 2] This is a flowchart of a data classification method according to one embodiment of the present disclosure.
[Figure 3A] This is a schematic diagram of character samples obtained from a dataset according to one embodiment of the present disclosure.
[Figure 3B] This is a schematic diagram of character embeddings in a semantic space for character samples obtained from a dataset according to one embodiment of the present disclosure.
[Figure 3C] This is a schematic diagram of the outlier-normal order of the character samples according to one embodiment of the present disclosure.
[Figure 3D] This is a schematic diagram of partial samples and the manual input labels associated with the partial samples according to one embodiment of the present disclosure.
[Figure 3E] This is a diagram showing how a prompt message is generated based on the partial samples with manual input labels and the unlabeled samples among the character samples according to one embodiment of the present disclosure.
[Figure 4] This is a schematic diagram of a prompt message according to one embodiment of the present disclosure.
[Figure 5] This is a flowchart of a data classification method according to another embodiment of the present disclosure.
[Figure 6] This is a schematic diagram of a first prompt message according to one embodiment of the present disclosure.
[Figure 7] This is a schematic diagram of a second prompt message according to one embodiment of the present disclosure.
[Figure 8] This is a diagram showing how a classification algorithm generates normal-outlier prediction labels according to one embodiment of the present disclosure.
[Modes for Carrying Out the Invention]
【0007】 Embodiments of this disclosure are described below with reference to the relevant drawings. In the drawings, the same reference numerals denote the same or similar elements, or the same or similar steps of a method.
【0008】 When developing a chatbot or a large-scale language model, collecting a training dataset containing input characters and corresponding output characters is crucial.
For example, when developing a chatbot, the input characters may be exemplary questions from potential users, and the output characters may be the answers the chatbot should provide. This training dataset is used not only for model development (e.g., building the chatbot) but also for model evaluation (e.g., evaluating the accuracy or effectiveness of the chatbot). In conventional techniques, training datasets are manually labeled by experts, but hiring such a team is very expensive. Using an automated generator to generate the training dataset (that is, the corresponding training character samples) instead of having personnel perform the labeling reduces the cost and time required to build the training dataset. Because the quality of training character samples generated by an automated generator is inconsistent, the generated character samples need to be validated.
【0009】 This validation removes low-quality samples (e.g., incorrectly generated characters) from the training dataset so that only validated character samples are used during model development and evaluation. Since low-quality samples seriously degrade the effectiveness of the model, removing them during validation helps improve the quality of the models built afterwards.
【0010】 In some cases, an automatically generated dataset may contain more than 10,000 character samples that require validation. One solution is to hire professional labelers to manually check the quality of every character sample. However, manual inspection is very costly and extremely slow.
【0011】 Refer to Figures 1 and 2 together. Figure 1 is a schematic block diagram of an electronic device 100 according to one embodiment of the present disclosure. Figure 2 is a flowchart of a data classification method 200 according to one embodiment of the present disclosure.
The data classification method 200 is used to generate normal-outlier predictions for the character samples in a dataset DB. The dataset DB may be obtained from a character content server. For example, the dataset DB may be collected from an encyclopedia website, a news website, a question and answer database, a forum, a novel storage server, a journal database, or any similar character content repository.
【0012】 As shown in Figure 1, the electronic device 100 includes a processing unit 110, an input interface 120, a storage unit 130, a display 140, and a communication circuit 150. In some embodiments, the electronic device 100 may be a computer, a smartphone, a tablet computer, an image processing server, a data server, or any image processing device having similar functions. The input interface 120 is used to receive the dataset DB and manual input commands. In some embodiments, the electronic device 100 is used to classify the character samples in the dataset DB, generate normal-outlier predictions for the character samples, and, based on those predictions, remove outlier samples and retain normal samples.
【0013】 The input interface 120 may include a data transmission interface, a wireless communication circuit, a keyboard, a mouse, a microphone, or any input device having similar functionality. The processing unit 110 is coupled to the input interface 120, the storage unit 130, the display 140, and the communication circuit 150. The storage unit 130 stores program code that instructs the processing unit 110 to execute the data classification method 200 shown in Figure 2. In some embodiments, the processing unit 110 may be a processor, a graphics processor, an application-specific integrated circuit, or a processing circuit having similar functionality.
The communication circuit 150 may be a network transceiver (e.g., a wireless local area network transceiver, a telecommunications transceiver, or an Ethernet transceiver). The communication circuit 150 is used to connect to and communicate with the generative pre-trained transformer model 190. In some embodiments, the generative pre-trained transformer model 190 may be a standalone model running on an external server separate from the electronic device 100.
【0014】 Ideally, machine learning models are trained on training datasets free of outliers. Outliers in the training dataset can affect the effectiveness and behavior of the machine learning model in various ways. For example, they can lead to problems such as model deviation, increased model complexity, overfitting, decreased robustness, and difficulty in anomaly detection.
【0015】 In large-scale language model applications, the unlabeled character paragraphs in the training dataset DB may include different types of character content, such as news reports, popular novels, literary works, bank statements, chat logs, questions and answers, research papers, or program code. These types of character content serve different purposes in different fields. To train a machine learning model for a specific purpose, this character content must be filtered to remove noisy data and extract clean data, such as valid character samples.
【0016】 For example, in one embodiment, when a machine learning model is trained to answer medical questions, character content related to medical records, health discussions, diagnoses, symptoms and/or medical history should be considered normal character samples, while character content related to global warming, financial crises and/or baseball games should be considered outliers.
【0017】 Manually labeling normal and outlier samples in a dataset DB can be extremely time-consuming and costly, especially for large datasets. Experts would need to verify and identify normal and outlier samples by hand, which may not be practical for datasets with a massive number of character samples.
【0018】 In some embodiments, the electronic device 100 and the data classification method 200 provide a time-efficient and cost-efficient way to generate a normal-outlier prediction for each of the character samples in a dataset DB.
【0019】 As shown in Figures 1 and 2, the processing unit 110 performs step S210 to obtain character samples from the dataset DB. See further Figure 3A. Figure 3A is a schematic diagram of character samples TS1, TS2, TS3…TS14 obtained from the dataset DB according to one embodiment of the present disclosure. Character samples TS1-TS14 may be character paragraphs (e.g., sentences, paragraphs, or texts) or combinations of questions and corresponding answers. The number of character samples TS1-TS14 shown in Figure 3A is a simplified example for illustrative purposes and does not limit this disclosure. In practice, the dataset DB may contain hundreds, thousands, or more character samples. In some embodiments, character samples TS1-TS14 may be obtained from the dataset DB by a character segmentation algorithm performed by the processing unit 110.
【0020】 For example, character sample TS1 could be "Question: How can I get rid of a headache? Answer: Try placing a heat pad on your neck," character sample TS2 could be "Question: How can I reduce belly fat? Answer: Moderate exercise and a diet plan may help," and character sample TS3 could be "Question: What's the weather like today? Answer: It's sunny outside."
The character samples TS1-TS14 may include normal samples related to the medical topic (e.g., character samples TS1 and TS2) and outlier samples not related to the medical topic (e.g., character sample TS3). As shown in Figure 3A, assume that character samples TS3, TS6, TS7, TS11, and TS13 are noise data (i.e., outlier samples not related to the medical topic), and character samples TS1, TS2, TS4, TS5, TS8, TS9, TS10, TS12, and TS14 are clean data (i.e., normal samples related to the medical topic). In this embodiment, the character samples TS1-TS14 are initially unlabeled. The electronic device 100 and the data classification method 200 are used to label the character samples TS1-TS14 so that the noise data can be separated from the clean data.
【0021】 As shown in Figures 1 and 2, the processing unit 110 performs step S220, which converts character samples TS1-TS14 into character embeddings in a semantic space. See also Figure 3B, which is a schematic diagram of character embeddings eTS1-eTS14 in the semantic space SS for the character samples TS1-TS14 obtained from the dataset DB according to one embodiment of the present disclosure.
【0022】 The character embedding eTS1 is a projection of the character sample TS1 onto a high-dimensional latent space. In the semantic space SS, eTS1 is a vector, that is, a sequence of numbers. For illustrative purposes, the semantic space SS in Figure 3B is drawn as a two-dimensional distribution; in reality, the semantic space SS may have many more dimensions, such as 768 or 1536. If two of the character samples TS1-TS14 have similar meanings, their corresponding character embeddings are close to each other in the semantic space SS. Conversely, if two of the character samples TS1-TS14 have different meanings, their corresponding character embeddings are far apart in the semantic space SS.
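The relationship between meaning and distance described above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration only; the source does not specify an embedding model or a distance metric, so Euclidean distance and the variable names are assumptions.

```python
import math

# Hypothetical 3-dimensional embeddings with made-up values. A real
# semantic space may have hundreds of dimensions (e.g., 768 or 1536).
embeddings = {
    "TS1": [0.9, 0.1, 0.0],  # headache question (medical topic)
    "TS2": [0.8, 0.2, 0.1],  # belly-fat question (medical topic)
    "TS3": [0.0, 0.1, 0.9],  # weather question (not medical)
}


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Samples with similar meaning should be closer than unrelated samples.
d_similar = euclidean_distance(embeddings["TS1"], embeddings["TS2"])
d_different = euclidean_distance(embeddings["TS1"], embeddings["TS3"])
print(f"TS1-TS2: {d_similar:.3f}, TS1-TS3: {d_different:.3f}")
```

With these toy values the two medical samples lie much closer to each other than either does to the weather sample, which is the property the later outlier detection step relies on.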
【0023】 As shown in Figures 1 and 2, the processing unit 110 performs step S230, which uses an outlier detection algorithm to generate an outlier-normal order of the character samples TS1-TS14 based on the distances between the character embeddings eTS1-eTS14 in the semantic space SS. See also Figure 3C, which is a schematic diagram of the outlier-normal order RK of the character samples TS1-TS14 according to one embodiment of the present disclosure.
【0024】 In some embodiments, the outlier detection algorithm may be a random sample consensus nearest-neighbor (RANSAC-NN) algorithm. In some other embodiments, the outlier detection algorithm may be an isolation forest algorithm or a local outlier factor algorithm.
【0025】 The processing unit 110 runs the outlier detection algorithm (e.g., the RANSAC-NN algorithm) on the character samples TS1-TS14 based on the distances between the character embeddings eTS1-eTS14 in the semantic space SS, and thereby obtains a "normal score" for each sample. Samples whose character embeddings lie close to the other character embeddings among eTS1-eTS14 have a high "normal score," while samples whose character embeddings lie far from the others have a low "normal score." A "normal score" closer to 1 indicates a higher-quality character sample, and a lower "normal score" indicates a lower-quality character sample. Since each of the character samples TS1-TS14 has its own "normal score," an outlier-normal order RK can be generated for the character samples TS1-TS14.
【0026】 Based on the outlier-normal order RK shown in Figure 3C, character samples TS11, TS6, TS9, TS13, and TS5 (which have low normal scores) tend to belong to the outlier samples. Accordingly, as shown in Figure 3C, character samples TS11, TS6, TS9, TS13, and TS5 are all ranked at the top of the outlier-normal order RK.
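The scoring-and-ranking idea can be sketched as follows. This is not the algorithm named in the description: it is a deliberately simple stand-in that scores each sample by the mean distance to its k nearest neighbors. The function names, the `1 / (1 + d)` scoring formula, the parameter `k`, and the toy points are all assumptions made for illustration.

```python
import math


def normal_scores(embeddings, k=2):
    """Score each sample by the mean distance to its k nearest neighbors.

    The score is 1 / (1 + mean_distance), so it lies in (0, 1] and is
    higher for samples that sit close to other samples.
    """
    scores = {}
    for name, vec in embeddings.items():
        dists = sorted(
            math.dist(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        nearest = dists[:k]
        scores[name] = 1.0 / (1.0 + sum(nearest) / len(nearest))
    return scores


def outlier_normal_order(scores):
    """Names sorted from lowest normal score (most outlier-like) to highest."""
    return sorted(scores, key=scores.get)


# Toy 2-D points: A-D form a tight group, E sits far away from it.
toy = {
    "A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1],
    "D": [0.1, 0.1], "E": [3.0, 3.0],
}
scores = normal_scores(toy)
order = outlier_normal_order(scores)
print(order[0])  # the most outlier-like sample in the toy data
```

Any detector that yields a per-sample score (for example an isolation forest or local outlier factor implementation) could replace `normal_scores` without changing the ranking step.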
【0027】 Based on the outlier-normal order RK shown in Figure 3C, character samples TS14, TS1, TS12, TS4, TS10, TS8, TS3, TS2, and TS7 (which have high normal scores) tend to belong to the normal samples. Accordingly, as shown in Figure 3C, character samples TS14, TS1, TS12, TS4, TS10, TS8, TS3, TS2, and TS7 are positioned at the bottom of the outlier-normal order RK.
【0028】 As shown in Figures 1, 2, and 3C, the processing unit 110 performs step S240 to select partial samples PS from the character samples TS1-TS14 based on the outlier-normal order RK. As shown in Figure 3C, character samples TS11, TS6, TS9, TS13, and TS5, which tend to belong to the outlier samples according to the outlier-normal order RK, are selected as the partial samples PS. The partial samples PS are transmitted from the processing unit 110 to the display 140.
【0029】 In some embodiments, a labeler can view and verify the partial samples PS on the display 140 and then enter the manual input labels MLB associated with the partial samples PS via the input interface 120. Refer further to Figure 3D, which is a schematic diagram of the partial samples PS and the manual input labels MLB associated with the partial samples PS according to one embodiment of the present disclosure.
【0030】 As shown in Figure 3D, the content of character samples TS11, TS6, TS9, TS13, and TS5, which make up the partial samples PS, may be displayed on the display 140 shown in Figure 1. The labeler can read the content of character samples TS11, TS6, TS9, TS13, and TS5 and then enter the manual input labels MLB for the partial samples PS via the input interface 120. In some embodiments, the labeler may choose to remove character samples TS11, TS6, and TS13 (not related to a medical topic) and retain character samples TS9 and TS5 (related to a medical topic).
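A minimal sketch of the selection and labeling steps, using the sample names and labels from the example above. The within-group ordering of the list, the helper name `select_partial_samples`, the label strings `"keep"` and `"remove"`, and the dictionary layout are assumptions made for illustration.

```python
def select_partial_samples(order, num_partial):
    """Split an outlier-normal order into partial and unlabeled samples.

    The first `num_partial` entries (most outlier-like) go to the labeler;
    the rest stay unlabeled.
    """
    return order[:num_partial], order[num_partial:]


# Outlier-normal order RK as listed in the description, most outlier-like first.
rk = ["TS11", "TS6", "TS9", "TS13", "TS5",
      "TS14", "TS1", "TS12", "TS4", "TS10", "TS8", "TS3", "TS2", "TS7"]
partial, unlabeled = select_partial_samples(rk, num_partial=5)

# Manual input labels from the example: the labeler removes TS11, TS6 and
# TS13 and keeps TS9 and TS5.
manual_labels = {"TS11": "remove", "TS6": "remove", "TS9": "keep",
                 "TS13": "remove", "TS5": "keep"}
retained = [s for s in partial if manual_labels[s] == "keep"]
removed = [s for s in partial if manual_labels[s] == "remove"]
print(retained, removed, len(unlabeled))
```

Only the five partial samples require human attention here; the remaining nine stay unlabeled for the later automated steps.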
【0031】 As shown in Figures 1, 2 and 3D, the input interface 120 receives a manual input command (related to the manual input labels MLB) and performs step S250, which specifies the manual input labels MLB for the partial samples PS. As shown in Figure 3D, the manual input labels MLB include retention labels for character samples TS5 and TS9, so character samples TS5 and TS9 may be regarded as retained samples TS_KP within the partial samples PS. On the other hand, as shown in Figure 3D, the manual input labels MLB include removal labels for character samples TS6, TS11 and TS13, so character samples TS6, TS11 and TS13 may be regarded as removed samples TS_RMV within the partial samples PS.
【0032】 As shown in Figures 3C and 3D, the partial samples PS that tend to belong to the outlier samples (as evaluated by the outlier detection algorithm through the outlier-normal order RK) may be selected from the full set of character samples TS1-TS14. In this case, the labeler does not need to manually label every one of the character samples TS1-TS14; only the partial samples PS that tend to belong to the outlier samples are presented to the labeler. The other character samples not included in the partial samples PS (e.g., character samples TS14, TS1, TS12, TS4, TS10, TS8, TS3, TS2, and TS7 in the embodiment shown in Figure 3C) are regarded as unlabeled samples ULS. The number of partial samples PS is less than the number of unlabeled samples ULS. In actual operation, since the dataset DB may contain 10,000 character samples, the number of partial samples PS will be significantly smaller than the number of unlabeled samples ULS, and the partial samples PS will represent only a small fraction of all the character samples in the dataset DB.
【0033】 The above embodiment describes collecting manual labels in a single round.
In some other embodiments, step S230 (generating another outlier-normal order for the unlabeled samples ULS), step S240 (selecting another set of partial samples), and step S250 (receiving another set of manual input labels MLB) may be performed again on the unlabeled samples ULS (e.g., character samples TS14, TS1, TS12, TS4, TS10, TS8, TS3, TS2, and TS7 in the embodiment shown in Figure 3C). In some embodiments, steps S230, S240, and S250 may be repeated a certain number of times.
【0034】 As shown in Figures 1 and 2, the processing unit 110 executes step S260, which generates a prompt message based on the partial samples PS with the manual input labels MLB and the unlabeled samples ULS among the character samples TS.
【0035】 Refer further to Figure 3E. Figure 3E is a diagram showing how the prompt message PM is generated based on the partial samples PS with the manual input labels MLB and the unlabeled samples ULS among the character samples TS according to one embodiment of the present disclosure.
【0036】 In some embodiments, the prompt message PM may include anchor data D_AN. As shown in Figures 2 and 3E, in some embodiments, step S260 further includes steps S261, S262, and S263 to generate the anchor data D_AN of the prompt message PM. The processing unit 110 executes step S261 to calculate the cluster distribution of the retained samples TS_KP and to select first anchor samples TS_KAN from the retained samples TS_KP based on that cluster distribution. In some embodiments, the retained samples TS_KP adjacent to the cluster center of the retained samples TS_KP are selected as the first anchor samples TS_KAN. The other retained samples TS_KP that are not selected as first anchor samples TS_KAN are regarded as remaining retained samples TS_Kleft.
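Selecting anchor samples near a cluster center can be sketched as below. The source does not say how the cluster distribution is computed, so this sketch assumes the simplest case: the group is treated as a single cluster whose center is the mean of its embeddings. The sample names `K1`-`K3`, their coordinates, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def split_anchors(samples, num_anchors=1):
    """Pick the samples nearest the group's center as anchors.

    `samples` maps a sample name to its embedding. Returns a pair
    (anchor_names, remaining_names), both ordered by distance to the center.
    """
    center = centroid(list(samples.values()))
    by_distance = sorted(samples, key=lambda n: math.dist(samples[n], center))
    return by_distance[:num_anchors], by_distance[num_anchors:]


# Hypothetical embeddings for three retained samples.
retained_embeddings = {"K1": [0.0, 0.0], "K2": [1.0, 0.0], "K3": [0.4, 0.1]}
anchors, remaining = split_anchors(retained_embeddings, num_anchors=1)
print(anchors, remaining)
```

The same function can be applied separately to the removed samples to obtain the second group of anchors, and the two anchor lists can then be merged into the anchor data.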
【0037】 The processing unit 110 executes step S262 to calculate the cluster distribution of the removed samples TS_RMV and to select second anchor samples TS_RAN from the removed samples TS_RMV based on that cluster distribution. In some embodiments, the removed samples TS_RMV adjacent to the cluster center of the removed samples TS_RMV are selected as the second anchor samples TS_RAN. The other removed samples TS_RMV that are not selected as second anchor samples TS_RAN are regarded as remaining removed samples TS_Rleft. The processing unit 110 executes step S263 to merge the first anchor samples TS_KAN and the second anchor samples TS_RAN to form the anchor data D_AN in the prompt message PM.
【0038】 In some embodiments, the prompt message PM may include unlabeled data D_UL. The unlabeled data D_UL may be a mixture of the unlabeled samples ULS and several calibration samples. As shown in Figures 2 and 3E, in some embodiments, step S260 further includes steps S264, S265 and S266 to generate the unlabeled data D_UL of the prompt message PM. The processing unit 110 executes step S264 to calculate the similarity between the remaining retained samples TS_Kleft (for example, the retained samples TS_KP not selected as first anchor samples TS_KAN) and the unlabeled samples ULS, and to select first calibration samples TS_Kcal from the remaining retained samples TS_Kleft. In some embodiments, several remaining retained samples TS_Kleft that are similar to the unlabeled samples ULS are selected as the first calibration samples TS_Kcal.
【0039】 The processing unit 110 executes step S265 to calculate the similarity between the remaining removed samples TS_Rleft (for example, the removed samples TS_RMV not selected as second anchor samples TS_RAN) and the unlabeled samples ULS, and to select second calibration samples TS_Rcal from the remaining removed samples TS_Rleft.
In some embodiments, several remaining removed samples TS_Rleft that are similar to the unlabeled samples ULS are selected as the second calibration samples TS_Rcal.
【0040】 The processing unit 110 executes step S266 to mix the first calibration samples TS_Kcal and the second calibration samples TS_Rcal with the unlabeled samples ULS to form the unlabeled data D_UL in the prompt message PM.
【0041】 Refer further to Figure 4. Figure 4 is a schematic diagram of the prompt message PM generated in step S260 (including steps S261-S266 shown in Figure 3E) according to one embodiment of the present disclosure. As shown in Figure 4, the prompt message PM includes the task instruction INST, the anchor data D_AN generated based on the partial samples PS, the anchor label data LB_AN, and the unlabeled data D_UL.
【0042】 As shown in Figure 4, the task instruction INST in the prompt message PM tells the generative pre-trained transformer model 190 to identify outliers in a specific dataset. The task instruction INST instructs the generative pre-trained transformer model 190 to act as a data analyst and, based on the anchor data D_AN and the anchor label data LB_AN, to predict the normal/outlier labels for the character samples within the unlabeled data D_UL.
【0043】 In the embodiments shown in Figures 3D and 4, character samples TS11 and TS6 from the removed samples TS_RMV and character sample TS9 from the retained samples TS_KP are used to form the anchor data D_AN in the prompt message PM, and the manual input labels associated with character samples TS11, TS6, and TS9 are used to form the anchor label data LB_AN in the prompt message PM.
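Assembling such a prompt message can be sketched as follows. The source does not give the exact wording of the task instruction or the layout of the message, so the instruction text, the bullet layout, the `normal`/`outlier` anchor label strings, the seeded shuffle used to mix calibration samples in, and the `<text of …>` placeholders are all assumptions for illustration.

```python
import random


def build_prompt(instruction, anchors, unlabeled, calibration, seed=0):
    """Compose a prompt from an instruction, labeled anchors and unlabeled data.

    `anchors` is a list of (text, label) pairs; `unlabeled` and `calibration`
    are lists of texts. Calibration texts are mixed into the unlabeled data
    without their labels. Returns (prompt_text, mixed_unlabeled_list).
    """
    mixed = list(unlabeled) + list(calibration)
    random.Random(seed).shuffle(mixed)  # hide which entries are calibration
    lines = [instruction, "", "Anchor data with labels:"]
    lines += [f"- [{label}] {text}" for text, label in anchors]
    lines += ["", "Unlabeled data:"]
    lines += [f"{i}. {text}" for i, text in enumerate(mixed, start=1)]
    return "\n".join(lines), mixed


# Hypothetical instruction wording.
instruction = ("You are a data analyst. Using the labeled anchor data as a "
               "guide, label each unlabeled item as normal or outlier.")
anchors = [("<text of TS11>", "outlier"), ("<text of TS6>", "outlier"),
           ("<text of TS9>", "normal")]
unlabeled = [f"<text of TS{i}>" for i in (14, 1, 12, 4, 10, 8, 3, 2, 7)]
calibration = ["<text of TS5>", "<text of TS13>"]

prompt, mixed = build_prompt(instruction, anchors, unlabeled, calibration)
print(prompt)
```

Because the manual labels of the calibration texts are kept out of the prompt, the labels the model later returns for those texts can be compared against the manual ones.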
【0044】 In the embodiments shown in Figures 3D and 4, character sample TS5 from the retained samples TS_KP is used to form the first calibration sample TS_Kcal, and character sample TS13 from the removed samples TS_RMV is used to form the second calibration sample TS_Rcal. The first calibration sample TS_Kcal and the second calibration sample TS_Rcal are mixed with the other unlabeled samples ULS to form the unlabeled data D_UL in the prompt message PM.
【0045】 As shown in Figures 1, 2, and 4, the processing unit 110 and the communication circuit 150 execute step S270, which provides the prompt message PM to the generative pre-trained transformer model 190 to generate normal-outlier prediction labels LB_PRED for the unlabeled data D_UL (including the unlabeled samples ULS, the first calibration sample TS_Kcal and the second calibration sample TS_Rcal). In some embodiments, the generative pre-trained transformer model 190 generates the normal-outlier prediction labels LB_PRED associated with the unlabeled data D_UL based on the anchor data D_AN and the anchor label data LB_AN in the prompt message PM. The normal-outlier prediction labels LB_PRED indicate, for each item of the unlabeled data D_UL, whether it is predicted to be a normal value or an outlier.
【0046】 In some embodiments, the manual input labels MLB associated with character samples TS5 and TS13 are not added to the content of the prompt message PM. The first calibration sample TS_Kcal and the second calibration sample TS_Rcal are used to verify the confidence level of the normal-outlier prediction labels LB_PRED generated by the generative pre-trained transformer model 190. In other words, the first calibration sample TS_Kcal and the second calibration sample TS_Rcal may be used to evaluate the accuracy of the normal-outlier prediction labels LB_PRED generated by the generative pre-trained transformer model 190.
【0047】 The normal-outlier prediction labels LB_PRED generated by the generative pre-trained transformer model 190 include calibration prediction labels LB_cal for the calibration samples (for example, the first calibration sample TS_Kcal and the second calibration sample TS_Rcal). The processing unit 110 compares the manual input labels MLB and the calibration prediction labels LB_cal associated with character samples TS5 and TS13. The more closely the manual input labels MLB match the calibration prediction labels LB_cal, the higher the confidence level. The less the manual input labels MLB match the calibration prediction labels LB_cal, the lower the confidence level.
【0048】 In some embodiments, the electronic device 100 collects the normal-outlier prediction labels LB_PRED from the generative pre-trained transformer model 190, and the processing unit 110 then removes the character samples with outlier labels from the dataset DB, so that all the character samples remaining in the dataset DB are clean data (for example, normal-value data) without noise data.
【0049】 In the above embodiments, a small number of partial samples PS are extracted from the full set of character samples in the dataset DB, and only these partial samples PS need to be labeled manually. The partial samples PS are processed to provide the anchor data D_AN and the calibration data (for example, the first calibration sample TS_Kcal and the second calibration sample TS_Rcal) in the prompt message PM. The prompt message PM triggers the generative pre-trained transformer model 190 to generate the normal-outlier prediction labels LB_PRED for the large number of unlabeled samples ULS. This spares the labeler the heavy burden of manually labeling a large number of character samples.
The labeler only needs to label only a very small number of character samples, and most of the character samples may be automatically processed by the data classification method 200. 【0050】 Please refer further to Figure 5, which is a flowchart of a data classification method 500 according to another embodiment of the present disclosure. The data classification method 500 may be performed by the electronic device 100 shown in Figure 1. Similar to the data classification method 200 shown in Figure 2, the data classification method 500 is used to generate normal-outlier predictions associated with character samples in a dataset DB. 【0051】 As shown in Figure 5, the data classification method 500 includes steps S510, S520, S530, S540, S550, S560, S565, S570, S575, and S580. Steps S510, S520, S530, S540, and S550 of the data classification method 500 in Figure 5 are similar to steps S210, S220, S230, S240, and S250 of the data classification method 200 in Figure 2, and these steps have been sufficiently disclosed in the embodiment shown in Figure 2, so the details of steps S510, S520, S530, S540, and S550 will not be repeated here. 【0052】 As shown in Figures 3D and 5, in step S550, the manually entered label MLB includes the retained labels in character samples TS5 and TS9, so character samples TS5 and TS9 are retained samples TS in the partial sample PS. KP It may be considered as such. On the other hand, since the manually entered label MLB includes the removed labels in the character samples TS6, TS11 and TS13, the character samples TS6, TS11 and TS13 are removed samples TS in the partial sample PS. RMV It may be considered as such. 【0053】 As shown in Figures 1 and 5, the processing unit 110 performs step S560, which generates a first prompt message including a partial sample with a manually entered label and a feature engineering task instruction. 【0054】 Please further refer to FIG. 6. FIG. 
6 is a schematic diagram of the first prompt message PM1 generated in step S560 according to an embodiment of the present disclosure. As shown in FIG. 6, the first prompt message PM1 includes a feature engineering task instruction INST1, a retention sample TS KP and a removal sample TS RMV . 【0055】 The retention sample TS KP and the removal sample TS RMV are partial samples PS with manual input labels MLB and are used to provide a plurality of prompts to the generative pre-training transformer model 190 to realize a feature engineering task. 【0056】 The feature engineering task instruction INST1 in the first prompt message PM1 is used to trigger the generative pre-training transformer model 190 to generate a distinguishable feature DF that can distinguish both the removal sample TS RMV and the retention sample TS KP . 【0057】 As shown in FIGS. 1, 5, and 6, the processing unit 110 and the communication circuit 150 execute step S565 of providing the first prompt message PM1 to the generative pre-training transformer model 190 to generate a distinguishable feature DF. As shown in FIG. 6, the generative pre-training transformer model 190 generates a feature set of 10 distinguishable features DF based on the first prompt message PM1. 【0058】 As shown in FIGS. 1 and 5, the processing unit 110 executes step S570 of generating a second prompt message including a distinguishable feature DF, a character sample, and a feature scoring task instruction. 【0059】 Please refer further to Figure 7. Figure 7 is a schematic diagram of a second prompt message PM2 generated in step S570 according to one embodiment of the present disclosure. As shown in Figure 7, the second prompt message PM2 includes a feature scoring task instruction INST2, a character sample TS, and a distinguishable feature DF. 
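The composition of the two prompt messages described above can be sketched as follows. This is only an illustrative sketch: the instruction wording in `INST1` and `INST2`, the "KEEP"/"REMOVE" headings, and the helper names `build_pm1` and `build_pm2` are assumptions made for this example and are not taken from the disclosure, which does not give the exact prompt text.

```python
from typing import Dict, List

# Hypothetical instruction texts; the disclosure does not give the exact
# wording of INST1 or INST2.
INST1 = (
    "List up to 10 distinguishable features that separate the REMOVE "
    "samples from the KEEP samples."
)
INST2 = "For each sample, state which option of each feature it matches."


def _numbered(samples: List[str]) -> str:
    """Render samples as a numbered block, one sample per line."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(samples, start=1))


def build_pm1(retained: List[str], removed: List[str]) -> str:
    """PM1: INST1 plus the manually labeled partial samples (TS_KP, TS_RMV)."""
    return (
        f"{INST1}\n\n"
        f"KEEP samples:\n{_numbered(retained)}\n\n"
        f"REMOVE samples:\n{_numbered(removed)}"
    )


def build_pm2(features: Dict[str, List[str]], samples: List[str]) -> str:
    """PM2: INST2, the distinguishable features DF, and the character samples TS."""
    feature_lines = "\n".join(
        f"- {name}: {', '.join(options)}" for name, options in features.items()
    )
    return (
        f"{INST2}\n\n"
        f"Features:\n{feature_lines}\n\n"
        f"Samples:\n{_numbered(samples)}"
    )
```

In such a sketch, the features passed to `build_pm2` would be parsed from the model's reply to PM1, so the second prompt can only be built after step S565 has returned.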
【0060】 In some embodiments, the feature scoring task instruction INST2 in the second prompt message PM2 is used to trigger the generative pre-trained transformer model 190 to identify whether each character sample TS has attributes that match the possible options of each distinguishable feature DF.
【0061】 As shown in Figures 1, 5, and 7, the processing unit 110 and the communication circuit 150 perform step S575 of providing the second prompt message PM2 to the generative pre-trained transformer model 190 to generate feature predictions F_PRED for the distinguishable features DF of the character samples TS. As shown in Figure 7, the feature predictions F_PRED indicate that the character sample TS1 is an English text paragraph related to a health topic and that the character sample TS2 is a Chinese text paragraph related to a health topic. In some embodiments, the distinguishable features DF are human-recognizable features (e.g., language, topic, time, length), as shown in Figures 6 and 7. In other embodiments, the distinguishable features DF may include features that are not human-recognizable, such as latent vectors.
【0062】 As shown in Figures 1 and 5, the processing unit 110 performs step S580 of executing a classification algorithm to generate normal-outlier prediction labels associated with the unlabeled samples.
【0063】 Please refer further to Figure 8. Figure 8 is a schematic diagram showing how the classification algorithm generates the normal-outlier prediction labels LB_PRED in step S580 according to one embodiment of the present disclosure.
【0064】 As shown in Figure 8, the classification algorithm in step S580 is executed based on the feature predictions F_PRED1 of the partial samples PS and the feature predictions F_PRED2 of the unlabeled samples ULS, thereby generating the normal-outlier prediction labels LB_PRED associated with the unlabeled samples ULS. In one embodiment, the character samples TS include the partial samples PS and the unlabeled samples ULS, so both the feature predictions F_PRED1 of the partial samples PS and the feature predictions F_PRED2 of the unlabeled samples ULS in Figure 8 are obtained from the feature predictions F_PRED of the character samples TS in Figure 7.
【0065】 As shown in Figure 8, the classification algorithm in step S580 is executed on training data TD that includes the feature predictions F_PRED1 of the partial samples PS and the manual input labels MLB of the partial samples PS. The classification algorithm thereby generates the normal-outlier prediction labels LB_PRED associated with the unlabeled samples ULS based on the feature predictions F_PRED2 of the unlabeled samples ULS.
【0066】 For example, if an unlabeled sample ULS has a set of feature predictions F_PRED2 similar to the feature predictions F_PRED1 of the character sample TS11, the classification algorithm can classify this unlabeled sample ULS as an outlier. If an unlabeled sample ULS has a set of feature predictions F_PRED2 similar to the feature predictions F_PRED1 of the character sample TS9, the classification algorithm can classify this unlabeled sample ULS as normal. In some embodiments, the classification algorithm in step S580 may be implemented by an extreme gradient boosting (XGBoost) algorithm, a categorical boosting (CatBoost) algorithm, or a random forest algorithm.
【0067】 In some embodiments, the extreme gradient boosting algorithm can generate a predictive score for each character sample. If a character sample's predictive score is greater than 0.5 (e.g., predictive score = 0.7), the character sample may be labeled as normal, and its confidence level may be equal to its score (e.g., confidence level = 0.7). If a character sample's predictive score is lower than 0.5 (e.g., predictive score = 0.2), the character sample may be labeled as an outlier, and its confidence level may be equal to the complement of its score (e.g., confidence level = 1 - 0.2 = 0.8).
【0068】 Based on the above embodiment, a few partial samples PS are extracted from the dataset DB containing all of the character samples, and only this small number of partial samples PS needs to be labeled manually. The partial samples PS are processed to generate the first prompt message PM1 and the second prompt message PM2, which trigger the generative pre-trained transformer model 190 to perform the feature engineering task and the feature scoring task. Based on the feedback from the generative pre-trained transformer model 190, the normal-outlier prediction labels LB_PRED for a large number of unlabeled samples ULS can be generated quickly. In this way, the significant burden on experts of manually labeling large numbers of character samples can be avoided. Labelers need to label only a small fraction of the character samples, and most of the character samples may be processed automatically by the data classification method 500.
【0069】 The specification and the claims use specific terminology to refer to certain elements. However, those skilled in the art should understand that the same element may be referred to by different names. The specification and the claims distinguish elements by differences in function rather than by differences in name. The word "includes" as used in the specification and the claims is an open term and should be interpreted as "includes, but is not limited to."
【0070】 Furthermore, unless otherwise specified herein, any singular term also covers the plural.
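The classification of step S580 (paragraphs 【0065】 and 【0066】) and the score-to-label rule of paragraph 【0067】 can be illustrated with a small sketch. Two points are assumptions of this example rather than part of the disclosure: the nearest-match classifier below is a deliberately simple stand-in for the boosting and random forest algorithms the disclosure names, and a score of exactly 0.5, which the text does not cover, is grouped with the outliers. The names `mismatches`, `classify_by_nearest`, and `label_from_score` are hypothetical.

```python
from typing import Dict, List, Tuple

# One set of feature predictions, e.g. {"language": "English", "topic": "health"}
Features = Dict[str, str]


def mismatches(a: Features, b: Features) -> int:
    """Count the distinguishable features of `a` on which `b` differs."""
    return sum(1 for name in a if a[name] != b.get(name))


def classify_by_nearest(
    training_data: List[Tuple[Features, str]], unlabeled: Features
) -> str:
    """Return the manual label of the labeled sample most similar to `unlabeled`.

    Stand-in for a trained classifier: training_data pairs F_PRED1 with MLB,
    and `unlabeled` plays the role of F_PRED2. Ties go to the earliest row.
    """
    _, label = min(training_data, key=lambda row: mismatches(unlabeled, row[0]))
    return label


def label_from_score(score: float) -> Tuple[str, float]:
    """Map a predictive score in [0, 1] to (label, confidence level)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score > 0.5:
        return "normal", score
    # Assumption: a score of exactly 0.5 is not covered by the text; this
    # sketch groups it with the outliers.
    return "outlier", 1.0 - score
```

With this rule, the two worked examples in paragraph 【0067】 correspond to `label_from_score(0.7)` and `label_from_score(0.2)`.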
【0071】 The foregoing are merely preferred embodiments of the present disclosure, and various modifications and equivalent changes can be made to the present disclosure without departing from its scope or spirit. Accordingly, all modifications and equivalent changes made within the scope of the following claims are covered by the present disclosure.
[Explanation of Symbols]
【0072】
100: Electronic device
110: Processing unit
120: Input interface
130: Memory unit
140: Display
150: Communication circuit
190: Generative pre-trained transformer model
DB: Dataset
MLB: Manual input label
PS: Partial sample
200, 500: Data classification method
S210, S220, S230, S240, S250, S260, S261, S262, S263, S264, S265, S266, S270, S510, S520, S530, S540, S550, S560, S565, S570, S575, S580: Steps
TS1-TS14: Character samples
SS: Semantic space
eTS1-eTS14: Character embeddings
ULS: Unlabeled sample
RK: Outlier-normal order
TS_RMV: Removal sample
TS_KP: Retained sample
TS_Rleft: Remaining removal sample
TS_KAN: First anchor sample
TS_Kcal: First calibration sample
TS_RAN: Second anchor sample
TS_Rcal: Second calibration sample
D_UL: Unlabeled data
PM: Prompt message
D_AN: Anchor data
LB_AN: Anchor label data
INST: Task instruction
LB_PRED: Normal-outlier prediction label
LB_cal: Calibration prediction label
PM1: First prompt message
PM2: Second prompt message
INST1: Feature engineering task instruction
DF: Distinguishable feature
INST2: Feature scoring task instruction
F_PRED, F_PRED1, F_PRED2: Feature prediction
TS: Character sample
TD: Training data
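As a rough illustration of the calibration check in paragraph 【0047】 and the clean-up in paragraph 【0048】, the sketch below compares MLB with LB_cal and then drops the samples predicted as outliers. The disclosure only says that closer agreement means a higher confidence level; using the fraction of matching labels as that level is an assumption of this example, as are the function names and the "normal"/"outlier" label strings.

```python
from typing import Dict


def calibration_confidence(manual: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of calibration samples whose predicted label equals the manual label.

    `manual` maps calibration sample IDs to MLB; `predicted` maps sample IDs
    to LB_PRED (which contains LB_cal for the calibration samples).
    """
    if not manual:
        raise ValueError("at least one calibration sample is required")
    hits = sum(1 for sid, label in manual.items() if predicted.get(sid) == label)
    return hits / len(manual)


def remove_outliers(dataset: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, str]:
    """Return the dataset without the samples predicted as outliers.

    Samples that have no prediction are kept unchanged.
    """
    return {
        sid: text for sid, text in dataset.items() if predicted.get(sid) != "outlier"
    }
```

A caller might apply `remove_outliers` only when `calibration_confidence` exceeds some threshold of its own choosing; the disclosure does not specify such a threshold.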
Claims
[Claim 1] A data classification method performed by a processing circuit, comprising: a step of obtaining a plurality of character samples from a dataset; a step of converting the plurality of character samples into a plurality of character embeddings in a semantic space; a step of generating an outlier-normal order of the plurality of character samples by an outlier detection algorithm based on distances between the plurality of character embeddings in the semantic space; a step of selecting a plurality of partial samples from the plurality of character samples based on the outlier-normal order; a step of receiving, via an input interface, a manual input command that specifies a plurality of manual input labels for the plurality of partial samples; a step of generating, based on the plurality of partial samples having the plurality of manual input labels and a plurality of unlabeled samples among the plurality of character samples, a prompt message including a task instruction, unlabeled data, and anchor data; and a step of providing the prompt message to a generative pre-trained transformer model to generate a plurality of normal-outlier prediction labels associated with the plurality of unlabeled samples.
[Claim 2] The data classification method according to claim 1, wherein at least one of the plurality of character samples that tends to belong to outlier samples is selected as the plurality of partial samples based on the outlier-normal order, and the number of the plurality of partial samples is less than the number of the plurality of unlabeled samples.
[Claim 3] The data classification method according to claim 1, wherein the manual input command is used to specify a plurality of retain labels for a plurality of retained samples among the plurality of partial samples and a plurality of removal labels for a plurality of removal samples among the plurality of partial samples, and the step of generating the prompt message includes: a step of selecting a plurality of first anchor samples from the plurality of retained samples based on a cluster distribution of the plurality of retained samples; a step of selecting a plurality of second anchor samples from the plurality of removal samples based on a cluster distribution of the plurality of removal samples; and a step of merging the plurality of first anchor samples and the plurality of second anchor samples to form the anchor data in the prompt message.
[Claim 4] The data classification method according to claim 3, wherein the step of generating the prompt message further includes: a step of selecting a plurality of first calibration samples from the plurality of retained samples that have not been selected as the plurality of first anchor samples; a step of selecting a plurality of second calibration samples from the plurality of removal samples that have not been selected as the plurality of second anchor samples; and a step of forming the unlabeled data in the prompt message from a mixture of the plurality of unlabeled samples, the plurality of first calibration samples, and the plurality of second calibration samples.
[Claim 5] The data classification method according to claim 4, wherein the plurality of first calibration samples and the plurality of second calibration samples are used to verify the confidence level of the plurality of normal-outlier prediction labels generated by the generative pre-trained transformer model.
[Claim 6] The data classification method according to claim 4, wherein the step of selecting the plurality of first calibration samples includes: a step of comparing a plurality of similarities between the plurality of unlabeled samples and the plurality of retained samples that have not been selected as the plurality of first anchor samples; and a step of selecting the plurality of first calibration samples based on the plurality of similarities.
[Claim 7] The data classification method according to claim 1, wherein the task instruction in the prompt message is used to instruct the generative pre-trained transformer model to identify a plurality of outliers in a specific dataset.
[Claim 8] The data classification method according to claim 1, wherein the outlier detection algorithm is implemented by a random sample consensus-NN (RANSAC-NN) algorithm, an isolation forest algorithm, or a local outlier factor algorithm.
[Claim 9] The data classification method according to claim 1, wherein each of the plurality of character samples includes a character paragraph or a question-and-answer pair.
[Claim 10] A data classification method performed by a processing circuit, comprising: a step of obtaining a plurality of character samples from a dataset; a step of converting the plurality of character samples into a plurality of character embeddings in a semantic space; a step of generating an outlier-normal order of the plurality of character samples by an outlier detection algorithm based on distances between the plurality of character embeddings in the semantic space; a step of selecting a plurality of partial samples from the plurality of character samples based on the outlier-normal order; a step of receiving, via an input interface, a manual input command that specifies a plurality of manual input labels for the plurality of partial samples; a step of generating a first prompt message including the plurality of partial samples having the plurality of manual input labels and a feature engineering task instruction; a step of providing the first prompt message to a generative pre-trained transformer model to generate a plurality of distinguishable features; a step of generating a second prompt message including the plurality of distinguishable features, the plurality of character samples, and a feature scoring task instruction; a step of providing the second prompt message to the generative pre-trained transformer model to generate a plurality of feature predictions for the plurality of distinguishable features of the plurality of character samples; and a step of executing a classification algorithm based on the plurality of feature predictions of the plurality of partial samples and of a plurality of unlabeled samples among the plurality of character samples to generate a plurality of normal-outlier prediction labels for the plurality of unlabeled samples.
[Claim 11] The data classification method according to claim 10, wherein at least one of the plurality of character samples that tends to belong to outlier samples is selected as the plurality of partial samples based on the outlier-normal order, and the number of the plurality of partial samples is less than the number of the plurality of unlabeled samples.
[Claim 12] The data classification method according to claim 10, wherein the manual input command is used to specify a plurality of retain labels for a plurality of retained samples among the plurality of partial samples and a plurality of removal labels for a plurality of removal samples among the plurality of partial samples.
[Claim 13] The data classification method according to claim 12, wherein the feature engineering task instruction in the first prompt message is used to trigger the generative pre-trained transformer model to generate the plurality of distinguishable features for separating the plurality of removal samples from the plurality of retained samples.
[Claim 14] The data classification method according to claim 10, wherein the feature scoring task instruction in the second prompt message is used to trigger the generative pre-trained transformer model to identify whether the plurality of character samples have a plurality of attributes of the plurality of distinguishable features.
[Claim 15] The data classification method according to claim 10, wherein the classification algorithm is executed on training data that includes the plurality of feature predictions of the plurality of partial samples and the plurality of manual input labels of the plurality of partial samples, thereby generating the plurality of normal-outlier prediction labels associated with the plurality of unlabeled samples based on the plurality of feature predictions of the plurality of unlabeled samples.
[Claim 16] The data classification method according to claim 10, wherein the classification algorithm is implemented by an extreme gradient boosting (XGBoost) algorithm, a categorical boosting (CatBoost) algorithm, or a random forest algorithm.
[Claim 17] The data classification method according to claim 10, wherein the outlier detection algorithm is implemented by a random sample consensus-NN (RANSAC-NN) algorithm, an isolation forest algorithm, or a local outlier factor algorithm.
[Claim 18] The data classification method according to claim 10, wherein each of the plurality of character samples includes a character paragraph or a question-and-answer pair.