Data processing device, data processing method, and data processing program

The data processing device reduces the cost of labeling training data by having a natural language processing model assign pseudo-labels, calculating a confidence level for each label, and keeping only high-confidence data, which addresses the high cost of building machine learning models for well-being (WB) prediction.

WO2026126305A1 · PCT designated stage · Publication Date: 2026-06-18 · NT T INC

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: NT T INC
Filing Date: 2024-12-09
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Labeling training data for machine learning models that predict well-being (WB) is costly, because WB depends on individual situations and subjective evaluations.

Method used

A data processing device and method that use a natural language processing model to assign pseudo-labels while referring to related data, calculate a confidence level for each label, and filter the labeled data by confidence, improving the accuracy and efficiency of labeling.

Benefits of technology

Reduces the cost of labeling learning data by extracting high-confidence labeled data for training, leading to the construction of a highly accurate machine learning model for WB prediction.


Abstract

This data processing device: inputs, to a natural language processing model, an instruction to assign, to respective data items in a data group that are candidates for training data of a machine learning model, labels indicating evaluation results for the data items, and an instruction to refer to data related to the data items in the data group when assigning the labels to the data items, thereby causing the natural language processing model to assign, to the data items, the labels indicating the evaluation results for the data items; and receives label assignment results for the respective data items from the natural language processing model. Thereafter, the data processing device calculates the degrees of certainty of the labels assigned to the respective data items, on the basis of the distribution of the predictive probabilities of the labels outputted from the natural language processing model for the data items. Furthermore, the data processing device filters a data item having a label of which the degree of certainty is a prescribed value or more. Thereafter, the data processing device adds, to the training data of the machine learning model, the filtered data item having the label of which the degree of certainty is the prescribed value or more.

Description

Data processing device, data processing method, and data processing program

【0001】 The present invention relates to a data processing device, a data processing method, and a data processing program.

【0002】 In recent years, well-being (WB) has attracted attention. Because WB is psychological, it is difficult to measure. Currently, the most reliable way to measure WB is to conduct a survey. However, conducting a survey every time is burdensome both for those who create the survey and for those who answer it. It is therefore conceivable to use machine learning to predict survey results. Supervised learning for this purpose requires training data labeled with the survey results.

【0003】
Daniel Preotiuc-Pietro et al., "Modelling Valence and Arousal in Facebook posts," Proceedings of NAACL-HLT 2016, pages 9–15, [online], [Retrieved December 6, 2024], Internet <URL: https://aclanthology.org/W16-0404/>
Embeddings-OpenAI API, [online], [Retrieved December 6, 2024], Internet <URL: https://platform.openai.com/docs/guides/embeddings>
Cheng-Han Chiang et al., "Can Large Language Models Be an Alternative to Human Evaluation?," 2023, [online], [Retrieved December 6, 2024], Internet <URL: https://arxiv.org/abs/2305.01937>

【0004】 However, because WB depends on, among other things, the situation of the person whose WB is being measured, that person must assign the labels himself or herself. This raises the cost of labeling training data and, in turn, the cost of constructing a machine learning model for predicting WB. An object of the present invention is therefore to solve this problem and reduce the cost of labeling the training data used to train a machine learning model for predicting WB and the like.

【0005】 In order to solve the above-described problem, the present invention comprises: a labeling unit that inputs, into a natural language processing model, an instruction to attach, to each data item in a data group of candidates for training data of a machine learning model, a label indicating an evaluation result of that data item, and an instruction to refer to data related to the data item in the data group when attaching the label, thereby causing the natural language processing model to attach the label indicating the evaluation result to the data item; a confidence calculation unit that calculates the confidence of the label attached to each data item based on the distribution of the prediction probabilities of the labels output from the natural language processing model; a filtering unit that filters for data whose label confidence is equal to or higher than a predetermined value; and a data expansion unit that adds the filtered labeled data, whose label confidence is equal to or higher than the predetermined value, to the training data of the machine learning model.

【0006】 According to the present invention, it is possible to reduce the cost of labeling the training data used to train a machine learning model.

【0007】
FIG. 1 is a diagram showing an example of the distribution of prediction probabilities of each label.
FIG. 2 is a diagram for explaining the outline of a data processing device.
FIG. 3 is a diagram showing a configuration example of a data processing device.
FIG. 4 is a diagram showing an example of related information.
FIG. 5 is a flowchart showing an example of a processing procedure executed by a data processing device.
FIG. 6 is a diagram showing a part of a data set used in an experiment.
FIG. 7 is a diagram showing a part of the results of an evaluation experiment.
FIG. 8 is a diagram showing the results of an evaluation experiment.
FIG. 9 is a diagram showing an example of a computer that executes a data processing program.

【0008】 The following describes embodiments for carrying out the present invention with reference to the drawings. The present invention is not limited to these embodiments.

【0009】 [Overview] First, an overview of the data processing device of this embodiment will be described. In the following, the natural language processing model that the data processing device uses for labeling data is described as an LLM by way of example, but it is not limited to this.

【0010】 The data processing device uses an LLM to create labeled data for training a machine learning model. In this embodiment, the LLM assigns pseudo-labels to the input data (e.g., text data). The labels are, for example, the valence (emotional value) of the data on a 10-point scale from 1 to 10. The LLM also outputs the predicted probability of each label for the input data. For example, if the label is one of 1 to 10, the LLM outputs the predicted probability of each label (1 to 10) for the data.

【0011】 The data processing device inputs a set of candidate training data (for example, a set of text data) into the LLM mentioned above. It then obtains, from the LLM, the labeled data and the distribution of the predicted probabilities of the labels for each data item. Next, it calculates the confidence level of each label assigned by the LLM based on that distribution. Finally, it filters the LLM-labeled data set so that only data whose labels have a confidence level at or above a predetermined value remain.

【0012】 For example, as shown in Figure 1, if the distribution of predicted probabilities over the labels output by the LLM is sharp (i.e., the predicted probability of one particular label is exceptionally high), the confidence level of the LLM's label prediction is considered high.
Therefore, the data processing device extracts, by filtering, data for which the distribution of predicted probabilities over the labels is sharp, as described above, treating it as data for which the LLM's label prediction has high confidence. For example, the data processing device takes the difference between the highest and second-highest predicted probabilities among the labels as the confidence level of the label prediction, and extracts data whose labels have high confidence levels. The data processing device then adds the extracted data to the training data of the machine learning model. Training the machine learning model on this training data makes it possible to construct a highly accurate machine learning model.

【0013】 For example, if the labels assigned to the data (the evaluation results of the data) are values related to human emotions, such as WB, the label values depend on the person and the situation. The confidence level of the LLM's label prediction (estimation) therefore tends to be low. As a result, if the data processing device extracts only data whose labels have a confidence level at or above a predetermined value from the candidate training data, it may not be able to extract enough data to train the machine learning model.

【0014】 Therefore, when the data processing device inputs a set of candidate training data (for example, a set of text data) into the LLM and has the LLM predict the label of each data item, it also has the LLM take into account information related to the data item whose label is being predicted.

【0015】 For example, if the candidate training data set consists of user utterances, the data processing device instructs the LLM to predict labels by also referring to the utterances that precede the utterance whose label is being predicted.
This allows the data processing device to improve both the accuracy of the label predictions for the candidate training data and the confidence level of those predictions (see the experimental results described later).

【0016】 For example, consider a case where the candidate training data set consists of text, and the labels to be assigned to the text are the results of evaluating that text. In this case, as shown in Figure 2, the data processing device inputs into the LLM the text to be evaluated, the evaluation axes used to evaluate the text, and an instruction specifying the related information to be considered in the evaluation, and has the LLM evaluate the text. This allows the data processing device to improve the accuracy of the LLM's text evaluation results. It also improves the confidence level of those results.

【0017】 As a result, when the data processing device extracts data with a confidence level at or above a predetermined value from a set of candidate training data, it can extract a large amount of data with high labeling accuracy. The data processing device then adds the extracted data to the training data of a machine learning model and trains the machine learning model, thereby constructing a machine learning model with high label prediction accuracy.

【0018】 [Configuration Example] Next, an example configuration of the data processing device 10 will be described using Figure 3. The data processing device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.

【0019】 The input/output unit 11 is an interface that handles the input and output of various types of data. For example, the input/output unit 11 accepts input of a data group that can serve as candidate training data for a machine learning model. This data group may be, for example, a chronological arrangement of text data showing the content of each speaker's remarks in a meeting.

【0020】 The storage unit 12 stores data, programs, and the like that are referenced when the control unit 13 performs various processes. The storage unit 12 is implemented by semiconductor memory elements such as RAM (Random Access Memory) and flash memory, or by storage devices such as hard disks and optical discs. For example, the storage unit 12 stores the data groups and the like received by the input/output unit 11.

【0021】 The control unit 13 is responsible for controlling the entire data processing device 10. The functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.

【0022】 The control unit 13 includes, for example, a labeling unit 130, a confidence calculation unit 131, a filtering unit 132, and a data expansion unit 133. The learning unit 134 may or may not be included; the case where it is included will be described later.

【0023】 The labeling unit 130 uses an LLM to assign a label (pseudo-label) to each data item in a candidate data set of training data for a machine learning model. For example, the labeling unit 130 uses the LLM to predict the evaluation value of each data item in the data set and assigns to the data item a label indicating the predicted evaluation value.

【0024】 For example, the labeling unit 130 inputs to the LLM an instruction to predict the evaluation value of each data item in the above data group and to assign to the data item a label corresponding to the predicted evaluation value, and an instruction to refer to related data (related information) in the above data group when predicting the evaluation value of the data item. The labeling unit 130 then receives the labeling results output from the LLM.

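The two instructions that the labeling unit 130 passes to the LLM can be pictured as a single prompt assembled from the target item and the items that precede it. The following is a minimal sketch, not the patent's implementation: the `Utterance` class, the `build_prompt` function, the `max_context` window, and the exact wording are illustrative assumptions, with the wording loosely modeled on the example prompt in paragraph 【0027】. No call to any particular LLM service is shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Utterance:
    """One item in the chronologically ordered data group (hypothetical structure)."""
    speaker: str
    text: str


def build_prompt(data_group: list[Utterance], target_index: int,
                 axis: str = "Satisfaction Level",
                 scale: tuple[int, int] = (1, 5),
                 max_context: int = 5) -> str:
    """Assemble a labeling prompt that points the model at earlier utterances.

    max_context is an assumed cap on how many preceding items are included;
    the source does not specify one.
    """
    start = max(0, target_index - max_context)
    previous = data_group[start:target_index]
    previous_text = "\n".join(f"{u.speaker}: {u.text}" for u in previous) or "(none)"
    target = data_group[target_index]
    low, high = scale
    return (
        f"The previously spoken text is:\n{previous_text}\n\n"
        f'Please assign a score (label) of {low} to {high} for "{axis}" '
        f"to the following text.\n"
        f"{target.speaker}: {target.text}"
    )


if __name__ == "__main__":
    # Made-up two-line meeting excerpt, for illustration only.
    meeting = [
        Utterance("A", "Shall we review the Fitbit experiment today?"),
        Utterance("B", "Yes, the fitbit formation went better than expected."),
    ]
    print(build_prompt(meeting, target_index=1))
```

Because the earlier utterance travels with the target text, a model reading this prompt has the context it would need to interpret a misrecognized phrase in the target, which is the effect the description attributes to related information.
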
【0025】 Related information can be defined in various ways. For example, if the above data set is a set of texts representing user statements (see Figure 4), the related information could be the texts of statements made, in chronological order, before the statement to which a label is to be assigned. For example, if the text of the statement to be labeled is the text indicated by reference numeral 401, the related information would be the series of texts indicated by reference numeral 402.

【0026】 In this case, the labeling unit 130 instructs the LLM to use the text of the previous statements (see reference numeral 402) as related information and to assign to the target text (see reference numeral 401) a label indicating the predicted satisfaction level of the statement. To this end, the labeling unit 130 inputs, for example, the following prompt (instruction) to the LLM:

【0027】
• The previously spoken text is <Previous Text>.
• Please assign a score (label) of 1 to 5 for "Satisfaction Level" to the following text.
• <Target Text>

【0028】 This allows the LLM to assign labels to texts while taking previous statements into account. As a result, the LLM can assign labels to texts with high accuracy.

【0029】 For example, if the data input to the LLM is text produced by speech recognition of the user's spoken audio data, it may contain a misrecognized phrase ("fitbit formation"), as indicated by reference numeral 601 in Figure 6.

【0030】 Here, the labeling unit 130 instructs the LLM to use the previous statements as related information when labeling, which improves labeling accuracy even when the data to be labeled includes a misrecognized phrase such as "fitbit formation" (reference numeral 601).
This is because, even if the data to be labeled includes a misrecognized phrase, the LLM may be able to recover the original meaning of that phrase if the related information includes the correctly recognized phrase (for example, "Fitbit experiment" in the statement indicated by reference numeral 602).

【0031】 In the example above, the data set input to the LLM was described as including the identification information of the speaker (user) of each statement, as shown in Figure 6, but the speaker's identification information may also be omitted.

【0032】 Returning to the explanation of Figure 3, the confidence calculation unit 131 calculates the confidence level of the labels assigned by the LLM. For example, the confidence calculation unit 131 obtains from the LLM the predicted probability of each label for each data item. Then, based on the distribution of the predicted probabilities of the labels for each data item, the confidence calculation unit 131 calculates the confidence level of the label assigned to that data item.

【0033】 For example, the confidence calculation unit 131 calculates the confidence level of a label assigned to a data item based on the differences among the predicted probabilities of the labels for that data item. To give one example, the confidence calculation unit 131 takes, as the confidence level of the assigned label, the difference between the highest predicted probability and the second-highest predicted probability among the labels.

【0034】 Alternatively, the confidence calculation unit 131 may determine the entropy of the distribution of the predicted probabilities of the labels output from the LLM for a data item, and calculate the confidence level of the label assigned to the data item based on that entropy.

【0035】 The filtering unit 132 filters the data group to which labels have been assigned by the labeling unit 130.
For example, the filtering unit 132 filters the data group for data whose label confidence level, as calculated by the confidence calculation unit 131, is equal to or greater than a predetermined value.

【0036】 The data expansion unit 133 adds the data filtered by the filtering unit 132 (labeled data) to the training data of the machine learning model.

【0037】 With such a data processing device 10, data whose labels have high confidence can be added to the training data used to train a machine learning model that predicts WB and the like. As a result, the cost of labeling training data can be reduced.

【0038】 [Example of Processing Procedure] Next, an example of a processing procedure performed by the data processing device 10 will be explained using Figure 5. The labeling unit 130 of the data processing device 10 uses the LLM to assign a label to each data item in the candidate data set of training data for the machine learning model (S10: assigning labels to data).

【0039】 After S10, the confidence calculation unit 131 calculates the confidence level of the labels assigned to the data by the LLM. For example, the confidence calculation unit 131 obtains from the LLM the predicted probability of each label for each data item and calculates the difference in predicted probabilities between labels (S11).

【0040】 After S11, the filtering unit 132 filters for data whose label confidence level is greater than or equal to a predetermined value. For example, the filtering unit 132 filters for data where the difference in predicted probabilities between labels calculated in S11 is greater than or equal to a predetermined threshold (S12).

【0041】 After S12, the data expansion unit 133 adds the data filtered by the filtering unit 132 (labeled data whose label confidence level is greater than or equal to the predetermined value) to the training data of the machine learning model (S13).

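Steps S11 to S13 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes the per-label predicted probabilities are already available as a dictionary for each data item, and the names `LabeledItem`, `margin_confidence`, `entropy_confidence`, and `expand_training_data`, as well as the default threshold of 0.6, are hypothetical. The margin follows the difference between the highest and second-highest probability described in 【0033】. The entropy variant is only one possible reading of 【0034】, which does not say how entropy is mapped to a confidence value; here it is taken as one minus the normalized entropy.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class LabeledItem:
    """A data item with the LLM's label distribution (hypothetical structure)."""
    text: str
    probs: dict[int, float]  # label -> predicted probability, assumed to sum to 1

    @property
    def label(self) -> int:
        # Pseudo-label: the label with the highest predicted probability.
        return max(self.probs, key=self.probs.get)


def margin_confidence(probs: dict[int, float]) -> float:
    """Highest predicted probability minus the second highest (S11)."""
    ranked = sorted(probs.values(), reverse=True)
    return ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)


def entropy_confidence(probs: dict[int, float]) -> float:
    """Assumed mapping: 1 - H(p) / log(K), so a sharper distribution scores higher."""
    if len(probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def expand_training_data(training_data, candidates, threshold=0.6,
                         confidence=margin_confidence):
    """Keep candidates at or above the threshold (S12) and append them (S13)."""
    kept = [c for c in candidates if confidence(c.probs) >= threshold]
    return training_data + [(c.text, c.label) for c in kept]


if __name__ == "__main__":
    # Made-up distributions over a five-point scale, for illustration only.
    sharp = LabeledItem("utterance A", {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.85, 5: 0.05})
    flat = LabeledItem("utterance B", {1: 0.15, 2: 0.25, 3: 0.30, 4: 0.20, 5: 0.10})
    print(expand_training_data([], [sharp, flat]))
```

In this made-up example only the sharply peaked item clears the threshold, so only it is appended to the training data with its pseudo-label. Passing `confidence=entropy_confidence` swaps in the entropy-based measure without changing the filtering step.
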
【0042】 After that, the machine learning model is trained using the training data to which the data has been added. The training may be executed by the data processing device 10 or by a device other than the data processing device 10. When the data processing device 10 performs the training, it further includes a learning unit 134 (see FIG. 3). The learning unit 134 trains the machine learning model using the training data to which the data has been added by the data expansion unit 133.

【0043】 According to the data processing device 10 described above, labels can be accurately assigned to the training data of the machine learning model. As a result, the data processing device 10 can construct a machine learning model with high label prediction accuracy.

【0044】 [Results of Evaluation Experiment] The following describes an experiment that evaluated the label prediction (estimation) accuracy of the data processing device 10 described above and the effect of filtering the data by confidence level.

【0045】 The data set used in this experiment consists of utterances in a meeting, each labeled with a human evaluation of the satisfaction level of that utterance on a five-point scale (see FIG. 7). GPT-4o-mini was used as the LLM. The comparative example is the label estimation accuracy obtained when the LLM estimated the labels without using related information.

【0046】 (1) Estimation accuracy. The data processing device 10 predicted (estimated) the satisfaction level of each statement in a meeting on a five-level scale using an LLM. Satisfaction level is a psychological score that is relatively difficult to estimate and is also influenced by the subjectivity of the speaker and of the person evaluating the satisfaction level.
Therefore, with the goal of ensuring that what humans evaluate as "satisfied" is not evaluated as "dissatisfied", the proportion of the LLM's estimates falling within ±1 of the human-assigned label was defined as the coincidence rate, and this coincidence rate was used as the measure of estimation accuracy.

【0047】 For example, as shown in FIG. 7, if the difference between the label given by humans and the label estimated by the LLM (the estimation result) is within ±1, the estimate is judged as ○ (correct), and if it exceeds ±1, it is judged as × (incorrect). The coincidence rate was then calculated as the number of correct answers divided by the number of estimations. In the example shown in FIG. 7, the coincidence rate is 10 / 14 ≈ 0.71.

【0048】 (2) Effect of data filtering by confidence level. Further, the data processing device 10 calculated the confidence level of the LLM's label estimate for each data item, filtered (extracted) the data group at confidence thresholds of 0.2 or more, 0.4 or more, 0.6 or more, and 0.8 or more, and evaluated the coincidence rate of the labels of the extracted data.

【0049】 FIG. 8 shows the results of the evaluation experiment. As shown in FIG. 8, it was confirmed that the label estimation accuracy of the data processing device 10 of this embodiment is higher than that of the comparative example. Furthermore, when the data was filtered by the confidence level of the label estimation, it was confirmed that the data processing device 10 can extract more data than the comparative example.

【0050】 From this, it was confirmed that the data processing device 10 can extract a large amount of data with highly accurate labels as training data for the machine learning model. It was therefore shown that the data processing device 10 can construct a machine learning model with high label estimation accuracy.

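The coincidence rate and the threshold sweep described above reduce to simple arithmetic. The sketch below is illustrative only: the function names `coincidence_rate` and `sweep_thresholds` and the record layout are assumptions, and the sample records are made up for demonstration, not taken from the experiment.

```python
from __future__ import annotations


def coincidence_rate(human: list[int], estimated: list[int], tolerance: int = 1) -> float:
    """Share of estimates within +/- tolerance of the human-assigned label."""
    if len(human) != len(estimated):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("no labels to compare")
    correct = sum(abs(h - e) <= tolerance for h, e in zip(human, estimated))
    return correct / len(human)


def sweep_thresholds(records, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """For each threshold, report how many items survive filtering and their rate.

    records: iterable of (human_label, estimated_label, confidence) tuples.
    Returns {threshold: (n_extracted, rate or None if nothing was extracted)}.
    """
    records = list(records)
    results = {}
    for t in thresholds:
        kept = [(h, e) for h, e, c in records if c >= t]
        if kept:
            humans, estimates = zip(*kept)
            rate = coincidence_rate(list(humans), list(estimates))
        else:
            rate = None
        results[t] = (len(kept), rate)
    return results


if __name__ == "__main__":
    # Made-up records: (human label, LLM estimate, confidence).
    demo = [(4, 4, 0.9), (2, 4, 0.3), (5, 4, 0.7), (1, 3, 0.1), (3, 3, 0.5)]
    for threshold, (count, rate) in sweep_thresholds(demo).items():
        print(threshold, count, rate)
```

With 10 correct judgments out of 14, as in the FIG. 7 example, `coincidence_rate` returns 10 / 14 ≈ 0.71. Reporting the number of extracted items alongside the rate matters because the experiment compares both: how accurate the surviving labels are and how much data survives at each threshold.
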
【0051】 [Examples of Application of This Embodiment] The data processing device 10 described above can be applied, for example, as follows. It can be applied when a machine learning model is used to estimate the level of aggression of each user from their statements in an online meeting. This makes it possible to prompt the user making the statements to exercise restraint if the aggression level estimated by the machine learning model exceeds an acceptable range. Alternatively, each user's statements may be input to the machine learning model to estimate, in real time, the level of satisfaction, emotions, activity level, and usefulness of the meeting.

【0052】 [System Configuration, etc.] The components of each part shown in the drawings are functional concepts and do not necessarily need to be physically configured as shown. In other words, the specific forms of distribution and integration of each device are not limited to those shown in the drawings, and all or part of them can be functionally or physically distributed and integrated in any unit according to various loads and usage conditions. In addition, all or any part of the processing functions performed by each device can be realized by a CPU and a program executed on that CPU, or by hardware using wired logic.

【0053】 Furthermore, among the processes described in the embodiments above, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

【0054】 [Program] The data processing device 10 described above can be implemented by installing a program (data processing program) as packaged software or online software on a desired computer. For example, by having the above program run on an information processing device, the information processing device can be made to function as the data processing device 10. The information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as terminals such as PDAs (Personal Digital Assistants).

【0055】 Figure 9 shows an example of a computer that executes a data processing program. Computer 1000 has, for example, memory 1010 and a CPU 1020. Computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

【0056】 Memory 1010 includes ROM (Read Only Memory) 1011 and RAM (Random Access Memory) 1012. ROM 1011 stores, for example, a boot program such as BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

【0057】 The hard disk drive 1090 stores, for example, the OS 1091, application program 1092, program module 1093, and program data 1094. That is, the program that defines each process executed by the data processing device 10 is implemented as a program module 1093 in which computer-executable code is written. The program module 1093 is stored, for example, in the hard disk drive 1090.
For example, a program module 1093 for executing processes equivalent to the functional configuration of the data processing device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

【0058】 Furthermore, the data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, memory 1010 or hard disk drive 1090. The CPU 1020 then reads the program module 1093 and program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as needed and executes them.

【0059】 The program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.

【0060】
10 Data processing device
11 Input/output unit
12 Storage unit
13 Control unit
130 Labeling unit
131 Confidence calculation unit
132 Filtering unit
133 Data expansion unit
134 Learning unit

Claims

1. A data processing device comprising: a labeling unit that inputs, to a natural language processing model, an instruction to assign, to each data in a candidate data set of training data for a machine learning model, a label indicating the evaluation result of the data, and an instruction to refer to related data in the data set when assigning labels to the data, thereby causing the natural language processing model to assign labels indicating the evaluation result of the data to the data; a confidence calculation unit that calculates the confidence level of the labels assigned to the data based on the distribution of the predicted probabilities of each label for each data output from the natural language processing model; a filtering unit that filters data whose confidence level is equal to or greater than a predetermined value; and a data expansion unit that adds the filtered labeled data with a confidence level equal to or greater than the predetermined value to the training data of the machine learning model.

2. The data processing device according to claim 1, characterized in that the data related to the data to which the label is to be assigned is data that appears in the data group in a time series before the data to which the label is to be assigned.

3. The data processing device according to claim 1, characterized in that the data is data indicating the content of the speaker's utterance, and the data related to the data to which the label is assigned is data of an utterance that occurred earlier in time series than the data to which the label is assigned in the data group.

4. The data processing device according to claim 1, characterized in that the confidence calculation unit calculates the confidence level of the label assigned to the data based on the magnitude of the difference in the predicted probabilities of each label in the data.

5. The data processing device according to claim 1, further comprising a learning unit that trains the machine learning model using the training data to which the labeled data has been added by the data expansion unit.

6. The data processing device according to claim 1, characterized in that the label assigned to the data is a label indicating the predicted result of human emotion towards the data.

7. A data processing method performed by a data processing device, comprising the steps of: instructing a natural language processing model to assign a label indicating the evaluation result of a data to each data in a candidate data set of training data for a machine learning model, and instructing the model to refer to related data in the data set when assigning labels to the data, thereby causing the natural language processing model to assign labels indicating the evaluation result of the data to the data; calculating the confidence level of the labels assigned to the data based on the distribution of the predicted probabilities of each label for each data output from the natural language processing model; filtering data whose confidence level is equal to or greater than a predetermined value; and adding the filtered labeled data whose confidence level is equal to or greater than a predetermined value to the training data of the machine learning model.

8. A data processing program for causing a computer to execute: a step of inputting, to a natural language processing model, an instruction to assign a label indicating the evaluation result of the data to each data in a candidate data set of training data for a machine learning model, and an instruction to refer to related data in the data set when assigning the labels to the data, thereby causing the natural language processing model to assign labels indicating the evaluation result of the data to the data; a step of calculating the confidence level of the labels assigned to the data based on the distribution of the predicted probabilities of each label for each data output from the natural language processing model; a step of filtering data whose confidence level is equal to or greater than a predetermined value; and a step of adding the filtered labeled data with a confidence level equal to or greater than the predetermined value to the training data of the machine learning model.