Method for generating training data for training a customized AI assistant
The method addresses bias in AI assistant training data by preprocessing, generating synthetic data, and iteratively evaluating to achieve a neutral and diverse dataset, enhancing learning efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CLEVI INC
- Filing Date
- 2025-06-12
- Publication Date
- 2026-06-18
Abstract
Description
Method for generating training data for customized AI assistant training
[0001] The present invention relates to a method for generating training data for learning artificial intelligence algorithms, and more specifically, to a method for generating training data for customized AI assistant learning that eliminates bias in the training data to generate a large language model with ensured neutrality.
[0002]
[0003] Recently, artificial intelligence technology has been undergoing rapid advancements, driving innovative changes across various aspects of our lives. In particular, massive AI models that learn from vast amounts of data are further expanding their potential as they approach human capabilities for language understanding, generation, and reasoning; however, challenges that need to be addressed, such as data bias and ethical dilemmas, are also drawing attention.
[0004] Specifically, large artificial intelligence models such as Large Language Models (LLMs) are inevitably directly affected by the quality of the training data; if the training data contains bias or discriminatory information regarding specific groups, the AI model learns this biased information and reflects it as is to derive results.
[0005] For example, AI models trained on data containing biases against specific genders or races may make judgments unfavorable to certain groups, which can lead to serious problems that exacerbate social discrimination; thus, the issue of data bias is recognized as a matter of social equity and ethical responsibility that extends beyond a mere technical issue.
[0006] Meanwhile, recently, personalized artificial intelligence models (AI assistants) have been provided, learning user characteristics to offer AI model services customized to specific users.
[0007] To learn user characteristics effectively later on, such AI assistants must be trained without bias and be in a neutral state when first provided.
[0008] AI assistants are therefore inevitably more sensitive to bias issues than AI models provided for general application, and they require mechanical neutrality.
[0009] To eliminate such bias in training data, research and development are underway in two directions: 1) data augmentation techniques that modify or synthesize existing data to generate new data, thereby ensuring diversity in the training data and overcoming the limitations of existing data; and 2) adversarial training techniques that mitigate the bias of trained artificial intelligence models.
[0010] In this regard, Korean Patent Publication Nos. 10-2023-0060207 and 10-2025-0028770 present methods for preventing or eliminating bias in artificial intelligence training data.
[0011] However, such prior art presents only general directions and is difficult to apply to fields where mechanical neutrality must be clearly maintained, such as AI assistants.
[0012]
[0013] The present invention was devised to solve the aforementioned conventional problems, and the present invention aims to provide a method for generating training data for customized AI assistant learning that resolves the bias of training data for artificial intelligence learning and enables artificial intelligence to perform learning through a neutral data set.
[0014] Furthermore, the present invention aims to provide a method for generating training data for customized AI assistant training that compensates for the difficulty of collecting neutral training data by generating additional synthetic data from neutral training data, thereby enabling the securing of a sufficient amount of training data for artificial intelligence training.
[0015] Furthermore, the present invention aims to provide a method for generating training data for customized AI assistant training that ensures the equilibrium of training data by utilizing the embedding vectors of the training data, and that secures not only simple sum equilibrium but also distribution equilibrium of the training data, thereby resolving data clustering.
[0016] The present invention aims to provide a method for generating training data for customized AI assistant training, which trains a personalized AI assistant model through a dataset with ensured neutrality, thereby improving efficiency and accuracy in the AI assistant learning user characteristics.
[0017]
[0018] According to the features of the present invention for achieving the above-mentioned purpose, the present invention comprises a method for generating training data for customized AI assistant training, comprising: (A) a step of collecting raw data for large-scale language model training; (B) a preprocessing step of selecting and excluding data to be excluded from training from the collected raw data; (C) a training data generation step of generating new data from the preprocessed raw data; (D) a training data evaluation step of merging the raw data and the new data to form training data and evaluating bias regarding the training data; (E) a step of repeating steps (B) through (D) so that a pre-set evaluation result is derived; and (F) a step of training an LLM-based AI assistant with the training data that is the result of the execution of step (E).
[0019] At this time, the above step (B) may be performed by including: (B1) a duplicate data removal step for deleting duplicate data from collected raw data; (B2) an embedding information generation step for generating embedding information for raw data from which duplicate data has been deleted; (B3) a data filtering step for filtering and deleting over-biased raw data having embedding vector values (Rn) that fall outside a preset limit range (within maximum allowable deviation (Rs)) by comparing the embedding vector values of the generated embedding information; and (B4) a step of repeating the above step (B3) by changing the maximum allowable deviation (Rs) of the limit range so that the sum average of the embedding vectors of the filtered raw data becomes less than or equal to a preset equilibrium reference value (P).
[0020] Step (B1) may include a process of improving the accuracy of the raw data by correcting words and sentences through a natural language processing (NLP) model.
[0021] Additionally, the above step (B) may further include the step of (B5) repeating steps (A) through (B4) to collect additional raw data when the distribution (D) of the embedding values of the filtered raw data deviates from the allowable distribution range (Da).
[0022] The distribution (D) of the embedding values is calculated from the following [Mathematical Formula 1]:
[0023] [Mathematical Formula 1]
[0024] Here, Rn is the difference between the embedding vector value of the n-th data and the embedding reference value; Rs is the maximum allowable deviation that serves as the criterion for over-biased data filtering; and m is the number of training data.
[0025] In addition, the above steps (B2) through (B5) may be performed for each vector dimension of the embedding vector.
[0026] And the new training data generation in the above step (C) may be generated from preprocessed raw data through a synthetic data generation model.
[0027] In addition, the synthetic data generation model may be configured to include a data augmentation model that generates new data by applying modifications to the preprocessed raw data.
[0028] And the above synthetic data generation model may be configured to include a Generative Adversarial Network (GAN) model that generates new data through a network composed of a generator and a discriminator.
[0029] In addition, the synthetic data generation model may be configured to include a variational autoencoder (VAE) model that learns the probability distribution of input data to generate new data.
[0030] And the above step (D) may be performed by including: (D1) a step of improving the accuracy of the training data by correcting words and sentences through a natural language processing (NLP) model; and (D2) a step of evaluating the bias of the training data by analyzing the embedding result values of the training data.
[0031] Additionally, step (E) may include (E1) a step of repeating steps (C) through (D), when the sum average of the embedding vectors of the training data exceeds a preset equilibrium reference value (P), until the sum average of the embedding vectors of the training data becomes less than or equal to the preset equilibrium reference value (P).
[0032] And the above step (E1) may be performed for each vector dimension of the embedding vector.
[0033] Additionally, step (E) may further include (E2) a step of repeating steps (A) through (E1) to collect additional training data when the distribution (D) of the embedding values of the training data deviates from the allowable distribution range (Da).
[0034] And the above step (E2) may be performed for each vector dimension of the embedding vector.
[0035] Meanwhile, the above steps (A) through (E) may also be performed by dividing the previously set limit range into multiple sections and performing them section by section.
[0036]
[0037] The method for generating training data for customized AI assistant learning according to the present invention, as described above, can be expected to have the following effects.
[0038] In other words, the present invention has the effect of providing a neutral data set by resolving the bias of training data for artificial intelligence learning.
[0039] Furthermore, the present invention has the effect of securing a sufficient amount of training data for artificial intelligence learning by filtering collected training data to obtain unbiased and neutral training data, and generating additional synthetic data from the obtained training data.
[0040] Furthermore, the present invention secures the equilibrium of the training data by utilizing its embedding vectors, ensuring not only simple sum equilibrium but also distribution equilibrium. This allows an artificial intelligence model to be trained on a dataset in which the clustering of training data is resolved, and thus has the effect of implementing an artificial intelligence model that secures not only neutrality but also universality.
[0041] In particular, the present invention has the effect of resolving problems caused by the generation of synthetic data, by resolving the data clustering that may occur when data augmentation techniques are used.
[0042] In addition, the present invention has the effect of improving learning efficiency and accuracy when the generated AI assistant learns user characteristics by training a personalized AI assistant model through a neutral data set.
[0043]
[0044] FIG. 1 is a block diagram illustrating a specific embodiment of a learning data generation system according to the present invention.
[0045] FIG. 2 is a flowchart illustrating a specific embodiment of the method for generating learning data according to the present invention.
[0046] FIG. 3 is a flowchart illustrating a specific embodiment of a raw data processing process constituting a method for generating training data according to the present invention.
[0047] FIG. 4 is a conceptual diagram visually illustrating the raw data filtering process according to the present invention.
[0048] FIG. 5 is a conceptual diagram visually representing the embedding vector of raw data filtered by the present invention.
[0049] FIG. 6 is a conceptual diagram illustrating examples of embedding vectors of a dataset in which neutrality is secured but distribution equilibrium is not secured.
[0050] FIG. 7 is a conceptual diagram illustrating the process of generating training data in which distribution equilibrium is secured by the present invention.
[0051] FIG. 8 is a flowchart illustrating a specific embodiment of a training data bias evaluation method constituting a training data generation method according to the present invention.
[0052]
[0053] The present invention for achieving the above-mentioned purpose may take various embodiments, but typically comprises a method for generating training data for customized AI assistant training, comprising: (A) a step of collecting raw data for training a large-scale language model; (B) a preprocessing step of selecting and excluding data to be excluded from training from the collected raw data; (C) a training data generation step of generating new data from the preprocessed raw data; (D) a training data evaluation step of merging the raw data and the new data to form training data and evaluating bias regarding the training data; (E) a step of repeating steps (B) through (D) until a preset evaluation result is derived; and (F) a step of training an LLM-based AI assistant with the training data resulting from step (E); wherein step (B) comprises: (B1) a duplicate data removal step of deleting duplicate data from the collected raw data; (B2) an embedding information generation step of generating embedding information for the raw data from which duplicate data has been deleted; (B3) a data filtering step of comparing embedding vector values of the generated embedding information and filtering and deleting over-biased raw data having embedding vector values (Rn) that fall outside a preset limit range (that is, the range within the maximum allowable deviation (Rs)); and (B4) a step of repeating step (B3) while changing the maximum allowable deviation (Rs) of the limit range so that the sum average of the embedding vectors of the filtered raw data becomes less than or equal to a preset equilibrium reference value (P); and wherein step (E) comprises: (E1) a step of repeating steps (C) through (D), when the sum average of the embedding vectors of the training data exceeds the preset equilibrium reference value (P), until the sum average of the embedding vectors of the training data becomes less than or equal to the preset equilibrium reference value (P); and (E2) a step of repeating steps (A) through (E1) to collect additional training data when the distribution (D) of the embedding values of the training data deviates from the allowable distribution range (Da).
[0054]
[0055] Hereinafter, with reference to the attached drawings, we will examine a method for generating training data for customized AI assistant learning according to a specific embodiment of the present invention.
[0056] Before proceeding with the explanation, the effects, features, and methods for achieving the present invention will become clear from the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below but can be implemented in various different forms. These embodiments are provided merely to ensure that the disclosure of the present invention is complete and to fully inform those skilled in the art of the scope of the invention, and the present invention is defined only by the scope of the claims.
[0057] In describing the embodiments of the present invention, if it is determined that a detailed description of known functions or configurations may unnecessarily obscure the essence of the invention, such detailed description will be omitted. Furthermore, the terms described below are defined considering the functions in the embodiments of the present invention, and these may vary depending on the intentions or conventions of the user or operator. Therefore, such definitions should be based on the content throughout this specification.
[0058] Combinations of each block of the attached block diagram and each step of the flowchart may be executed by computer program instructions (execution engine), and since these computer program instructions may be loaded into the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed through the processor of the computer or other programmable data processing equipment create a means to perform the functions described in each block of the block diagram or each step of the flowchart.
[0059] Since these computer program instructions may be stored in computer-available or computer-readable memory that can be directed toward a computer or other programmable data processing equipment to implement a function in a specific way, the instructions stored in said computer-available or computer-readable memory may also be used to produce a manufactured item containing instruction means that perform the function described in each block of a block diagram or each step of a flowchart.
[0060] And, since computer program instructions can be loaded onto a computer or other programmable data processing equipment, instructions that perform a series of operation steps on a computer or other programmable data processing equipment to create a process executed by a computer and that execute the computer or other programmable data processing equipment may also provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.
[0061] Additionally, each block or each step may represent a module, segment, or part of code containing one or more executable instructions for executing specific logical functions, and in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order.
[0062] In other words, the two blocks or steps described can actually be performed substantially simultaneously, and can also be performed in the reverse order of their corresponding functions as needed.
[0063]
[0064] FIG. 1 is a block diagram illustrating a specific embodiment of a learning data generation system according to the present invention; FIG. 2 is a flowchart illustrating a specific embodiment of a learning data generation method according to the present invention; FIG. 3 is a flowchart illustrating a specific embodiment of a raw data processing process constituting a learning data generation method according to the present invention; FIG. 4 is a conceptual diagram visually representing a raw data filtering process according to the present invention; FIG. 5 is a conceptual diagram visually representing an embedding vector of raw data filtered by the present invention; FIG. 6 is a conceptual diagram illustrating examples of embedding vectors of a dataset in which neutrality is secured but distribution equilibrium is not secured; FIG. 7 is a conceptual diagram illustrating a learning data generation process in which distribution equilibrium is secured by the present invention; and FIG. 8 is a flowchart illustrating a specific embodiment of a learning data bias evaluation method constituting a learning data generation method according to the present invention.
[0065]
[0066] First, as illustrated in FIG. 1, a learning data generation system (100) for customized AI assistant learning according to the present invention is configured with a database (200) and an AI assistant (300) to provide an AI assistant (300) that is customized for the user.
[0067] To this end, the above-described learning data generation system (100) is configured to include a data collection unit (110), a data preprocessing unit (120), a data generation unit (130), and a data evaluation unit (140).
[0068] The above data collection unit (110) is a part that collects raw data from the web, news, SNS, open datasets, etc., and the collected raw data is stored in the above database (200) in various formats such as JSON, CSV, TXT, etc.
[0069] And the data preprocessing unit (120) removes duplicate data, incomplete data, and non-text from the collected raw data, and adjusts the bias to select a dataset in which neutrality is ensured.
[0070] At this time, the data preprocessing unit (120) generates an embedding vector for raw data to check and adjust bias, specifically using a natural language processing-based tokenizer and language model.
[0071] Meanwhile, the data generation unit (130) is a part that generates new data from preprocessed raw data, and can generate new data through a data augmentation technique that generates new data in the form of text substitution, back translation, sentence reconstruction, etc., or a synthetic data generation technique using a generative model (GAN or VAE).
[0072] And the above data evaluation unit (140) is a part that performs a bias evaluation on the generated training dataset and recursively repeats the filtering and generation process for the dataset so that the bias evaluation result satisfies the set criteria.
[0073] The specific execution functions and processes of the above data evaluation unit (140) will be explained in detail again when describing the method for generating learning data according to the present invention.
[0074] Meanwhile, the AI assistant (300) is configured to include a data learning unit (310), and the data learning unit (310) learns a learning data set generated by the learning data generation system (100) to implement the AI assistant (300).
[0075] At this time, the AI assistant (300) training may utilize a Transformer-based language model and may also utilize fine-tuning techniques.
[0076] Hereinafter, a method for generating training data according to the present invention will be described in detail.
[0077] As illustrated in FIG. 2, the method for generating training data according to the present invention begins with the data collection unit (110) collecting raw data (S100).
[0078] As previously mentioned, the raw data may be collected from various online sources, and previously collected data may also be used alongside it.
[0079] Next, the data preprocessing unit (120) performs a preprocessing process (S200) that identifies the data to be excluded from learning and removes it from the collected raw data.
[0080] At this time, the raw data preprocessing process (S200) is performed by including the removal of duplicate data (S210), generation of embedding information (S220), data filtering (S230), bias adjustment (S240), and embedding distribution adjustment (S250), as shown in FIG. 3.
[0081] Specifically, the above duplicate data removal (S210) process refers to deleting duplicate data among the collected raw data.
[0082] At this time, along with the removal of duplicate data, a process to improve the accuracy of the above training data by correcting words and sentences through a Natural Language Processing (NLP) model may also be performed in parallel.
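The duplicate removal process (S210) can be illustrated with a minimal Python sketch. The function names `normalize` and `remove_duplicates`, and the choice to treat texts as duplicates when they match after lowercasing and whitespace collapsing, are assumptions made for illustration; the source does not specify how duplicates are detected, and the NLP-based correction of words and sentences is not shown.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_duplicates(records):
    """Return records without duplicates, keeping the first occurrence in order.

    Empty records are dropped as well (assumption: they count as incomplete data).
    """
    seen = set()
    unique = []
    for text in records:
        key = normalize(text)
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

A production pipeline would more likely use near-duplicate detection; exact matching after normalization is only the simplest stand-in.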
[0083] And the above embedding information generation process (S220) is to generate embedding information for raw data from which duplicate data has been deleted. In natural language processing (NLP), embedding techniques refer to generating vector information by converting text data, such as words, sentences, or documents, into vectors in a high-dimensional space.
[0084] Various techniques can be applied to the generation and analysis of such embedding vectors. Specifically, one-hot encoding (assigning 1 to the index mapped to a token and 0 to every other index), word embedding techniques (Word2Vec, GloVe, FastText, etc.), Doc2Vec, Universal Sentence Encoder, and Transformer-based models (BERT, GPT, RoBERTa, DistilBERT, etc.) can be applied.
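Of the techniques listed, one-hot encoding is simple enough to show in a few lines. The sketch below is illustrative only: `build_vocabulary` and `one_hot` are hypothetical helper names, and whitespace tokenization is an assumption. The learned embeddings named above (Word2Vec, BERT, and so on) require trained models and are not reproduced here.

```python
def build_vocabulary(sentences):
    """Map each distinct lowercase token to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Return a vector with 1 at the token's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab[token.lower()]] = 1
    return vector
```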
[0085] And the above data filtering process (S230) refers to comparing the embedding vector values of the generated embedding information and filtering out and deleting over-biased raw data that has embedding vector values (Rn) that fall outside a preset limit range.
[0086] That is, as shown in FIG. 4(a), if the maximum allowable deviation is Rs and the embedding vector values of two specific raw data items are R1 and R2, the raw data item whose embedding vector value R1 is greater than Rs is filtered out and excluded.
[0087] Of course, the examples illustrated in FIGS. 4 to 7 are virtual images in which one-dimensional vector values are compared visually for convenience of explanation; since actual embedding vectors are very large, high-dimensional vectors, they cannot be visualized directly.
[0088] In addition, although the reference value (center) of the embedding vector varies depending on each embedding technique, for the convenience of explanation and diagramming in this invention, the reference value (center value) of the embedding vector is assumed to be 0.
[0089] As shown in FIG. 4(a), if raw data having embedding vector values greater than the maximum allowable deviation (Rs) are excluded, only raw data with values less than or equal to the maximum allowable deviation (Rs) are selected, as shown in FIG. 4(b).
[0090] Subsequently, the bias adjustment process (S240) refers to repeating the data filtering process (S230), which corresponds to step (B3), while changing the maximum allowable deviation (Rs) of the limit range so that the sum average of the embedding vectors of the filtered raw data becomes less than or equal to a preset equilibrium reference value (P).
[0091] At this time, the above equilibrium reference value (P) is an indicator representing the bias allowed for the training data. Since the bias of the training data cannot be exactly "0," it refers to a reference value that is considered to ensure neutrality when the sum of the embedding vectors falls within a specific range.
[0092] Through this process, an example of desirable raw data organized such that the sum of the embedding vectors of all raw data is less than or equal to the equilibrium reference value (P) is shown in FIG. 4 (b).
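The filtering (S230) and bias adjustment (S240) processes can be sketched for a single vector dimension as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the reference value is taken as 0, as in the text; the names `filter_by_deviation` and `adjust_bias` are hypothetical; and the source says only that Rs is changed, so shrinking it by a fixed factor (`shrink`) on each round is an assumed schedule.

```python
from statistics import fmean


def filter_by_deviation(deviations, rs):
    """S230: keep only values whose deviation from the reference (0) is within rs."""
    return [r for r in deviations if abs(r) <= rs]


def adjust_bias(deviations, rs, p, shrink=0.9, max_rounds=100):
    """S240: repeat the filtering with a smaller rs until |mean| <= p.

    Returns the kept values and the rs that produced them.
    """
    for _ in range(max_rounds):
        kept = filter_by_deviation(deviations, rs)
        if not kept:
            raise ValueError("no data left after filtering; more raw data is needed")
        if abs(fmean(kept)) <= p:
            return kept, rs
        rs *= shrink  # assumed schedule: tighten the limit range and try again
    raise RuntimeError("equilibrium reference value not reached")
```

Because every kept value lies within rs of the reference, the magnitude of the mean can never exceed rs, so the loop is bound to satisfy the criterion once rs has shrunk to p, provided some data survives.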
[0093] However, in special cases, even if the neutrality of the raw data is ensured through the above processes, the distribution may be concentrated in specific areas and distribution equilibrium may not be achieved.
[0094] For example, raw data having the embedding vector distributions shown in (a), (b), (c), and (d) of FIG. 6 has no bias (neutrality is ensured), but its distribution is concentrated in a specific region and distribution equilibrium is not achieved.
[0095] When an AI assistant is trained using training data that lacks distribution equilibrium in this way, there is a concern that the AI assistant's inference may lack universality.
[0096] Accordingly, in the present invention, an embedding distribution adjustment process (S250) can be performed.
[0097] The embedding distribution adjustment process (S250) refers to adjusting the dispersion of the raw data by repeating steps S100 to S240 to collect additional raw data when the distribution (D) of the embedding values of the filtered raw data deviates from the allowable distribution range (Da).
[0098] At this time, the allowable distribution range (Da) is an indicator for evaluating the dispersion of raw data, and is set as a range allowed around the midpoint of the maximum allowable deviation (Rs). If the average of the absolute values of the embedding vectors of the raw data falls within the allowable distribution range (Da), it is determined that the raw data is evenly distributed.
[0099] Expressed as a mathematical formula, the distribution (D) of the above embedding values is calculated from the following [Mathematical Formula 1], and
[0100]
[0101] Here, the above Rn is the difference between the embedding vector value of the n-th data and the reference value of the embedding vector;
[0102] The above Rs is the maximum allowable deviation that serves as the criterion for over-biased data filtering;
[0103] The above m is the number of training data;
[0104] Q denotes the summation operator (sigma, Σ).
[0105]
[0106] In the case of the calculation method of the above mathematical formula 1, the distribution (D) of the embedding value has a value between 0 and 1, and when the raw data is perfectly evenly distributed, a value of 0.5 is calculated.
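The body of [Mathematical Formula 1] is not reproduced in this text. One form that is consistent with the symbol definitions above (Rn, Rs, m, and a summation) and with the stated properties (a value between 0 and 1, equal to 0.5 for perfectly even data) is the mean absolute deviation normalized by Rs. It is offered here as an assumption, not as the formula of the original publication:

```latex
D = \frac{1}{m} \sum_{n=1}^{m} \frac{\lvert R_n \rvert}{R_s}
```

Under this form, every term lies between 0 and 1 because filtering guarantees that |Rn| does not exceed Rs, and data spread evenly between 0 and Rs averages to Rs/2, which gives D = 0.5.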
[0107] Therefore, in this case, the allowable distribution range (Da) is set to the range obtained by adding and subtracting an allowable value (α) around 0.5, that is, 0.5-α < D < 0.5+α.
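The distribution check can be sketched in Python as follows. The formula used for D, the mean of |Rn| divided by Rs, is an assumption chosen to match the properties described in the text (range 0 to 1, and 0.5 for evenly spread data); the function names `distribution` and `within_allowable_range` are likewise hypothetical.

```python
from statistics import fmean


def distribution(deviations, rs):
    """Assumed form of D: the mean absolute deviation, normalized by rs."""
    if rs <= 0:
        raise ValueError("rs must be positive")
    if not deviations:
        raise ValueError("at least one value is required")
    return fmean([abs(r) for r in deviations]) / rs


def within_allowable_range(d, alpha):
    """Check 0.5 - alpha < D < 0.5 + alpha, the allowable distribution range (Da)."""
    return 0.5 - alpha < d < 0.5 + alpha
```

For example, values clustered near the limit, such as 0.9, -0.9, 0.95, and -0.95 with Rs = 1, have a mean of zero (neutral) but give D = 0.925, which falls outside the range for α = 0.1.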
[0108]
[0109] However, even with the method described above, if the data is concentrated near the midpoint of the maximum allowable deviation (Rs), as shown in FIG. 6(d), the lack of distribution equilibrium is not detected and cannot be corrected.
[0110] Accordingly, in the present invention, as another embodiment for adjusting the embedding distribution, steps S220 to S240 can be performed section by section after dividing the preset limit range into multiple sections.
[0111] At this time, the more the interval is subdivided, the more perfectly the distribution equilibrium can be matched.
[0112] For example, as shown in FIG. 7, the interval from 0 to the maximum allowable deviation (Rs) is divided into three sections, the bias adjustment process is performed for each section, and the resulting data is merged to obtain data with a balanced distribution.
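The section-by-section variant can be sketched as below for one vector dimension. Several details are assumptions: sections are equal-width bands of the absolute deviation between 0 and Rs, and each section is balanced by dropping the most extreme value on the heavier side until the magnitude of its mean is at most P. That trimming rule is a substitute chosen for brevity; the source describes the per-section adjustment only as the bias adjustment process. All function names are hypothetical.

```python
from statistics import fmean


def split_into_sections(deviations, rs, sections=3):
    """Group values by |r| into equal-width sections of the interval [0, rs]."""
    width = rs / sections
    groups = [[] for _ in range(sections)]
    for r in deviations:
        if abs(r) > rs:
            continue  # outside the limit range; removed by filtering
        index = min(int(abs(r) / width), sections - 1)
        groups[index].append(r)
    return groups


def balance_section(values, p):
    """Drop the most extreme value on the heavier side until |mean| <= p."""
    kept = sorted(values)
    while kept and abs(fmean(kept)) > p:
        if fmean(kept) > 0:
            kept.pop()    # remove the most positive value
        else:
            kept.pop(0)   # remove the most negative value
    return kept


def balance_by_section(deviations, rs, p, sections=3):
    """Balance each section separately, then merge the results."""
    merged = []
    for group in split_into_sections(deviations, rs, sections):
        merged.extend(balance_section(group, p))
    return merged
```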
[0113] Although the raw data preprocessing process (S200) has been explained above for a single vector dimension for convenience, it is performed for each vector dimension of the embedding vector.
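Applying the check to every dimension can be sketched as follows. The helper names `per_dimension_means` and `biased_dimensions` are hypothetical, and the embedding reference value is assumed to be 0 in every dimension, as in the simplified explanation above.

```python
from statistics import fmean


def per_dimension_means(vectors):
    """Average the embedding vectors dimension by dimension."""
    return [fmean(column) for column in zip(*vectors)]


def biased_dimensions(vectors, p):
    """Return the indices of dimensions whose mean magnitude exceeds p."""
    means = per_dimension_means(vectors)
    return [i for i, mean in enumerate(means) if abs(mean) > p]
```

A dataset would pass the neutrality check only when `biased_dimensions` returns an empty list.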
[0114] Because a significant portion of the collected raw data is excluded through preprocessing, it becomes difficult to obtain a dataset large enough for AI training.
[0115] Accordingly, in order to secure sufficient training data, the data generation unit (130) generates new data from the preprocessed raw data (S300).
[0116] Specifically, the new data can be generated from the preprocessed raw data through a synthetic data generation model.
[0117] At this time, various models may be applied to the synthetic data generation model. For example, a data augmentation model that generates new data by applying modifications to preprocessed raw data may be applied, a generative adversarial network (GAN) model that generates new data through a network composed of a generator and a discriminator may be applied, and a variational autoencoder (VAE) model that generates new data by learning the probability distribution of input data may also be applied.
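Of the models mentioned, data augmentation by text substitution is the simplest to illustrate. The sketch below is an assumption-laden example: `substitute` is a hypothetical helper, the synonym table is supplied by the caller, and only one word is replaced per variant. GAN and VAE generators require trained networks and are not shown.

```python
def substitute(sentence, synonyms):
    """Generate variants of a sentence by replacing one word at a time.

    synonyms maps a lowercase word to a list of replacement words.
    """
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants
```

Variants generated this way stay close to their source sentence in embedding space, which is one reason the merged dataset is evaluated again for bias and distribution equilibrium.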
[0118] Afterwards, the data evaluation unit (140) re-evaluates the bias of the training data (S400).
[0119] At this time, the term "training data" refers to data obtained by merging preprocessed raw data and new data generated through the raw data.
[0120] Bias evaluation for the above training data is performed by including accuracy correction for the training data (S410), generation of embedding information for the training data (S420), and a bias evaluation process (S430), as illustrated in FIG. 8.
[0121] The accuracy correction (S410) for the above training data is a process of improving the accuracy of the above training data by correcting words and sentences through a Natural Language Processing (NLP) model, and the same technical principles as the accuracy improvement process mentioned in the explanation of the process of removing duplicate data (S210) are applied.
[0122] Of course, if accuracy correction has already been performed on the raw data in step S210, the correction may be performed only on the new data within the training data.
[0123] The generation of embedding information for the training data (S420) also uses the technique described for step S220; however, since the embedding information for the raw data has already been generated in step S220, embedding information may be generated only for the new data within the training data.
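Generating embeddings only for the new data amounts to caching. A minimal sketch, assuming embeddings are keyed by the text itself and that `embed` is any caller-supplied function that maps a text to a vector (both are illustrative assumptions, as is the name `embed_new_only`):

```python
def embed_new_only(texts, cache, embed):
    """Embed only the texts missing from the cache; return vectors for all texts.

    cache is a dict from text to vector and is updated in place.
    """
    for text in texts:
        if text not in cache:
            cache[text] = embed(text)
    return [cache[text] for text in texts]
```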
[0124] The bias evaluation process (S430) analyzes the embedding result values of the entire training data to evaluate the bias of the training data.
[0125] In theory, because the raw data has already been corrected for neutrality (bias) and distribution equilibrium, training data formed by adding new data generated from that raw data is also very likely to achieve neutrality and distribution equilibrium.
[0126] Nevertheless, the bias evaluation process (S430) is performed on the training data to reconfirm this.
[0127] The bias evaluation process (S430) checks whether the sum average of the embedding vectors of the training data exceeds a preset equilibrium reference value (P).
[0128] If the bias evaluation shows that the sum average of the embedding vectors of the training data exceeds the equilibrium reference value (P), the bias correction process is performed recursively by repeating steps S300 to S400 until the sum average of the embedding vectors of the training data becomes less than or equal to the equilibrium reference value (P).
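The loop in [0127] and [0128] can be sketched as follows. This is a minimal sketch under stated assumptions: the "sum average" is taken as the norm of the mean embedding vector (consistent with the mean bias control described in [0145]), and `generate` is a hypothetical callable standing in for the synthetic data generation model of step S300. The names `mean_bias`, `rebalance`, and `max_rounds` are not from the source.

```python
import numpy as np

def mean_bias(embeddings):
    """Norm of the mean embedding vector, compared against the reference value P."""
    return float(np.linalg.norm(np.mean(embeddings, axis=0)))

def rebalance(raw_emb, generate, P, max_rounds=20):
    """Repeat generation (S300) and evaluation (S400) until mean_bias <= P.

    `generate(current, round_idx)` is a hypothetical stand-in for the synthetic
    data generation model; it must return new embedding rows to merge in.
    """
    train = np.asarray(raw_emb, dtype=float)
    rounds = 0
    while mean_bias(train) > P:              # S430: equilibrium check
        if rounds >= max_rounds:             # assumed safety limit, not in the source
            raise RuntimeError("equilibrium reference value P not reached")
        new = generate(train, rounds)        # S300: generate new data
        train = np.vstack([train, new])      # merge raw data and new data
        rounds += 1
    return train, rounds
```

The iteration cap is a design choice added here so that a generator that never reduces the bias fails loudly instead of looping forever.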
[0129] In addition, to re-verify the distribution equilibrium, if the distribution (D) of the embedding values of the training data deviates from the allowable distribution range (Da), the present invention recursively supplements the distribution equilibrium by repeating the process from the raw data collection step (S100) through the training data bias evaluation step (S400) described above (S500).
[0130] As mentioned above, the verification and adjustment process of steps S100 through S500 is performed for each vector dimension of the embedding vector.
[0131] When training data with neutrality and distribution equilibrium is secured through such a process, the data learning unit (310) trains the AI assistant (300) using the training data (S600).
[0132]
[0133] The present invention as described above can be implemented through various functions; examples of functions applied for this purpose are examined below.
[0134]
[0135] 1. Definition of Variables
[0136]
[0137]
[0138] 2. Functions for the preprocessing process
[0139] (1) Multidimensional over-bias filter
[0140]
[0141] Filtering of multidimensional over-bias in the data can be performed through the above formula so that the Mahalanobis distance ( ) does not exceed the maximum allowable distance ( ).
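The formula referenced in [0141] is not reproduced in this text, so the following is only a sketch that uses the standard definition of the Mahalanobis distance, d(x) = sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)), with μ and Σ estimated from the data itself. The function name `mahalanobis_filter`, the parameter `d_max` (the maximum allowable distance), and the regularization term `eps` are assumptions.

```python
import numpy as np

def mahalanobis_filter(X, d_max, eps=1e-6):
    """Keep rows of X whose Mahalanobis distance from the sample mean is <= d_max."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    # Regularize the covariance so it stays invertible (eps is an assumption).
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    diff = X - mu
    # Squared distance d^2 = (x - mu)^T cov^{-1} (x - mu), computed row by row.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    dist = np.sqrt(np.maximum(d2, 0.0))
    keep = dist <= d_max
    return X[keep], dist
```

Because an extreme sample inflates the covariance estimate it is measured against, filtering is usually repeated after each removal, which matches the iterative adjustment of the limit range described elsewhere in the specification.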
[0142]
[0143] (2) Mean bias control
[0144]
[0145] The mean bias can be controlled through the above equation so that the norm of the sample mean (μ) becomes less than or equal to a preset value (P).
[0146]
[0147] (3) Interval sampling
[0148]
[0149] Interval sampling of the data can be controlled by calculating the kernel maximum mean discrepancy (MMD) distribution distance of the data through the above equation so that the calculated distribution distance is less than or equal to a preset value (ε).
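The equation referenced in [0149] is not reproduced in this text. As a hedged sketch, the code below computes the commonly used biased estimate of the squared kernel MMD with a Gaussian (RBF) kernel between two samples, for example a sampled subset and a reference sample. The choice of kernel, the bandwidth parameter `gamma`, the pairing of samples, and the function names are all assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return float(
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
    )

def within_tolerance(X, Y, eps, gamma=1.0):
    """True when the estimated distribution distance is at most the preset value eps."""
    return mmd2(X, Y, gamma) <= eps
```

The estimate is zero for identical samples and grows as the two samples drift apart, so `eps` acts as the tolerance on how far the sampled data may deviate from the reference.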
[0150]
[0151] (4) Sphericity index
[0152]
[0153] The multidimensional sphericity index (ψ) is calculated from the above equation, and the data can be managed so that the sphericity index is greater than a preset minimum value of the sphericity index ( ).
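The equation for ψ is not reproduced in this text, so the definition used below is an assumption rather than the formula of the invention: the ratio of the geometric mean to the arithmetic mean of the covariance eigenvalues, which equals 1 for a perfectly spherical (isotropic) distribution and approaches 0 as the data collapses along some direction. The function name `sphericity_index` is likewise hypothetical.

```python
import numpy as np

def sphericity_index(X):
    """Assumed sphericity measure: geometric mean / arithmetic mean of covariance eigenvalues."""
    X = np.asarray(X, dtype=float)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Eigenvalues of a symmetric matrix; clip tiny negatives from round-off.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    if eig.min() <= 0.0:
        return 0.0  # degenerate: no spread along at least one direction
    geo = np.exp(np.mean(np.log(eig)))
    return float(geo / eig.mean())
```

Under this assumed definition, the check in [0153] reduces to comparing `sphericity_index(X)` against the preset minimum value.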
[0154]
[0155] 3. Function for Quality Verification
[0156]
[0157] The quality of the training data can be verified by having an LLM classify it by bias, toxicity (tox), and factuality (fact).
[0158] The classification results can be integrated (ensemble, multiple LLMs, self-consistency) based on [the standard] to calculate an integrated bias score (B) using the following formula, and it can then be verified whether the calculated integrated bias score (B) is less than or equal to a preset value (τ).
[0159]
[0160]
[0161] 4. Function for loss management based on LLM bias regularization
[0162]
[0163] The loss index (L) for data loss management can be calculated by the formula above.
[0164]
[0165]
[0166] 5. Examples of criteria for verification items
[0167] In the present invention, various items are controlled to manage bias in the training data. The training data can be controlled by evaluating fairness, toxicity, factuality, and memorization according to the following criteria.
[0168]
[0169]
[0170] The core of this invention lies in quantitatively securing neutrality and distribution balance at every stage, from data collection and preprocessing to evaluation and iterative learning. This allows a personalized AI assistant to complete its initial learning in a neutral state before it learns individual characteristics.
[0171]
[0172] The scope of rights of the present invention is not limited to the embodiments described above but is defined by the claims, and it is obvious that a person skilled in the art may make various modifications and adaptations within the scope of rights described in the claims.
[0173]
[0174] The present invention relates to a method for generating training data for customized AI assistant learning that resolves bias in training data to generate a large language model with secured neutrality. According to the present invention, there is an effect of providing a neutral data set by resolving bias in training data for artificial intelligence learning.
Claims
1. A method for generating training data for customized AI assistant training, characterized by comprising: (A) a step of collecting raw data for training a large language model; (B) a preprocessing step of selecting, from the collected raw data, data to be excluded from training and excluding it; (C) a training data generation step of generating new data from the preprocessed raw data; (D) a training data evaluation step of merging the raw data and the new data to form training data and evaluating bias in the training data; (E) a step of repeating steps (B) through (D) so as to obtain a preset evaluation result; and (F) a step of training an LLM-based AI assistant with the training data resulting from step (E).
2. The method according to claim 1, characterized in that step (B) comprises: (B1) a duplicate data removal step of deleting duplicate data from the collected raw data; (B2) an embedding information generation step of generating embedding information for the raw data from which duplicate data has been deleted; (B3) a data filtering step of comparing the embedding vector values of the generated embedding information and filtering out and deleting over-biased raw data having embedding vector values (Rn) that fall outside a preset limit range (within the maximum allowable deviation (Rs)); and (B4) a step of repeating step (B3) while changing the maximum allowable deviation (Rs) of the limit range so that the sum average of the embedding vectors of the filtered raw data becomes less than or equal to a preset equilibrium reference value (P).
3. The method according to claim 2, characterized in that step (B1) includes a process of improving the accuracy of the raw data by correcting words and sentences through a Natural Language Processing (NLP) model.
4. The method according to claim 2, characterized in that step (B) further comprises: (B5) a step of additionally collecting raw data by repeating steps (A) through (B4) when the distribution (D) of the embedding values of the filtered raw data deviates from the allowable distribution range (Da).
5. The method according to claim 4, characterized in that the distribution (D) of the embedding values is calculated from the following [Mathematical Formula 1], [Mathematical Formula 1] wherein Rn is the difference between the embedding vector value of the n-th data item and the embedding reference value, Rs is the maximum allowable deviation that serves as the criterion for over-biased data filtering, and m is the number of training data items.
6. The method according to claim 4, characterized in that steps (B2) to (B5) are performed for each vector dimension of the embedding vector.
7. The method according to any one of claims 1 to 6, characterized in that, in step (C), the new training data is generated from the preprocessed raw data through a synthetic data generation model.
8. The method according to claim 7, characterized in that the synthetic data generation model comprises a data augmentation model that generates new data by applying transformations to the preprocessed raw data.
9. The method according to claim 7, characterized in that the synthetic data generation model comprises a Generative Adversarial Network (GAN) model that generates new data through a network composed of a generator and a discriminator.
10. The method according to claim 7, characterized in that the synthetic data generation model comprises a variational autoencoder (VAE) model that learns the probability distribution of input data to generate new data.
11. The method according to claim 7, characterized in that step (D) comprises: (D1) a step of improving the accuracy of the training data by correcting words and sentences through a Natural Language Processing (NLP) model; and (D2) a step of analyzing the embedding result values of the training data and evaluating the bias of the training data.
12. The method according to claim 11, characterized in that step (E) comprises: (E1) a step of repeating steps (C) through (D), when the sum average of the embedding vectors of the training data exceeds a preset equilibrium reference value (P), so that the sum average of the embedding vectors of the training data becomes less than or equal to the preset equilibrium reference value (P).
13. The method according to claim 12, characterized in that step (E1) is performed for each vector dimension of the embedding vector.
14. The method according to claim 12, characterized in that step (E) further comprises: (E2) a step of repeating steps (A) through (E1) to collect additional training data when the distribution (D) of the embedding values of the training data deviates from the allowable distribution range (Da).
15. The method according to claim 14, characterized in that step (E2) is performed for each vector dimension of the embedding vector.