In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with specific embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
In the fields of machine learning and pattern recognition, it is generally necessary to divide the samples into three independent parts: a training set (train set), a validation set (validation set), and a test set (test set). The training set is used to estimate the model, the validation set is used to determine the network structure or the parameters that control the complexity of the model, and the test set is used to test the performance of the finally selected optimal model. A typical division is that the training set occupies 50% of the total samples, and the validation set and the test set each occupy 25%. The three parts are randomly selected from the samples.
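The typical 50% / 25% / 25% division described above can be sketched as follows. This is a minimal illustration, not part of the claimed method; the function name `split_samples`, its parameters, and the fixed random seed are assumptions introduced here for the example.

```python
import random

def split_samples(samples, train_frac=0.5, val_frac=0.25, seed=0):
    """Randomly split samples into training, validation, and test sets.

    The fractions default to the typical 50/25/25 division; the seed is
    only there to make the illustration reproducible (an assumption).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

train_set, val_set, test_set = split_samples(range(100))
print(len(train_set), len(val_set), len(test_set))  # 50 25 25
```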
In this embodiment, the validation set used in model training is selected as the candidate data set. Preferably, the data of the validation set is text human-machine dialogue data or speech-to-text human-machine dialogue data. The initial category or the predicted category is a user intention category based on semantic analysis.
A text classification method of this embodiment includes the following steps:
The target model is used to predict the data of the validation set to obtain the initial category, predicted category, and prediction score of the validation set; wherein the predicted category or the initial category includes category I and categories Ii, i=1, 2, ..., n;
Select category I as the category to be optimized, and extract a data set A from the validation set according to the category I to be optimized; wherein the predicted category of the data set A is category I;
Extract a data set Ai from the data set A; wherein the initial category of the data set Ai is category Ii;
Sort the data set Ai according to the prediction score, and compile statistics on the prediction score and prediction accuracy of the data set Ai according to the sorting result to obtain the statistical score Si, i=1, 2, ..., n;
Input the text to be classified into the target model for prediction. When the output predicted category is category I and the prediction score is S, calculate (S-Si)/Si, i=1, 2, ..., n. If all of the values (S-Si)/Si are less than 0, the text to be classified is classified as category I; if any of the values (S-Si)/Si is greater than 0, select the i for which (S-Si)/Si is largest, and the text to be classified is classified as category Ii.
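The final decision step above can be sketched as follows. This is an illustrative sketch only: the function name `classify_with_statistical_scores`, the dictionary layout of the statistical scores, and the example values are assumptions, and a margin of exactly 0 (which the text does not address) is treated here the same as a negative margin.

```python
def classify_with_statistical_scores(predicted, score, target, stat_scores):
    """Apply the relative-margin rule (S - Si) / Si to one prediction.

    predicted   -- category output by the target model
    score       -- prediction score S of that output
    target      -- the category I being optimized
    stat_scores -- hypothetical mapping {category Ii: statistical score Si},
                   with every Si assumed to be greater than 0
    """
    if predicted != target:
        # The rule is only defined for outputs predicted as category I.
        return predicted
    margins = {cat: (score - s_i) / s_i for cat, s_i in stat_scores.items()}
    positive = {cat: m for cat, m in margins.items() if m > 0}
    if not positive:
        # No margin is greater than 0: keep category I.
        return target
    # Otherwise choose the Ii whose margin is the largest.
    return max(positive, key=positive.get)

# Made-up statistical scores, for illustration only.
stat_scores = {"I1": 0.8, "I2": 0.5}
print(classify_with_statistical_scores("I", 0.6, "I", stat_scores))  # I2
print(classify_with_statistical_scores("I", 0.4, "I", stat_scores))  # I
```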
Preferably, the prediction score is a softmax score: the prediction result of the target model is normalized by applying a softmax function to output a sequence of prediction scores that sum to 1, and the position of the maximum value in the sequence determines the final predicted category.
According to the prediction score and the prediction accuracy, the statistical score Si is obtained by performing a threshold calculation on the prediction scores in the data set Ai, such that, for the data whose prediction score is less than the statistical score, the corresponding predicted category has the lowest accuracy.
In this embodiment, the method for calculating the statistical score Si includes the following steps:
Arrange the data set Ai in ascending order of prediction score;
Let Si,n (n=1, 2, ..., len(R)) be the prediction scores in the data set Ai, where len(R) represents the length of the score sequence, and for each Si,n extract from the data set Ai the prediction data whose prediction score is less than Si,n;
Calculate the accuracy of each extracted data set; the Si,n at which the accuracy is lowest is taken as the statistical score Si. As shown in the following table:
Specifically, in addition to category I, the predicted category or the initial category also includes categories I1, I2, ..., In. First, the data set A is selected; the part of it whose initial category is I1 is the data set A1, and the data set A1 is sorted according to the prediction score. Then, according to the sorting result, statistics are compiled on the softmax score and the prediction accuracy of A1 to obtain the statistical score S1, such that when the part C1 of the data set whose prediction score is less than S1 is selected, the accuracy of C1 is the lowest. These steps are then repeated until I1, I2, ..., In have all been traversed, and the statistical scores S1, S2, ..., Sn are obtained.
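The threshold search for one statistical score Si can be sketched as follows. This is a minimal sketch under stated assumptions: each record is a hypothetical `(prediction_score, is_correct)` pair, how `is_correct` is determined is left to the caller, an empty subset (nothing below the smallest score) is skipped, and ties are resolved in favor of the smallest candidate score. The example scores are made up for illustration.

```python
def statistical_score(records):
    """Return (Si, lowest_accuracy) for one data set.

    records -- list of (prediction_score, is_correct) pairs; the pair layout
               is an assumption made for this sketch.
    """
    ordered = sorted(records, key=lambda r: r[0])  # ascending by score
    best_threshold, lowest_accuracy = None, None
    for threshold, _ in ordered:
        # Data whose prediction score is strictly below the candidate Si,n.
        subset = [correct for s, correct in ordered if s < threshold]
        if not subset:
            continue  # nothing lies below the smallest score
        accuracy = sum(subset) / len(subset)
        if lowest_accuracy is None or accuracy < lowest_accuracy:
            best_threshold, lowest_accuracy = threshold, accuracy
    return best_threshold, lowest_accuracy

records = [(0.30, True), (0.42, False), (0.55, False),
           (0.61, True), (0.78, True), (0.90, True)]
s_i, acc = statistical_score(records)
print(s_i)  # 0.61: the three records below it have accuracy 1/3
```

Repeating this call for each of A1, A2, ..., An would yield S1, S2, ..., Sn.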
Wherein, the softmax score is calculated by using the Softmax function. The Softmax function, also called the normalized exponential function, is a generalization of the logistic function. It can "compress" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z), so that each element lies in the interval (0,1) and the sum of all elements is 1. The Softmax formula is as follows:
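In its standard form, for a K-dimensional vector z, the function can be written as below. The second expression assumes that each component is a linear score z_j = xᵀw_j, which matches the description of x as the unnormalized input and w as its weight; the subscript notation w_j is an assumption made here.

```latex
\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K

P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_k}}
```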
Among them, x represents the current unnormalized value, Wn represents the weight of the current value (the weight is averaged), and j and k are counting subscripts. That is, the prediction result of the model is passed through an exponential function, which ensures that the probabilities are non-negative, and each transformed result is then divided by the sum of all transformed results to obtain the softmax score. The softmax score can be understood as the share of each transformed result in the total.
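The computation described above can be sketched in a few lines. This is a minimal illustration rather than the target model's actual output layer; subtracting the maximum before exponentiating is a common numerical-stability step added here and does not change the result, and the example input values are made up.

```python
import math

def softmax(values):
    """Map a list of real numbers to scores in (0, 1) that sum to 1."""
    shift = max(values)  # stability shift (an addition for this sketch)
    exps = [math.exp(v - shift) for v in values]  # exponentials are positive
    total = sum(exps)
    return [e / total for e in exps]

scores = softmax([2.0, 1.0, 0.1])
# The position of the largest score gives the predicted category index.
predicted_index = max(range(len(scores)), key=scores.__getitem__)
print(predicted_index)  # 0
```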
The softmax score algorithm of this embodiment is suitable for multi-classification problems. Before the target model outputs the prediction result, the output layer will be normalized by applying the softmax function to finally output a sequence of prediction scores with a sum of 1. The position of the maximum value in the sequence determines the prediction result category.
In this embodiment, the initial category of the validation set is manually annotated, and the predicted category of the validation set is obtained by predicting the data of the validation set with the target model. Further, a validation loss value is obtained from the difference between the predicted category and the initial category, and whether to stop training the target model is determined according to the validation loss value.
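One common way to decide when to stop training from the validation loss is a patience rule, sketched below. The text does not specify the stopping criterion, so the rule itself, the function name `should_stop`, the `patience` parameter, and the example loss values are all assumptions for illustration.

```python
def should_stop(val_losses, patience=3):
    """Return True if the validation loss has not improved in the last
    `patience` epochs compared with the best value seen before them.

    val_losses -- validation loss per epoch, oldest first.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

print(should_stop([0.9, 0.7, 0.6, 0.61, 0.62, 0.60]))  # True
print(should_stop([0.9, 0.7, 0.6, 0.55]))              # False
```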
Corresponding to the text classification method, the present invention also provides a question answering system, which includes:
A category pre-judgment module, which uses a target model to predict the data of the validation set to obtain the initial category, predicted category, and prediction score of the validation set; wherein the predicted category or the initial category includes category I and categories Ii, i=1, 2, ..., n;
A data screening module, which selects category I as the category to be optimized and extracts a data set A from the validation set according to the category I to be optimized, wherein the predicted category of the data set A is category I; and which extracts a data set Ai from the data set A, wherein the initial category of the data set Ai is category Ii;
A score statistics module, which sorts the data set Ai according to the prediction score and compiles statistics on the prediction score and prediction accuracy of the data set Ai according to the sorting result to obtain the statistical score Si, i=1, 2, ..., n;
A category judgment module, which inputs the text to be classified into the target model for prediction. When the output predicted category is category I and the prediction score is S, it calculates (S-Si)/Si, i=1, 2, ..., n; if all of the values (S-Si)/Si are less than 0, the text to be classified is classified as category I; if any of the values (S-Si)/Si is greater than 0, the i that maximizes (S-Si)/Si is selected, and the text to be classified is classified as category Ii.
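As an illustration of the data screening module, the extraction of data set A and of the subsets Ai can be sketched as follows. The record fields (`text`, `initial`, `predicted`, `score`), the function name, and the sample records are hypothetical; the sketch also assumes that the categories Ii are the categories other than I.

```python
from collections import defaultdict

def screen_validation_set(records, target):
    """Extract data set A (predicted as `target`) and its subsets Ai.

    records -- list of dicts with hypothetical keys 'text', 'initial',
               'predicted', and 'score'.
    Returns (A, {Ii: Ai}), with each Ai sorted by ascending score.
    """
    data_set_a = [r for r in records if r["predicted"] == target]
    subsets = defaultdict(list)
    for r in data_set_a:
        if r["initial"] != target:  # Ii: an initial category other than I
            subsets[r["initial"]].append(r)
    for items in subsets.values():
        items.sort(key=lambda r: r["score"])
    return data_set_a, dict(subsets)

# Made-up validation records, for illustration only.
records = [
    {"text": "q1", "initial": "I", "predicted": "I", "score": 0.92},
    {"text": "q2", "initial": "I1", "predicted": "I", "score": 0.55},
    {"text": "q3", "initial": "I1", "predicted": "I", "score": 0.41},
    {"text": "q4", "initial": "I2", "predicted": "I", "score": 0.63},
    {"text": "q5", "initial": "I1", "predicted": "I1", "score": 0.88},
]
a, subsets = screen_validation_set(records, "I")
print(len(a), [r["text"] for r in subsets["I1"]])  # 4 ['q3', 'q2']
```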
In addition, the present invention also provides a dialogue robot. The dialogue robot includes the question answering system described above. Specifically, the dialogue robot includes a memory, a processor, and the question answering system, which is stored in the memory and can run on the processor; when the question answering system is executed by the processor, the steps of the text classification method described in any one of the above are implemented.
The dialogue robot includes but is not limited to: industrial robots, service robots, intelligent customer service systems, etc., or intelligent devices with text input functions or voice input functions.
Specifically, the dialogue robot may include components such as a memory, a processor, an input unit, a display unit, and a power supply. Among them, the memory can be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area. The storage program area can store an operating system, at least one application program required by a function, and the like; the storage data area can store a question and answer library or a knowledge base created according to the use of the dialogue robot, and the like. In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Correspondingly, the memory may also include a memory controller to provide access to the memory by the processor and the input unit.
The input unit can be used to receive input numbers or characters or image information, and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control. Specifically, in addition to a microphone, the input unit of this embodiment may also include a touch-sensitive surface (such as a touch screen) and other input devices.
The display unit can be used to display information input by the user or information provided to the user, as well as various graphical user interfaces of the dialogue robot. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof. The display unit may include a display panel. Optionally, the display panel may be configured in the form of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), etc.
It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on its differences from the other embodiments. For parts that are the same or similar between the various embodiments, reference may be made to one another. Since the question answering system embodiment and the dialogue robot embodiment are basically similar to the method embodiment, their description is relatively brief, and for the relevant parts reference may be made to the description of the method embodiment.
Moreover, in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device including the element.
The above description shows and describes the preferred embodiments of the present invention. It should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the present invention through the above teachings or the technology or knowledge of related fields. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall fall within the protection scope of the appended claims of the present invention.