Clustering-based new question discovery method and apparatus

A classifier is trained on new questions bearing pseudo-labels, and prior knowledge of existing questions guides repeated rounds of clustering and classification, which improves the accuracy of discovering new FAQ questions.

CN115795003B · Active · Publication Date: 2026-06-26 · CHINA UNITED NETWORK COMM GRP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA UNITED NETWORK COMM GRP CO LTD
Filing Date
2022-11-01
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, the accuracy of discovering new FAQ questions is low, mainly because a small amount of prior knowledge is difficult to transfer and most new FAQ questions are unlabeled, making it difficult to construct high-quality supervisory signals to guide the learning of new questions.

Method used

By training a classifier with a new question that has pseudo-labels, and by using prior knowledge of the existing questions to perform an iterative process of clustering and classification, the alignment labels of the new questions are obtained, making full use of the association between the existing questions and the new questions.

Benefits of technology

Iterating clustering and classification yields better feature vectors for new questions, which improves the accuracy of new FAQ question discovery.



Abstract

The application provides a clustering-based new question discovery method and device and relates to the technical field of artificial intelligence. The method comprises the following steps: training a classifier on new questions with pseudo-labels; obtaining a current feature vector of each new question from the trained classifier; clustering the new questions according to the current feature vectors to obtain a current clustering result; obtaining the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result, and taking the alignment label as the new pseudo-label; iteratively executing the above process until there is no new cluster in the current clustering result; and outputting the alignment labels of the new questions obtained at that point. Because the classifier is trained on new questions with pseudo-labels, and clustering and classification are iterated until better feature vectors of the new questions are obtained, the existing questions are fully utilized and associated with the new questions, which improves the accuracy of discovering new questions.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a new problem discovery method and apparatus based on clustering.

Background Technology

[0002] Frequently Asked Questions (FAQs) are a means of providing online help. They involve pre-organizing a list of frequently asked questions and answers, publishing them on a webpage to provide users with consultation services, such as instructions or usage help for a product.

[0003] To ensure the effectiveness of FAQs, questions and answers need to be updated frequently to address some of the most frequently asked questions by customers. In existing technologies, the discovery of new FAQ questions is usually carried out by treating new FAQ questions as semi-supervised problems, using existing questions as prior knowledge to guide the clustering process of new FAQ questions and thus discovering new FAQ questions.

[0004] However, limited prior knowledge is difficult to transfer and use to discover new FAQ questions, and most new FAQ questions are unsupervised, making it difficult to construct high-quality supervisory signals to guide the learning of new FAQ questions, resulting in low accuracy in discovering new FAQ questions.

Summary of the Invention

[0005] This application provides a clustering-based method and apparatus for discovering new problems. By training a classifier with new problems and performing clustering and classification iteratively until a better feature vector of the new problem is obtained, the alignment label of the new problem is obtained. This fully utilizes existing problems and associates existing problems with new problems, thereby improving the accuracy of discovering new problems.

[0006] In a first aspect, this application provides a new problem discovery method based on clustering, including:

[0007] The classifier is trained based on the new question with pseudo-labels, and the current feature vector of the new question is obtained based on the trained classifier.

[0008] The new problem is clustered based on the current feature vector to obtain the current clustering result, which includes the clusters and their indices. The indices of the clusters are the pseudo-labels of the new problem.

[0009] Obtain the alignment label from the previous clustering result that is aligned with the pseudo-label in the current clustering result, and use the alignment label as the new pseudo-label;

[0010] The above steps are executed iteratively until no new clusters are found in the current clustering results, at which point the alignment label of the new problem obtained is output.

[0011] Optionally, the first new question with pseudo-labels is obtained as follows:

[0012] The classifier is trained based on the existing problem and the corresponding classification label to obtain the trained classifier.

[0013] The initial feature vector for the new problem is obtained based on the classifier.

[0014] The new problem is clustered based on the initial feature vector to obtain an initial clustering result, which includes clusters and cluster numbers, and the cluster numbers are pseudo-labels for the new problem.

[0015] Optionally, training the classifier based on the new question with pseudo-labels includes:

[0016] The pseudo-label of the new problem is used as the real label of the classifier;

[0017] Calculate the KL divergence between the true label and the predicted label output by the classifier based on the new question, and iteratively train the classifier based on the KL divergence;

[0018] When the loss function of the KL divergence converges, the trained classifier is obtained.

[0019] Optionally, after clustering the new problem based on the current feature vector to obtain the current clustering result, the method further includes:

[0020] The confidence level of each cluster is obtained based on the preset number of clusters and the number of new questions in the clusters.

[0021] New questions in clusters with confidence levels below a preset value are assigned to neighboring clusters, resulting in a new number of clusters. The clusters are then obtained using this new number of clusters.

[0022] Optionally, the clustering result further includes cluster centers; the step of assigning new questions in clusters with confidence levels below a preset value to neighboring clusters includes:

[0023] Obtain the Euclidean distance between the cluster centers of the clusters with confidence levels below a preset value and the centers of the other clusters;

[0024] New questions in clusters with confidence levels below a preset value are assigned to the cluster with the smallest Euclidean distance.

[0025] Optionally, obtaining the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result includes:

[0026] The Hungarian algorithm is used to obtain the mapping relationship between the cluster centers in the current clustering result and the cluster centers in the previous clustering result;

[0027] Based on the mapping relationship, obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result.

[0028] Optionally, the step of training the classifier based on the existing problem and the corresponding classification label to obtain the trained classifier includes:

[0029] The existing problem is classified using a classifier to obtain predicted labels;

[0030] Obtain the cross-entropy loss between the predicted label and the true label, and update the classifier parameters based on the cross-entropy loss;

[0031] Repeat the process of obtaining the cross-entropy loss between the predicted label and the true label until the cross-entropy loss is less than a preset value.

[0032] In a second aspect, this application provides a new problem discovery device based on clustering, comprising:

[0033] The acquisition module is used to train a classifier based on a new question with pseudo-labels, and to obtain the current feature vector of the new question based on the trained classifier;

[0034] The clustering module is used to cluster the new problem based on the current feature vector to obtain the current clustering result, which includes the clusters and the index of the clusters, and the index of the clusters is the pseudo-label of the new problem.

[0035] The alignment module is used to obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result, and use the alignment label as the new pseudo label;

[0036] The execution module is used to iteratively execute the above steps until there are no new clusters in the current clustering results, and then outputs the alignment label of the new problem obtained.

[0037] In a third aspect, this application provides an electronic device, including a memory and a processor;

[0038] The memory is used to store computer instructions; the processor is used to execute the computer instructions stored in the memory to implement the method of any one of the first aspects.

[0039] In a fourth aspect, this application provides a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the method of any one of the first aspects.

[0040] In a fifth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method of any one of the first aspects.

[0041] The clustering-based method and apparatus for discovering new problems provided in this application train a classifier on new problems with pseudo-labels, obtain the current feature vector of the new problem from the trained classifier, and cluster the new problem according to the current feature vector to obtain the current clustering result, which includes the clusters and the cluster indices. The cluster index is the pseudo-label of the new problem. The alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result is obtained and used as the new pseudo-label. This process is executed iteratively until no new clusters are found in the current clustering result, and the alignment label of the new problem is then output. Training the classifier on new problems with pseudo-labels, and iterating clustering and classification until a good feature vector of the new problem is obtained, fully utilizes the existing problems and associates them with the new problems, thereby improving the accuracy of discovering new problems.

Brief Description of the Drawings

[0042] Figure 1 is a flowchart of a clustering-based new problem discovery method provided in an embodiment of this application;

[0043] Figure 2 is a flowchart of the clustering-based new problem discovery method provided in an embodiment of this application;

[0044] Figure 3 is a schematic structural diagram of the clustering-based new problem discovery device provided in an embodiment of this application;

[0045] Figure 4 is a schematic structural diagram of a clustering-based new problem discovery electronic device provided in an embodiment of this application.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0047] To facilitate a clear description of the technical solutions in the embodiments of this application, some terms and technologies involved in the embodiments of this application will be briefly introduced below:

[0048] 1) Clustering: The process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering. A cluster generated by clustering is a set of data objects that are similar to one another and dissimilar to objects in other clusters.

[0049] 2) K-means clustering algorithm: an iterative clustering analysis algorithm. Its steps are: randomly select K objects as initial cluster centers; calculate the distance between each object and each cluster center; and assign each object to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the cluster center is recalculated based on the objects currently in the cluster. This process is repeated until a termination condition is met.

[0050] 3) Hungarian algorithm: It is a combinatorial optimization algorithm that solves the task assignment problem in polynomial time and can also be used to solve matching problems.

[0051] 4) Other terms

[0052] In the embodiments of this application, the terms "first" and "second" are used to distinguish identical or similar items with essentially the same function and effect, without limiting their order. Those skilled in the art will understand that the terms "first" and "second" do not limit the quantity or execution order, and that the terms "first" and "second" do not necessarily imply that they are different.

[0053] It should be noted that, in the embodiments of this application, the terms "exemplary" or "for example" are used to indicate examples, illustrations, or descriptions. Any embodiment or design scheme described as "exemplary" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or design schemes. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a specific manner.

[0054] FAQs are a means of providing online help. They are pre-organized question-and-answer pairs that are published on a webpage to provide users with consultation services, such as instructions or usage help for a product.

[0055] To ensure the effectiveness of the FAQ, the questions and answers need to be updated frequently to address some of the most frequently asked questions from customers.

[0056] There are two approaches to discovering new FAQ questions: one is to treat it as an unsupervised clustering problem and improve clustering performance by introducing effective weak supervision signals; the other is to treat it as a semi-supervised problem and use existing questions as prior knowledge to guide the clustering process.

[0057] However, both methods have the following two problems: first, most new problems are unsupervised, making it difficult to construct high-quality supervisory signals to guide the discovery process; second, most weakly supervised or supervised signals still do not fully utilize existing labeled data, resulting in poor accuracy in discovering new problems.

[0058] In view of this, embodiments of this application provide a clustering-based method and apparatus for discovering new FAQ questions. By using prior knowledge of existing FAQ questions (hereinafter referred to as existing questions), the method generalizes to the discovery of new FAQ questions (hereinafter referred to as new questions). Based on semi-supervised learning, the method continuously learns to discover new questions, thereby improving the accuracy of discovering new questions.

[0059] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. The following specific embodiments can be implemented independently or in combination with each other. The same or similar concepts or processes may not be described again in some embodiments.

[0060] Figure 1 is a flowchart of the clustering-based new problem discovery method provided in an embodiment of this application. As shown in Figure 1, the method includes the following steps:

[0061] S101. Train the classifier based on the new question with pseudo-labels, and obtain the current feature vector of the new question based on the trained classifier.

[0062] In this embodiment, a label is an identifier used to describe the category of a new problem; different types of new problems have different labels. A pseudo-label is an identifier used temporarily to describe the category of a new problem during the discovery process. The label and pseudo-label of a new problem may be the same or different.

[0063] A classifier is a model used to classify problems. In this embodiment, a classification layer can be superimposed on a feature extractor based on a BERT pre-trained model as the classifier used in this embodiment.

[0064] In this embodiment of the application, training the classifier based on a new problem with pseudo-labels means updating the parameters of the classifier based on the pseudo-labels of the new problem. For example, the classifier is used to classify the new problem to obtain the classification result, i.e., the classification label. The loss function between the classification label and the pseudo-label is calculated. The classifier is updated based on the loss function. The above process is iteratively executed until the loss function meets the requirements and the trained classifier is obtained.

[0065] In this embodiment of the application, the feature vector is used to reflect some inherent features of the new problem itself, such as the length of the text, keywords, and other information.

[0066] In this embodiment of the application, the classification layer of the classifier after training is removed to obtain a feature extractor that can extract features for a new problem. The feature extractor is used to extract features for the new problem to obtain the current feature vector of the new problem.

[0067] S102. Cluster the new problem based on the current feature vector to obtain the current clustering result. The current clustering result includes the clusters and their indices. The indices of the clusters are the pseudo-labels of the new problem.

[0068] In this embodiment, a cluster is a set of similar new problems produced when the new problems are clustered. The cluster index is a number assigned arbitrarily to each cluster during clustering, and different clusters have different indices. The cluster index can be used to represent the cluster category of a new problem, that is, the current label of the new problem.

[0069] In this embodiment, a general clustering algorithm can be used to cluster the new problem, such as the K-means clustering algorithm. This embodiment does not limit the type of clustering algorithm.

[0070] In this embodiment of the application, a clustering algorithm is used to cluster the new problem based on the extracted current feature vector to obtain the current clustering result.

[0071] In this embodiment of the application, after obtaining the cluster number of the new problem, the cluster number can be used as a pseudo-label for the new problem, and the subsequent new problem discovery process can be carried out based on the pseudo-label.
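As a rough illustration of step S102, the sketch below clusters a handful of toy feature vectors with a plain K-means loop and treats the resulting cluster indices as pseudo-labels. The function names `kmeans` and `squared_distance`, the toy vectors, and the fixed random seed are assumptions made for this example; the embodiment itself does not prescribe a particular implementation or clustering algorithm.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100, seed=0):
    """Return (labels, centers); labels[i] is the cluster index of vectors[i]."""
    rng = random.Random(seed)
    # Randomly select K objects as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        # Assign each object to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # termination: assignments stopped changing
            break
        labels = new_labels
        # Recompute each center from the objects currently in the cluster.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy "current feature vectors" for six new questions (illustrative values).
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
pseudo_labels, centers = kmeans(features, k=2)
print(pseudo_labels)  # cluster indices, used as pseudo-labels
```

Which of the two groups receives index 0 depends on the random initialization, which is why a later alignment step is needed to keep labels consistent between rounds.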

[0072] Optionally, in order to guide the process of discovering new problems by utilizing existing problems, that is, to construct effective supervisory signals, this embodiment of the application uses the labels of existing problems to guide the process of discovering new problems.

[0073] Optionally, in this embodiment, an initial classifier is trained using existing problems and their corresponding labels. New problems are then clustered based on the trained initial classifier, and the cluster index in the clustering result is used as the first pseudo-label of the new problem. Training the classifier using existing problems strengthens the distinction and connection between new and existing problems during the discovery process, thereby improving the accuracy of new problem discovery.

[0074] S103. Obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result, and use the alignment label as the new pseudo label.

[0075] In this embodiment of the application, the aligned label refers to the label obtained by aligning the pseudo-labels in the clustering results with the pseudo-labels of the new problem before training.

[0076] In this embodiment of the application, an alignment algorithm can be used to obtain the alignment label of the new problem, such as the Hungarian algorithm.

[0077] In this embodiment of the application, since the clustering results may be different each time, in order to maintain the consistency of the training labels of the trained classifier in each round, it is necessary to align the pseudo-labels in the previous clustering result with those in the current clustering result.

[0078] In this embodiment of the application, after obtaining the alignment label, the alignment label is used as a new pseudo-label for the new problem, and the subsequent new problem discovery process is carried out.
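The alignment in step S103 can be pictured as a minimum-cost matching between the cluster centers of the current round and those of the previous round. The sketch below is not the Hungarian algorithm itself: it searches all permutations exhaustively, which gives the same optimal matching for a few clusters but does not scale. In practice a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` would likely be used instead. The function name `align_labels`, the use of Euclidean distance between centers as the cost, and the assumption that both rounds have the same number of clusters are choices made for this example only.

```python
from itertools import permutations
from math import dist

def align_labels(current_centers, previous_centers):
    """Map each current cluster index to a previous cluster index so that the
    total center-to-center Euclidean distance is minimal.

    Exhaustive stand-in for the Hungarian algorithm; assumes both rounds
    have the same (small) number of clusters.
    """
    k = len(current_centers)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(k)):
        cost = sum(dist(current_centers[i], previous_centers[perm[i]])
                   for i in range(k))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {i: best_perm[i] for i in range(k)}

previous_centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
# Same three clusters, but the clustering run numbered them differently.
current_centers = [[4.8, 5.1], [0.1, 4.9], [0.2, -0.1]]

mapping = align_labels(current_centers, previous_centers)
current_pseudo_labels = [0, 0, 1, 2, 2]
aligned_labels = [mapping[label] for label in current_pseudo_labels]
print(mapping)         # {0: 1, 1: 2, 2: 0}
print(aligned_labels)  # [1, 1, 2, 0, 0]
```

The aligned labels then replace the raw cluster indices as the pseudo-labels for the next round of classifier training.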

[0079] S104. Iteratively execute steps S101 to S103 until there are no new clusters in the current clustering result, and output the alignment label of the new problem obtained.

[0080] In this embodiment of the application, after the alignment label is used as the new pseudo label for the current problem, the process of classifying and clustering the new problem shown in steps S101 to S103 is repeated according to the new pseudo label until there are no new clusters in the current clustering results during clustering. At this point, the process of discovering the new problem is stopped, and the alignment label of the new problem is output as the classification label, that is, the new problem has been discovered.

[0081] The clustering-based new problem discovery method provided in this application trains a classifier on new problems with pseudo-labels, obtains the current feature vector of the new problem from the trained classifier, and clusters the new problem according to the current feature vector to obtain the current clustering result, which includes the clusters and the cluster indices. The cluster index is the pseudo-label of the new problem. The alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result is obtained and used as the new pseudo-label. This process is executed iteratively until no new clusters are found in the current clustering result, and the alignment label of the new problem is then output. Training the classifier on new problems with pseudo-labels, and iterating clustering and classification until a good feature vector of the new problem is obtained, fully utilizes the existing problems and associates them with the new problems, thereby improving the accuracy of new problem discovery.
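The control flow of steps S101 to S104 can be summarized as a loop. The skeleton below is only a sketch: `train`, `extract`, `cluster`, and `align` are hypothetical callables supplied by the caller, `max_rounds` is an added safety limit, and the stopping test (no label appears that was absent from the previous round) is one possible reading of "no new clusters in the current clustering result".

```python
def discover_new_questions(questions, pseudo_labels, train, extract,
                           cluster, align, max_rounds=20):
    """Iterate classification and clustering until no new cluster appears."""
    previous_labels = list(pseudo_labels)
    for _ in range(max_rounds):
        model = train(questions, previous_labels)          # S101: fit on pseudo-labels
        features = [extract(model, q) for q in questions]  # S101: current feature vectors
        current_labels = cluster(features)                 # S102: cluster indices
        aligned = align(current_labels, previous_labels)   # S103: aligned labels
        if set(aligned) <= set(previous_labels):           # S104: no new cluster
            return aligned
        previous_labels = aligned                          # new pseudo-labels
    return previous_labels

# Trivial stand-ins so the skeleton can be run end to end (all hypothetical).
questions = ["reset password", "bill", "change plan", "fee"]
result = discover_new_questions(
    questions,
    pseudo_labels=[0, 0, 0, 0],
    train=lambda qs, labels: None,
    extract=lambda model, q: len(q),                      # toy feature: text length
    cluster=lambda feats: [0 if f < 5 else 1 for f in feats],
    align=lambda current, previous: list(current),        # identity alignment
)
print(result)  # [1, 0, 1, 0]
```

With the toy stand-ins, the first round introduces a label that was not present before, so the loop continues; the second round introduces nothing new and the aligned labels are returned.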

[0082] Figure 2 is a flowchart of the clustering-based new problem discovery method provided in an embodiment of this application. On the basis of the embodiment shown in Figure 1, the method is described further here. As shown in Figure 2, it includes the following steps:

[0083] S201. Train the classifier based on the existing problem and the corresponding classification label to obtain the trained classifier.

[0084] In this embodiment, the existing questions are those in the FAQ knowledge base. Each question has its corresponding category label, such as knowledge title and question category. The classifier is pre-trained based on the existing questions to obtain a trained classifier. The trained classifier ensures that subsequent feature extraction can proceed smoothly.

[0085] In this embodiment of the application, the classifier is trained based on the existing problem and the corresponding classification label as follows:

[0086] a1: Use a classifier to classify the existing problem and obtain the predicted label.

[0087] In this embodiment, the classifier is a classification layer superimposed on the feature extractor of the BERT pre-trained model. The feature extractor of the BERT pre-trained model can extract feature vectors from the existing problem and input the extracted feature vectors into the classification layer to obtain the classification label of the existing problem output by the classifier.

[0088] a2: Obtain the cross-entropy loss between the predicted label and the true label, and update the classifier parameters based on the cross-entropy loss.

[0089] In this embodiment, the true label is the classification label corresponding to the existing problem. The cross-entropy loss between the classification label output by the classifier and the true label is calculated, and the classifier parameters are updated based on the value of the cross-entropy loss.

[0090] In this embodiment, a loss function is used to calculate the loss between the classifier's output label and the true label, for example the cross-entropy loss function or the mean squared error (MSE) function. This embodiment does not limit the type of loss function.

[0091] a3: Repeat the process of obtaining the cross-entropy loss between the predicted label and the true label until the cross-entropy loss is less than the preset value.

[0092] In this embodiment of the application, after updating the classifier parameters according to the value of cross-entropy loss, the updated classifier is used to classify the existing problem to obtain new predicted labels. The cross-entropy loss is calculated based on the new predicted labels and the true labels, and the process of updating the classifier parameters is iteratively executed until the cross-entropy loss is less than a preset value.
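Steps a1 to a3 amount to a standard supervised training loop with a loss threshold as the stopping rule. The sketch below uses a tiny softmax classifier over hand-made feature vectors in place of the BERT-based classifier described in this embodiment; the feature values, learning rate, threshold of 0.1, and the names `train_until_threshold` and `predict` are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_until_threshold(features, labels, num_classes,
                          threshold=0.1, lr=0.5, max_epochs=5000):
    """Gradient descent on cross-entropy; stop once the mean loss < threshold."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    loss = float("inf")
    for _ in range(max_epochs):
        grads = [[0.0] * dim for _ in range(num_classes)]
        loss = 0.0
        for x, y in zip(features, labels):
            # a1: classify the existing question to get predicted probabilities.
            probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
            # a2: cross-entropy between the prediction and the true label.
            loss -= math.log(probs[y])
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    grads[c][d] += err * x[d]
        loss /= len(features)
        if loss < threshold:  # a3: stop when the loss is below the preset value
            break
        for c in range(num_classes):
            for d in range(dim):
                weights[c][d] -= lr * grads[c][d] / len(features)
    return weights, loss

def predict(weights, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return logits.index(max(logits))

# Toy features for four existing questions; the last component is a bias term.
features = [[1.0, 0.0, 1.0], [0.9, 0.1, 1.0], [0.0, 1.0, 1.0], [0.1, 0.9, 1.0]]
labels = [0, 0, 1, 1]
weights, loss = train_until_threshold(features, labels, num_classes=2)
print([predict(weights, x) for x in features])
```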

[0093] S202. Based on the trained classifier, extract features for the new problem and cluster the extracted feature vectors to obtain the initial clustering results.

[0094] In this embodiment, the classification layer of the classifier trained on the existing problem is removed and used as the feature extractor for the new problem feature extraction.

[0095] The new problem is input into the feature extractor to obtain the feature vector of the new problem output by the feature extractor. Then, a clustering algorithm, such as the K-means clustering algorithm, is used to cluster the feature vector to obtain the clustering result of the new problem. The cluster index in the clustering result is used as the pseudo label of the new problem.

[0096] Optionally, a clustering number needs to be preset before clustering, and clustering is performed based on this clustering number. In this embodiment of the application, twice the number of existing question labels can be used as the clustering number for new questions. It is understood that this clustering number can also be set according to actual needs.

[0097] S203. Train the classifier based on the new problem with pseudo-labels, and obtain the current feature vector of the new problem based on the trained classifier.

[0098] In this embodiment of the application, after obtaining the pseudo-label of the new problem, the classifier can be trained based on the pseudo-label of the new problem to better classify the new problem, as shown below:

[0099] Use the pseudo-label of the new problem as the real label of the classifier; calculate the KL divergence between the real label and the predicted label that the classifier outputs for the new problem, and iteratively train the classifier based on the KL divergence; when the loss function of the KL divergence converges, the trained classifier is obtained.

[0100] In this embodiment of the application, KL divergence is a measure of how much one probability distribution differs from another.

[0101] In this embodiment of the application, the pseudo-labels of the new questions obtained by clustering follow a probability distribution P, and the predicted labels output by the classifier based on the new questions follow a probability distribution Q. The KL divergence between the true labels and the predicted labels output by the classifier based on the new questions can be obtained according to the first formula.

[0102] The first formula is shown below:

[0103] $$\mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

[0104] Where $p_{ij}$ is the true-label probability value, $q_{ij}$ is the predicted-label probability value, $i$ is the sample index, and $j$ is the label index.

[0105] In this embodiment of the application, after obtaining the KL divergence, the parameters in the classifier are adjusted according to the KL divergence, and the above process is iteratively executed until the loss function of the KL divergence converges, thus obtaining the trained classifier.
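To make the loss in step S203 concrete, the following sketch computes a KL divergence between pseudo-label distributions and predicted distributions, summed over samples i and labels j. It uses the standard definition of KL divergence with the symbols p_ij and q_ij from paragraph [0104]; whether the original formula also averages over samples is not recoverable from the text, so the plain sum here is an assumption, as are the function name `kl_divergence` and the `eps` guard.

```python
import math

def kl_divergence(p_rows, q_rows, eps=1e-12):
    """Sum over samples i and labels j of p_ij * log(p_ij / q_ij).

    Terms with p_ij == 0 contribute nothing; q_ij is floored at eps so that
    the logarithm stays finite.
    """
    total = 0.0
    for p_row, q_row in zip(p_rows, q_rows):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                total += p * math.log(p / max(q, eps))
    return total

# Pseudo-labels as one-hot distributions P, classifier outputs as Q (toy values).
p = [[1.0, 0.0], [0.0, 1.0]]
q = [[0.8, 0.2], [0.3, 0.7]]
loss = kl_divergence(p, q)
print(round(loss, 4))  # 0.5798
```

When the pseudo-labels are one-hot, as here, the KL divergence equals the cross-entropy between the pseudo-labels and the predictions, since the entropy of a one-hot distribution is zero.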

[0106] S204. Cluster the new problem based on the current feature vector to obtain the current clustering result.

[0107] The implementation of step S204 in this embodiment is similar to that of step S102 in the embodiment shown in Figure 1 and will not be repeated here.

[0108] It is understood that in the embodiments of this application, when clustering new problems, the preset number of clusters is determined based on the number of pseudo-labels for the new problems. For example, the preset number of clusters can be twice the number of pseudo-labels for the new problems.

[0109] S205. Assign new questions in clusters with confidence levels lower than the preset value to neighboring clusters.

[0110] In this embodiment of the application, after obtaining the clustering results, it is necessary to obtain the confidence level of each cluster and remove the clusters with low confidence levels to reduce the workload of subsequent new problem discovery.

[0111] Specifically, based on the preset number of clusters and the number of new questions in each cluster, the confidence level of each cluster is obtained; new questions in clusters with confidence levels lower than the preset value are assigned to neighboring clusters, and a new number of clusters is obtained. The clusters are then obtained using this new number of clusters.

[0112] In this embodiment, the confidence level of each cluster can be determined by the ratio of the number of new questions in the cluster to a preset number of clusters. The preset value is determined by the ratio between the number of new questions to be clustered and the preset number of clusters. Clusters with confidence levels lower than the preset value are removed to obtain a new number of clusters.

[0113] That is, the new cluster number can be obtained according to the following formula:

[0114]

[0115] Where K′ is the preset number of clusters, N is the number of samples to be clustered, $S_i$ is the i-th generated cluster, and δ(condition) is the indicator function: if the condition is met, its output is 1; otherwise, it is 0.
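
Written out with these symbols, and assuming that the confidence level of a cluster is measured by its size $|S_i|$ and that the preset value is $N/K'$ as described in [0112], the count of retained clusters can be sketched as follows. The symbol $K$ for the new number of clusters is an assumption.

```latex
K = \sum_{i=1}^{K'} \delta\!\left( |S_i| \ge \frac{N}{K'} \right)
```

For instance, with $N = 100$ samples, $K' = 4$ and cluster sizes 40, 35, 20 and 5, the threshold is 25, so only the first two clusters are counted and $K = 2$.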

[0116] Understandably, when performing the first clustering of new questions, the preset number of clusters is determined based on the number of labels for existing questions, and the preset number of clusters for each subsequent round of clustering is determined based on the number of pseudo-labels for new questions in the previous round.

[0117] After removing clusters with confidence levels below a preset value, new questions from these clusters need to be reassigned to neighboring clusters, as shown below:

[0118] Obtain the Euclidean distance between the cluster center of the cluster with a confidence level lower than the preset value and the centers of the other clusters; assign the clusters with a confidence level lower than the preset value to the cluster with the smallest Euclidean distance.

[0119] In this embodiment, the cluster center refers to the location of the center point of each cluster. The cluster center can be calculated based on all points within a cluster. Euclidean distance refers to the true distance between two points in m-dimensional space, or the natural length of a vector.

[0120] In this embodiment, after the clusters with confidence levels lower than the preset value are identified, the Euclidean distance between the center of each such cluster and the centers of the other clusters is calculated, and the new problems in that cluster are assigned to the cluster with the smallest Euclidean distance. Correspondingly, the pseudo-label of each new problem in the removed cluster is changed to the index of the cluster it was moved to.
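
As a non-limiting illustration, the removal and reassignment described above can be sketched as follows. The sketch assumes that confidence is judged by cluster size against the threshold N / K′, that the cluster center is the coordinate-wise mean of its members, and that removed clusters are merged only into retained clusters. The function name and the `preset_k` parameter are hypothetical.

```python
import math

def merge_low_confidence_clusters(vectors, labels, preset_k):
    """Remove clusters smaller than N / K' and move their members to the nearest kept cluster."""
    threshold = len(vectors) / preset_k  # preset value N / K'
    members = {}
    for vector, label in zip(vectors, labels):
        members.setdefault(label, []).append(vector)
    # Cluster center = coordinate-wise mean of the cluster's members.
    centers = {c: [sum(col) / len(m) for col in zip(*m)] for c, m in members.items()}
    # Assumes preset_k is at least the number of non-empty clusters,
    # so that at least one cluster reaches the threshold.
    kept = [c for c, m in members.items() if len(m) >= threshold]
    target = {c: c for c in kept}
    for c in members:
        if c not in target:
            # Nearest kept cluster by Euclidean distance between centers.
            target[c] = min(kept, key=lambda k: math.dist(centers[c], centers[k]))
    new_labels = [target[label] for label in labels]
    return new_labels, len(kept)
```

The second return value is the new number of clusters, and the first is the list of updated pseudo-labels.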

[0121] S206. Obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result, and use the alignment label as the new pseudo label.

[0122] In this embodiment of the application, the alignment label that is aligned with the pseudo-label in the current clustering result from the previous clustering result can be obtained according to the Hungarian algorithm, as shown below:

[0123] The Hungarian algorithm is used to obtain the mapping relationship between the cluster centers in the current clustering result and the cluster centers in the previous clustering result; based on the mapping relationship, the alignment labels in the previous clustering result that are aligned with the pseudo-labels in the current clustering result are obtained.

[0124] For example, the Hungarian algorithm can be used to obtain the mapping $G$ from the cluster centers $C_l$ of the previous round of new problem data to the cluster centers $C_c$ of the current round, that is:

[0125] $C_c = G(C_l)$

[0126] This yields the correspondence between the current pseudo-label $y_c$ of the new problem data and the previous round's pseudo-label $y_{align}$:

[0127] $y_{align} = G^{-1}(y_c)$

[0128] Here, $G^{-1}$ denotes the inverse mapping of $G$.

[0129] That is, the alignment label $y_{align}$ corresponding to the current-round pseudo-label $y_c$ is obtained through the Hungarian algorithm. This alignment label is used as the new pseudo-label for the new problem and participates in the next round of new problem discovery.
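
As a non-limiting illustration, the label alignment can be sketched as follows. To stay within the Python standard library, the Hungarian algorithm named in the text is replaced here by an exhaustive search over permutations, which returns the same minimum-cost one-to-one matching but is practical only for a small number of clusters. The use of Euclidean distance between cluster centers as the matching cost, the requirement of equal cluster counts in both rounds, and the function name are assumptions.

```python
import itertools
import math

def align_labels(current_centers, previous_centers, current_labels):
    """Map current cluster indices to previous ones by minimum total center distance."""
    k = len(current_centers)
    if k != len(previous_centers):
        raise ValueError("this sketch assumes equal cluster counts in both rounds")
    # Exhaustive stand-in for the Hungarian algorithm: try every one-to-one matching.
    best = min(
        itertools.permutations(range(k)),
        key=lambda perm: sum(
            math.dist(current_centers[i], previous_centers[perm[i]]) for i in range(k)
        ),
    )
    # best[i] is the previous-round index matched to current cluster i (the inverse mapping).
    return [best[label] for label in current_labels]
```

The returned list gives, for each new problem, the previous-round cluster index that its current cluster was matched to.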

[0130] S207. Iteratively execute steps S203 to S206 until there are no new clusters in the current clustering result, and output the alignment label of the new problem obtained.

[0131] In this embodiment of the application, after obtaining the alignment label of the new problem, the alignment label is used as the pseudo label of the new problem, and the classifier is retrained. That is, the steps shown in S203 to S206 are executed iteratively, and it is determined whether a new clustering result is generated during clustering, that is, a new cluster is generated. If a new clustering result is generated, training continues until there are no new clusters in the current clustering result. Then, training stops and the alignment label of the new problem obtained is output, that is, the newly discovered problem is output.
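
As a non-limiting illustration, the iteration over steps S203 to S206 can be sketched as a driver loop. The four callables `train`, `extract`, `cluster` and `align` are hypothetical placeholders for the steps described above, and `max_rounds` is an assumed safety limit. The stopping test is read here as "the aligned labels no longer differ from those of the previous round", which is one interpretation of "no new clusters in the current clustering result".

```python
def discover_new_problems(pseudo_labels, train, extract, cluster, align, max_rounds=20):
    """Repeat train -> extract -> cluster -> align until the labels stop changing."""
    for _ in range(max_rounds):
        classifier = train(pseudo_labels)            # S203: fit on current pseudo-labels
        features = extract(classifier)               # S203: current feature vectors
        current = cluster(features, pseudo_labels)   # S204-S205: cluster and merge
        aligned = align(current, pseudo_labels)      # S206: align with previous round
        changed = aligned != pseudo_labels           # assumed reading of "new clusters"
        pseudo_labels = aligned                      # aligned labels become new pseudo-labels
        if not changed:
            break
    return pseudo_labels
```

The value returned corresponds to the alignment labels output in step S207.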

[0132] In short, the embodiments of this application pre-train a classifier based on known problem data, then extract feature vectors of new problem data through the feature extraction layer of the obtained classifier, obtain pseudo-labels through clustering, and then train a classifier on new problem data based on pseudo-labels. After continuous iteration of clustering and classification, a better feature vector of new problem data is obtained, and finally the alignment label of new problem data is obtained.

[0133] The clustering-based new problem discovery method provided in this application involves training a classifier based on existing problems and their corresponding classification labels to obtain a trained classifier. Features are extracted from new problems using the trained classifier, and the extracted feature vectors are clustered to obtain initial clustering results. The classifier is then trained on new problems with pseudo-labels, and the current feature vector of the new problem is obtained based on the trained classifier. The new problem is then clustered based on the current feature vector to obtain the current clustering results. New problems in clusters with confidence levels below a preset value are assigned to neighboring clusters. Alignment labels are obtained from the previous clustering results that align with the pseudo-labels in the current clustering results. These alignment labels are used as new pseudo-labels. This process of using alignment labels as pseudo-labels for new problems and training the classifier is repeated until no new clusters are found in the current clustering results. Because the classifier is pre-trained based on known problem data and iteratively clustered and classified, the feature vector of new problems can be obtained effectively. This solves the technical problem that existing technologies do not make full use of known problem data and do not consider the differences and relationships between new and known problems, resulting in poor clustering performance and difficulty in accurately discovering new problems.

[0134] Based on the above-mentioned clustering-based new problem discovery method, this application embodiment also provides a clustering-based new problem discovery device.

[0135] Figure 3 is a schematic diagram of the structure of the clustering-based new problem discovery device 60 provided in this embodiment of the application. As shown in Figure 3, the device includes:

[0136] The acquisition module 301 is used to train the classifier based on the new question with pseudo-labels, and to obtain the current feature vector of the new question based on the trained classifier.

[0137] Clustering module 302 is used to cluster the new problem based on the current feature vector to obtain the current clustering result. The current clustering result includes the clusters and the index of the clusters. The index of the clusters is the pseudo label of the new problem.

[0138] Alignment module 303 is used to obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result, and use the alignment label as the new pseudo label.

[0139] The execution module 304 is used to iteratively execute the above steps until there are no new clusters in the current clustering results, and then outputs the alignment label of the new problem obtained.

[0140] Optionally, the clustering-based novel problem discovery device 60 also includes a training module 305.

[0141] Optionally, the training module 305 is further configured to train a classifier based on existing problems and corresponding classification labels to obtain a trained classifier; obtain an initial feature vector for a new problem based on the classifier; and cluster the new problem based on the initial feature vector to obtain an initial clustering result, which includes clusters and cluster numbers, the cluster numbers being pseudo-labels of the new problem.

[0142] Optionally, the training module 305 is also used to take the pseudo-label of the new problem as the real label of the classifier; calculate the KL divergence between the real label and the predicted label of the classifier based on the output of the new problem, and iteratively train the classifier based on the KL divergence; when the loss function of the KL divergence converges, the trained classifier is obtained.

[0143] Optionally, the training module 305 is also used to classify the existing problem using a classifier to obtain predicted labels; obtain the cross-entropy loss between the predicted labels and the true labels, and update the classifier parameters based on the cross-entropy loss; and repeatedly execute the process of obtaining the cross-entropy loss between the predicted labels and the true labels until the cross-entropy loss is less than a preset value.

[0144] Optionally, the acquisition module 301 is also used to acquire the confidence level of each cluster based on the preset number of clusters and the number of new questions in the clusters; to assign new questions in clusters with confidence levels lower than the preset value to neighboring clusters and obtain a new number of clusters, which is used to acquire clusters.

[0145] Optionally, the execution module 304 is further configured to obtain the Euclidean distance between the cluster center of the cluster with a confidence level lower than a preset value and the centers of the other clusters; and to assign the cluster with a confidence level lower than the preset value to the cluster with the smallest Euclidean distance.

[0146] Optionally, the execution module 304 is further configured to use the Hungarian algorithm to obtain the mapping relationship between the cluster centers in the current clustering result and the cluster centers in the previous clustering result; and to obtain the alignment label in the previous clustering result that is aligned with the pseudo label in the current clustering result based on the mapping relationship.

[0147] The clustering-based new problem discovery device provided in this embodiment of the application can perform the technical solution of the clustering-based new problem discovery method shown in the embodiments of Figure 2 and Figure 3. Its implementation principle and technical effect are similar and will not be described again here.

[0148] Figure 4 is a schematic diagram of the structure of a clustering-based new problem discovery electronic device provided in an embodiment of this application. As shown in Figure 4, the clustering-based new problem discovery electronic device 40 provided in this embodiment may include:

[0149] Processor 401.

[0150] Memory 402 is used to store executable instructions for the terminal device.

[0151] The processor is configured to execute the technical solution of the above-described clustering-based new problem discovery method embodiment by executing executable instructions. Its implementation principle and technical effect are similar, and will not be repeated here.

[0152] This application also provides a computer-readable storage medium storing a computer program thereon. When the computer program is executed by a processor, it implements the technical solution of the above-described clustering-based new problem discovery method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0153] In one possible implementation, a computer-readable medium may include random access memory (RAM), read-only memory (ROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage or other magnetic storage devices, or any other medium that can carry or store the required program code in the form of instructions or data structures and that is accessible by a computer. Furthermore, any connection is appropriately referred to as a computer-readable medium. For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. As used herein, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0154] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the technical solution of the above-described clustering-based new problem discovery method embodiment. Its implementation principle and technical effect are similar, and will not be repeated here.

[0155] In the specific implementation of the aforementioned terminal device or server, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.

[0156] Those skilled in the art will understand that all or part of the steps in any of the above method embodiments can be implemented by hardware associated with program instructions. The aforementioned program can be stored in a computer-readable storage medium, and when the program is executed, all or part of the steps in the above method embodiments are performed.

[0157] If the technical solution of this application is implemented in software form and sold or used as a product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the technical solution of this application can be embodied in the form of a software product, which is stored in a storage medium and includes a computer program or several instructions. This computer software product causes a computer device (which may be a personal computer, server, network device, or similar electronic device) to execute all or part of the steps of the method described in Embodiment 1 of this application.

[0158] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A clustering-based new problem discovery method, characterized in that it comprises: training a classifier based on a new question with pseudo-labels, and obtaining the current feature vector of the new question based on the trained classifier, wherein training the classifier based on the new question with pseudo-labels comprises: Step 101: using the pseudo-label of the new problem as the real label of the classifier; calculating the KL divergence between the real label and the predicted label output by the classifier for the new problem, and iteratively training the classifier according to the KL divergence; and, when the loss function of the KL divergence converges, obtaining the trained classifier; Step 102: clustering the new problem according to the current feature vector to obtain the current clustering result, wherein the current clustering result includes clusters and cluster numbers, and the cluster number is the pseudo-label of the new problem; Step 103: obtaining the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result, and using the alignment label as the new pseudo-label, wherein the clustering result includes cluster centers, and a cluster center refers to the position of the center point of each cluster; using the Hungarian algorithm to obtain the mapping relationship between the cluster centers in the current clustering result and the cluster centers in the previous clustering result; and, based on the mapping relationship, obtaining the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result; Step 104: iteratively executing the above steps until there are no new clusters in the current clustering result, and outputting the obtained alignment label of the new problem.

2. The method according to claim 1, characterized in that the first new question with pseudo-labels is obtained in the following way: training the classifier based on the existing problem and the corresponding classification label to obtain the trained classifier; obtaining the initial feature vector of the new problem based on the classifier; and clustering the new problem based on the initial feature vector to obtain an initial clustering result, which includes clusters and cluster numbers, the cluster numbers being the pseudo-labels of the new problem.

3. The method according to claim 1, characterized in that, after clustering the new problem based on the current feature vector to obtain the current clustering result, the method further comprises: obtaining the confidence level of each cluster based on the preset number of clusters and the number of new questions in the cluster; and assigning new questions in clusters with confidence levels below a preset value to neighboring clusters to obtain a new number of clusters, the new number of clusters being used to obtain clusters.

4. The method according to claim 3, characterized in that assigning new questions in clusters with confidence levels below the preset value to neighboring clusters comprises: obtaining the Euclidean distance between the cluster center of a cluster with a confidence level below the preset value and the centers of the other clusters; and assigning the cluster with a confidence level below the preset value to the cluster with the smallest Euclidean distance.

5. The method according to claim 2, characterized in that training the classifier based on the existing problem and the corresponding classification label to obtain the trained classifier comprises: classifying the existing problem using the classifier to obtain predicted labels; obtaining the cross-entropy loss between the predicted labels and the true labels, and updating the classifier parameters based on the cross-entropy loss; and repeating the process of obtaining the cross-entropy loss between the predicted labels and the true labels until the cross-entropy loss is less than a preset value.

6. A clustering-based new problem discovery apparatus, used to execute the clustering-based new problem discovery method as described in claim 1, characterized in that it comprises: an acquisition module, used to train a classifier based on a new question with pseudo-labels, and to obtain the current feature vector of the new question based on the trained classifier; the acquisition module being specifically used to use the pseudo-label of the new problem as the real label of the classifier, to calculate the KL divergence between the real label and the predicted label output by the classifier for the new question, to iteratively train the classifier based on the KL divergence, and, when the loss function of the KL divergence converges, to obtain the trained classifier; a clustering module, used to cluster the new problem based on the current feature vector to obtain the current clustering result, which includes the clusters and the indices of the clusters, the index of a cluster being the pseudo-label of the new problem; an alignment module, used to obtain the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result, and to use the alignment label as the new pseudo-label, wherein the clustering result includes cluster centers, and a cluster center refers to the position of the center point of each cluster; the alignment module being specifically used to obtain, using the Hungarian algorithm, the mapping relationship between the cluster centers in the current clustering result and the cluster centers in the previous clustering result, and, based on the mapping relationship, to obtain the alignment label in the previous clustering result that is aligned with the pseudo-label in the current clustering result; and an execution module, used to iteratively execute the above steps until there are no new clusters in the current clustering result, and then to output the obtained alignment label of the new problem.

7. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the method of any one of claims 1-5.

8. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the method of any one of claims 1-5.