Speech transcription text clustering method and apparatus, electronic device, and storage medium
By using vector representation and clustering model training on speech-transcribed text, the problem of low clustering accuracy in existing technologies for speech-transcribed text is solved, achieving more accurate text clustering results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2022-06-21
- Publication Date
- 2026-06-19
AI Technical Summary
Speech-transcribed text produced by automatic speech recognition typically has a high word error rate and disfluent sentences, so existing text clustering methods achieve low clustering accuracy when applied to it.
Vector representations of the speech-transcribed texts are extracted and input into a text clustering model. The model is trained to minimize the distance between vector representations of the same speech-transcribed text, maximize the distance between vector representations of different speech-transcribed texts, minimize the distance between the vector representation of a speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a speech-transcribed text and the semantic vectors of other categories. Distances are measured with cosine similarity, and data augmentation is used to generate the augmented counterpart of each text.
The method clusters speech-transcribed text at both the text and category levels, improving clustering accuracy and mitigating the effect of recognition errors and disfluency on the clustering results.
Figure CN115238068B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of speech transcription technology, and more particularly to a speech transcription text clustering method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the rapid development of Automatic Speech Recognition (ASR) technology, a large number of Chinese ASR speech-transcribed texts have emerged.
[0003] Due to background noise in the recording and limited accuracy of recognition technology, these speech-transcribed texts generally have high word error rates and poor sentence fluency, resulting in poor performance of existing text clustering methods, such as k-means, when directly applied to ASR speech-transcribed texts.
Summary of the Invention
[0004] This invention provides a method, apparatus, electronic device, and storage medium for speech-transcribed text clustering, in order to address the shortcomings of low text clustering accuracy in the prior art.
[0005] This invention provides a method for clustering speech-transcribed text, comprising:
[0006] Extract vector representations of each transcribed speech text;
[0007] The vector representation of each speech transcribed text is input into the text clustering model to obtain the clustering results of each speech transcribed text output by the text clustering model;
[0008] The text clustering model is trained based on the vector representations of multiple sample speech transcribed texts and the clustering results of each sample speech transcribed text. The training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech transcribed text, maximize the distance between the vector representations of different sample speech transcribed texts, minimize the distance between the vector representation of a sample speech transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech transcribed text and the semantic vectors of other categories.
[0009] According to the speech-transcribed text clustering method provided by the present invention, the text clustering model is trained based on the following steps:
[0010] Clustering steps: Based on the current iteration model of the text clustering model, extract the sample vector representation of each sample speech transcribed text, and perform text clustering based on each sample vector representation to obtain the current clustering result of each sample speech transcribed text;
[0011] Vector determination steps: Based on the sample vector representations of the speech-transcribed texts of each sample in the same category in the current clustering results, determine the semantic vectors of each category;
[0012] Training steps: Based on the distance between vector representations of the same sample speech transcribed text, the distance between vector representations of different sample speech transcribed text, the distance between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the distance between the vector representation of the sample speech transcribed text and the semantic vector of other categories, determine the loss value of the current iterative model, and update the parameters of the current iterative model based on the loss value.
[0013] Iterative steps: Using the current iterative model with updated parameters as the current iterative model in the clustering step, the clustering step, the vector determination step, and the training step are executed sequentially until the convergence condition is met, thus obtaining the text clustering model.
[0014] According to a speech-transcribed text clustering method provided by the present invention, the method for determining the loss value of the current iterative model based on the distances between vector representations of speech-transcribed texts of the same sample, the distances between vector representations of speech-transcribed texts of different samples, the distances between the vector representations of a sample speech-transcribed text and the semantic vectors of its class, and the distances between the vector representations of a sample speech-transcribed text and the semantic vectors of other classes includes:
[0015] The contrastive loss value at the text level is determined based on the cosine similarity between the vector representations of the same sample speech transcribed text and the cosine similarity between the vector representations of different sample speech transcribed texts.
[0016] The contrastive loss value at the category level is determined based on the cosine similarity between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the cosine similarity between the vector representation of the sample speech transcribed text and the semantic vectors of other categories.
[0017] The loss value of the current iterative model is determined based on the contrastive loss value at the text level and the contrastive loss value at the category level.
[0018] According to the speech-transcribed text clustering method provided by the present invention, the contrastive loss value at the text level is determined based on the following formula:
[0019]
[0020] where the symbols of the formula denote the contrastive loss value at the text level, the cosine similarity between the vector representations of the same sample speech-transcribed text, and the cosine similarity between the vector representations of different sample speech-transcribed texts; τ denotes the scaling factor of the cosine value, and N denotes the number of samples in a training batch.
[0021] The contrastive loss value at the category level is determined based on the following formula:
[0022]
[0023]
[0024]
[0025]
[0026]
[0027] where the formulas give the contrastive loss value at the category level; sim(s_i, e_c) denotes the cosine similarity between the vector representation s_i of a sample speech-transcribed text and the semantic vector e_c of its category; sim(s_i, e_j) denotes the cosine similarity between the vector representation s_i and the semantic vector e_j of another category; n_c denotes the number of sample speech-transcribed texts in the category corresponding to e_c; n_j denotes the number of sample speech-transcribed texts in the category corresponding to e_j; and α is the smoothing coefficient.
[0028] According to a speech-transcribed text clustering method provided by the present invention, the step of extracting the vector representation of each speech-transcribed text includes:
[0029] Each speech transcribed text is encoded to obtain a set of character encoding vectors for each speech transcribed text;
[0030] The vectors in the character encoding vector set are averaged to obtain the vector representation of each speech transcribed text.
[0031] According to the speech-transcribed text clustering method provided by the present invention, the vector representation of each speech-transcribed text is determined based on the following formula:
[0032]
[0033] where S_i denotes the vector representation of each speech-transcribed text, n denotes the number of characters in each speech-transcribed text, e_cls denotes the encoding vector of the start character of each speech-transcribed text, e_sep denotes the encoding vector of the end character of each speech-transcribed text, and e_j denotes the encoding vector of a character of each speech-transcribed text.
[0034] According to the speech-transcribed text clustering method provided by the present invention, the distance between the vector representations of the same sample speech-transcribed texts is determined based on the following steps:
[0035] Data augmentation is performed on the speech transcribed text of each sample to obtain the augmented text of each sample speech transcribed text, and the vector representation of each augmented text is extracted;
[0036] Based on the vector representation of each sample speech transcribed text and the vector representation of its corresponding augmented text, the distance between the vector representations of the same sample speech transcribed text is determined.
[0037] The present invention also provides a speech-transcribed text clustering device, comprising:
[0038] An extraction unit, used to extract the vector representation of each speech transcribed text;
[0039] The clustering unit is used to input the vector representation of each speech transcribed text into the text clustering model to obtain the clustering result of each speech transcribed text output by the text clustering model;
[0040] The text clustering model is trained based on the vector representations of multiple sample speech transcribed texts and the clustering results of each sample speech transcribed text. The training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech transcribed text, maximize the distance between the vector representations of different sample speech transcribed texts, minimize the distance between the vector representation of a sample speech transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech transcribed text and the semantic vectors of other categories.
[0041] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement any of the above-described speech-transcribed text clustering methods.
[0042] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the speech-transcribed text clustering method described above.
[0043] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the speech-transcribed text clustering method described above.
[0044] The speech-transcribed text clustering method, apparatus, electronic device, and storage medium provided by this invention iteratively update and train a text clustering model with the objectives of minimizing the distance between vector representations of the same sample speech-transcribed text, maximizing the distance between vector representations of different sample speech-transcribed text, minimizing the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximizing the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories. Ultimately, the text clustering model can cluster each speech-transcribed text at both the text and category levels, thereby accurately obtaining the clustering results.
Attached Figure Description
[0045] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0046] Figure 1 is a flowchart of the speech-transcribed text clustering method provided by the present invention;
[0047] Figure 2 is a flowchart of the text clustering model training method provided by the present invention;
[0048] Figure 3 is a schematic structural diagram of the speech-transcribed text clustering device provided by the present invention;
[0049] Figure 4 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Implementation
[0050] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0051] Due to background noise in recordings and the limited accuracy of recognition technology, ASR speech-transcribed texts generally have high word error rates and poor sentence fluency, so existing text clustering methods such as k-means perform poorly when applied to them directly.
[0052] In response, this invention provides a method for clustering speech-transcribed text. Figure 1 is a flowchart of the speech-transcribed text clustering method provided by the present invention. As shown in Figure 1, the method includes the following steps:
[0053] Step 110: Extract the vector representation of each speech transcription text.
[0054] Here, the speech-transcribed text refers to the text that needs to be clustered. This speech-transcribed text can be obtained by speech recognition of recorded audio. The vector representation of each speech-transcribed text is used to characterize the semantic information of each speech-transcribed text, and it can be obtained by encoding each speech-transcribed text.
[0055] Step 120: Input the vector representation of each speech transcribed text into the text clustering model to obtain the clustering results of each speech transcribed text output by the text clustering model;
[0056] The text clustering model is trained based on the vector representations of multiple sample speech transcribed texts and the clustering results of each sample speech transcribed text. The training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech transcribed text, maximize the distance between the vector representations of different sample speech transcribed texts, minimize the distance between the vector representation of a sample speech transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech transcribed text and the semantic vector of other categories.
[0057] Specifically, the distance between the vector representations of the same sample speech-transcribed text characterizes the difference between two versions of the same sample speech-transcribed text. For example, data augmentation is applied to a sample speech-transcribed text to obtain an augmented text, and the distance between the vector representation of the sample speech-transcribed text and that of its augmented text is used as the distance between the vector representations of the same sample speech-transcribed text. The distance between the vector representations of different sample speech-transcribed texts characterizes the difference between two different sample speech-transcribed texts. For example, the distance between the vector representations of any two different sample speech-transcribed texts among the multiple sample speech-transcribed texts is used as this distance. By training the text clustering model to minimize the former distance and maximize the latter, this embodiment of the invention performs contrastive learning at the level of the individual sample speech-transcribed text; that is, at the text level it learns what the versions of the same sample speech-transcribed text share and what distinguishes different sample speech-transcribed texts.
[0058] Furthermore, different sample speech-transcribed texts carry different semantic information. If two sample speech-transcribed texts are semantically similar, they can be clustered into one category; if their semantic similarity is low, they can be assigned to two different categories. For the sample speech-transcribed texts belonging to the same category, a cluster center can be determined from their vector representations and used as the semantic vector of that category. The distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category characterizes how far the text lies from its own category, and the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories characterizes how far it lies from the other categories. By training the text clustering model to minimize the former distance and maximize the latter, this embodiment of the invention performs contrastive learning across categories; that is, at the category level it learns what sample speech-transcribed texts of the same category share and what distinguishes sample speech-transcribed texts of different categories.
[0059] Therefore, the speech-transcribed text clustering method provided in this embodiment of the invention aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed text, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of other categories. This results in an iterative update training of the text clustering model, which is ultimately able to cluster each speech-transcribed text at both the text and category levels, thereby accurately obtaining the clustering results.
[0060] Based on the above embodiments, the text clustering model is trained using the following steps:
[0061] Clustering steps: Based on the current iteration of the text clustering model, extract the sample vector representation of each sample speech transcribed text, and perform text clustering based on each sample vector representation to obtain the current clustering result of each sample speech transcribed text;
[0062] Vector determination steps: Based on the sample vector representations of the speech-transcribed texts of each sample in the same category in the current clustering results, determine the semantic vectors of each category;
[0063] Training steps: Based on the distance between the vector representations of the same sample speech transcribed text, the distance between the vector representations of different sample speech transcribed text, the distance between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the distance between the vector representation of the sample speech transcribed text and the semantic vector of other categories, determine the loss value of the current iteration model, and update the parameters of the current iteration model based on the loss value.
[0064] Iterative steps: Using the current iterative model with updated parameters as the current iterative model in the clustering step, the clustering step, vector determination step, and training step are executed repeatedly until the convergence condition is met, resulting in a text clustering model.
[0065] Specifically, the vector representation of each sample speech transcription text is used to characterize the semantic information of each sample speech transcription text, and it can be obtained by encoding each sample speech transcription text. Based on each sample vector representation, the distance between each sample vector representation can be determined. The larger the distance, the greater the difference between the corresponding sample speech transcription texts, that is, the greater the probability that the corresponding sample speech transcription texts belong to different categories. Conversely, the smaller the distance, the smaller the difference between the corresponding sample speech transcription texts, that is, the greater the probability that the corresponding sample speech transcription texts belong to the same category. Based on this, text clustering based on each sample vector representation can obtain the initial clustering result of each sample speech transcription text. Optionally, the K-means clustering algorithm can be used to cluster each sample vector representation to obtain the current clustering result.
[0066] After obtaining the current clustering results, the sample speech transcription text contained in each category can be obtained. Then, based on the sample vector representation of the sample speech transcription text in each category, the center of each category is determined, and the center of each category is used as the semantic vector of each category.
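The clustering step and the vector determination step can be illustrated with a minimal NumPy sketch. It is not the implementation of the invention: the function names `kmeans` and `semantic_vectors`, the fixed number of iterations, and the initialization from randomly chosen samples are assumptions made for the example, and every category is assumed to contain at least one sample.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain k-means on row vectors x of shape [N, d]; returns labels of shape [N]."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct samples drawn at random.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels

def semantic_vectors(x, labels, k):
    """Semantic vector of each category: the mean of its members' vectors."""
    return np.stack([x[labels == c].mean(axis=0) for c in range(k)])
```

Here the category center is taken to be the arithmetic mean of the member vectors, which is the usual center in k-means; the text above only states that a center is determined from the member vectors.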
[0067] After obtaining the semantic vectors for each category, the loss value of the current iterative model is determined based on the distances between the vector representations of the same sample speech-transcribed text, the distances between the vector representations of different sample speech-transcribed text, the distances between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and the distances between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories. Then, the parameters of the current iterative model are updated based on the loss value, and the updated current iterative model is used as the current iterative model in the clustering step. The clustering step, vector determination step, and training step are repeated sequentially, enabling the obtained text clustering model to cluster each speech-transcribed text at both the text and category levels, thereby accurately obtaining the clustering results. The convergence condition can be that the accuracy of the text clustering model reaches a threshold or the number of training iterations reaches a preset number; however, this embodiment of the invention does not specifically limit this.
[0068] It should be noted that after each parameter update of the current iteration model based on the loss value, the current iteration model with updated parameters is used as the current iteration model in the clustering step, so that the clustering-training loop is executed repeatedly. This allows the current iteration model to build on what it learned in the previous training iterations, thereby continuously improving its clustering accuracy.
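The control flow of the clustering-training loop described above can be sketched as follows. The sketch only shows the order of the steps; the encoder, the clustering routine, and the parameter update are passed in as callables whose names and signatures (`encode`, `cluster`, `update`) are hypothetical. Stopping after a preset number of rounds follows the text; the additional stop on a small change in loss is an assumption of the example.

```python
import numpy as np
from typing import Callable

def train_clustering_model(
    params,
    texts,
    encode: Callable,   # (params, texts) -> [N, d] sample vector representations
    cluster: Callable,  # ([N, d] vectors) -> integer labels of shape [N]
    update: Callable,   # (params, texts, labels, centers) -> (new_params, loss)
    max_rounds: int = 10,
    tol: float = 1e-4,
):
    prev_loss = np.inf
    for _ in range(max_rounds):
        # Clustering step: encode with the current iteration model, then cluster.
        vecs = encode(params, texts)
        labels = cluster(vecs)
        # Vector determination step: one semantic vector (center) per category.
        cats = np.unique(labels)
        centers = np.stack([vecs[labels == c].mean(axis=0) for c in cats])
        # Training step: compute the loss and update the model parameters.
        params, loss = update(params, texts, labels, centers)
        # Assumed convergence check: stop when the loss barely changes.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return params
```

Because the updated parameters are fed back into `encode` on the next round, each round re-clusters with the representations learned so far.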
[0069] Based on any of the above embodiments, the loss value of the current iterative model is determined based on the distance between vector representations of the same sample speech-transcribed text, the distance between vector representations of different sample speech-transcribed text, the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories, including:
[0070] The contrastive loss value at the text level is determined based on the cosine similarity between the vector representations of the same sample speech transcribed text and the cosine similarity between the vector representations of different sample speech transcribed texts.
[0071] The contrastive loss value at the category level is determined based on the cosine similarity between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the cosine similarity between the vector representation of the sample speech transcribed text and the semantic vectors of other categories.
[0072] The loss value of the current iterative model is determined based on the contrastive loss value at the text level and the contrastive loss value at the category level.
[0073] Specifically, the cosine similarity between the vector representations of the same sample speech transcribed text is used to characterize the distance between the vector representations of the same sample speech transcribed text, and the cosine similarity between the vector representations of different sample speech transcribed text is used to characterize the distance between the vector representations of different sample speech transcribed text. Based on the two, the text-level contrast loss value can be obtained.
[0074] Furthermore, the cosine similarity between the vector representation of the sample speech transcription text and the semantic vector of its class is used to characterize the distance between the vector representation of the sample speech transcription text and the semantic vector of its class, and the cosine similarity between the vector representation of the sample speech transcription text and the semantic vectors of other classes is used to characterize the distance between the vector representation of the sample speech transcription text and the semantic vectors of other classes. Based on these two, a class-level contrastive loss value can be obtained.
[0075] Finally, based on the contrastive loss values at the text level and the contrastive loss values at the category level, the loss value of the text clustering model is determined. For example, the contrastive loss values at the text level and the contrastive loss values at the category level are weighted and added together to obtain the loss value of the current iteration model.
[0076] Based on any of the above embodiments, the text-level contrast loss value is determined using the following formula:
[0077]
[0078] where the symbols of the formula denote the contrastive loss value at the text level, the cosine similarity between the vector representations of the same sample speech-transcribed text, and the cosine similarity between the vector representations of different sample speech-transcribed texts; τ denotes the scaling factor of the cosine value, which avoids vanishing gradients during training; and N denotes the number of samples in a training batch. The text-level contrastive loss aims to pull each sample speech-transcribed text closer in the feature space to the positive sample generated from it, and to push it away from the other sample speech-transcribed texts.
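A text-level contrastive loss matching the quantities defined above can be sketched in NumPy. The exact formula of the invention is not available in this text, so the sketch assumes the common InfoNCE form: for each sample, the negative log of the softmax over cosine similarities divided by τ, with the augmented text of the same sample as the positive and the augmented texts of the other samples in the batch as negatives. The function names and the default τ = 0.05 are illustrative.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between rows of a ([N, d]) and b ([M, d])."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def text_level_loss(s, s_aug, tau=0.05):
    """Assumed InfoNCE-style loss; s and s_aug have shape [N, d]."""
    sim = cosine_sim_matrix(s, s_aug) / tau          # [N, N], scaled by tau
    sim = sim - sim.max(axis=1, keepdims=True)       # stabilize the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Diagonal entries pair each sample with its own augmented text.
    return -np.mean(np.diag(log_prob))
```

Under this assumed form, the loss is near zero when each sample is most similar to its own augmented text and grows when it is more similar to the augmented texts of other samples.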
[0079] The contrastive loss value at the category level is determined based on the following formula:
[0080]
[0081]
[0082]
[0083]
[0084]
[0085] where the formulas give the contrastive loss value at the category level; sim(s_i, e_c) denotes the cosine similarity between the vector representation s_i of a sample speech-transcribed text and the semantic vector e_c of its category; sim(s_i, e_j) denotes the cosine similarity between the vector representation s_i and the semantic vector e_j of another category; n_c denotes the number of sample speech-transcribed texts in the category corresponding to e_c; n_j denotes the number of sample speech-transcribed texts in the category corresponding to e_j; and α is a smoothing coefficient that prevents the clustering looseness φ_c from tending toward positive infinity. φ_c and φ_j are the clustering looseness of the corresponding categories and measure how reliable the semantic vectors of those categories are. Specifically, the smaller the clustering looseness, the more concentrated the texts of that category are in the feature space, and the better the semantic vector of that category can be taken to represent the semantics of the category.
[0086] The looser a category's clustering, the more loss its semantic vector contributes. The goal of the category-level contrastive loss is to bring the vector representation of each sample speech-transcribed text closer in the feature space to the semantic vector of its own category and farther from the semantic vectors of other categories, so that the algorithm better achieves the goal of clustering text according to semantics.
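A category-level loss using the quantities named above (s_i, e_c, e_j, n_c, n_j, α, φ_c, φ_j) can be sketched as follows. The formulas themselves are not available in this text, so the sketch borrows an assumed form from prototypical contrastive learning: φ_c is the summed distance of a category's members to its semantic vector divided by n_c · log(n_c + α), and the loss is a softmax over sim(s_i, e_j) / φ_j across all categories. All names and the default α are illustrative, and every category is assumed to be non-empty with non-zero looseness.

```python
import numpy as np

def looseness(x, labels, centers, alpha=10.0):
    """Assumed looseness: phi_c = sum_i ||s_i - e_c|| / (n_c * log(n_c + alpha))."""
    phi = np.empty(len(centers))
    for c in range(len(centers)):
        members = x[labels == c]
        n_c = len(members)
        dist = np.linalg.norm(members - centers[c], axis=1).sum()
        phi[c] = dist / (n_c * np.log(n_c + alpha))
    return phi

def category_level_loss(x, labels, centers, alpha=10.0):
    """Assumed ProtoNCE-style loss over cosine similarities scaled by phi_j."""
    phi = looseness(x, labels, centers, alpha)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    logits = (xn @ cn.T) / phi[None, :]                  # [N, K]
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick, for every sample, the log-probability of its own category.
    return -np.mean(log_prob[np.arange(len(x)), labels])
```

With this assumed form, the term log(n_c + α) keeps the denominator away from zero for a category with a single member when α > 0, which is consistent with the stated role of α as a smoothing coefficient.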
[0087] Optionally, after obtaining the contrastive loss values at the text level and the contrastive loss values at the category level, the loss value of the text clustering model can be determined based on the following formula:
[0088]
[0089] where L represents the loss value of the text clustering model and λ is a parameter that balances the contrastive loss value at the text level and the contrastive loss value at the category level. Finally, the Adam optimizer (an adaptive variant of stochastic gradient descent) can be used to minimize the loss function and train the encoder used to obtain the text representation, thereby obtaining the values of the trainable parameters.
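The combination of the two loss values and a single Adam update can be sketched as follows. The placement of λ on the category-level term is an assumption, since the text only states that λ balances the two losses. The update rule is the standard Adam step with its usual default hyperparameters; the function names are illustrative.

```python
import numpy as np

def total_loss(l_text, l_cat, lam=1.0):
    # Assumption: lambda weights the category-level term.
    return l_text + lam * l_cat

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count, m and v the running moments."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    new_param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_param, m, v
```

In practice the gradient would come from backpropagating the total loss through the encoder; the sketch takes the gradient as an input so that it stays self-contained.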
[0090] Based on any of the above embodiments, the vector representation of each speech transcribed text is extracted, including:
[0091] Each speech transcribed text is encoded to obtain a set of character encoding vectors for each speech transcribed text;
[0092] The vectors in the character encoding vector set are averaged to obtain the vector representation of each speech transcribed text.
[0093] Specifically, when encoding each speech-transcribed text, a pre-trained language model such as BERT or RoBERTa can be used to obtain the character encoding vector set E = {e_cls, e_0, e_1, …, e_{n-1}, e_sep} for each speech-transcribed text, where each vector in the set is an encoding vector obtained by encoding the speech-transcribed text. Typically, e_cls is used as the vector representation of the speech-transcribed text. In clustering tasks, however, no classification layer is attached downstream of e_cls, so e_cls cannot effectively capture semantic information. This embodiment of the invention therefore does not use e_cls; instead, it applies mean pooling to all vectors in the character encoding vector set to extract the vector representation of each speech-transcribed text.
[0094] It should be noted that since the semantic information of speech-transcribed text, such as sentiment, is usually determined by the sentiment of most characters in the speech-transcribed text, this embodiment of the invention uses the mean pooling method to extract the vector representation of the speech-transcribed text. This allows texts with similar semantic content in the corpus to have more similar text representations in the vector space. Furthermore, the mean pooling layer can effectively mitigate the impact of a small number of erroneous words in the speech-transcribed text on the classification results, thereby further improving the accuracy of the clustering results.
[0095] Based on any of the above embodiments, the vector representation of each speech-transcribed text is determined based on the following formula:
[0096]
[0097] Here, S_i represents the vector representation of each speech-transcribed text, n represents the number of characters in each speech-transcribed text, e_cls represents the encoding vector of the start character of each speech-transcribed text, e_sep represents the encoding vector of the end character of each speech-transcribed text, and e_j represents a character encoding vector of each speech-transcribed text.
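The mean pooling step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the average is taken over all n + 2 vectors of the set (start vector, n character vectors, end vector); the function name `mean_pool` and the toy values are hypothetical and stand in for real encoder output.

```python
import numpy as np

def mean_pool(char_vectors: np.ndarray) -> np.ndarray:
    """Average all encoding vectors (e_cls, e_0..e_{n-1}, e_sep) into one text vector."""
    if char_vectors.ndim != 2 or char_vectors.shape[0] == 0:
        raise ValueError("expected a non-empty (num_vectors, hidden_dim) array")
    return char_vectors.mean(axis=0)

# Toy encoder output for a 3-character text: rows are e_cls, e_0, e_1, e_2, e_sep.
E = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 1.0],
    [1.0, 1.0],
    [1.0, 2.0],
])
S_i = mean_pool(E)

# Simulate one misrecognized character: its vector changes sharply,
# but the pooled representation moves by only 1/(n + 2) of that change.
E_noisy = E.copy()
E_noisy[2] = [1.0, 6.0]
S_noisy = mean_pool(E_noisy)
shift = np.linalg.norm(S_noisy - S_i)
```

In this toy case the corrupted character vector moves by 5 units while the pooled vector moves by 1, which illustrates why averaging dampens the effect of a small number of transcription errors.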
[0098] Based on any of the above embodiments, the distance between the vector representations of the same sample speech-transcribed text is determined based on the following steps:
[0099] Data augmentation is performed on the speech transcribed text of each sample to obtain the augmented text of each sample speech transcribed text, and the vector representation of each augmented text is extracted;
[0100] Based on the vector representations of the speech transcribed texts of each sample and the corresponding vector representations of the enhanced texts, the distance between the vector representations of the speech transcribed texts of the same sample is determined.
[0101] Specifically, by exploiting the randomness of the dropout mechanism in deep learning models, the same sample speech-transcribed text is encoded a second time to obtain another vector representation of that text, which serves as the vector representation of the enhanced text. The vector representation of the enhanced text and the vector representation of the corresponding sample speech-transcribed text then form a positive sample pair. The distance between the two vector representations in this positive sample pair is the distance between the vector representations of the same sample speech-transcribed text.
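The dropout-based positive pair and the text-level contrastive loss built on it can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the one-layer `encode_with_dropout` function is a hypothetical stand-in for a BERT-style encoder, the scaling factor is assumed to divide the cosine similarity (named `tau` here), and the loss takes the common in-batch form in which the other texts of the batch act as negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_with_dropout(x, weight, drop_prob, rng):
    """Hypothetical one-layer encoder: inverted dropout on the input, then a linear map."""
    keep_mask = rng.random(x.shape) >= drop_prob
    return (x * keep_mask / (1.0 - drop_prob)) @ weight

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T

def text_level_contrastive_loss(first_pass, second_pass, tau=0.05):
    """In-batch contrastive loss: row i of each pass is a positive pair, other rows are negatives."""
    sim = cosine_similarity_matrix(first_pass, second_pass) / tau
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# A toy batch of 4 texts, each already turned into a 16-dimensional feature vector.
features = rng.normal(size=(4, 16))
weight = rng.normal(size=(16, 8))

# Encoding the same batch twice gives two different views because the dropout masks differ.
first_pass = encode_with_dropout(features, weight, drop_prob=0.1, rng=rng)
second_pass = encode_with_dropout(features, weight, drop_prob=0.1, rng=rng)
loss = text_level_contrastive_loss(first_pass, second_pass)
```

Minimizing this loss raises the cosine similarity between the two encodings of the same text and lowers it between encodings of different texts, which corresponds to the first two training objectives stated in the description.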
[0102] Based on any of the above embodiments, the present invention also provides a method for training a text clustering model. As shown in Figure 2, the method includes the following steps:
[0103] First, sample speech-transcribed texts are collected and preprocessed. Next, an encoder is used to extract a set of character encoding vectors from each sample speech-transcribed text, and the mean of all vectors in the set is taken as the vector representation of that sample speech-transcribed text. The encoder can be trained based on a pre-trained BERT model.
[0104] Subsequently, based on the vector representations of the sample speech-transcribed texts, the K-means clustering algorithm is used to cluster the sample speech-transcribed texts, obtaining the current clustering result. After obtaining the current clustering result, the text-level contrastive loss value is determined based on the cosine similarity between the vector representations of the same sample speech-transcribed texts and the cosine similarity between the vector representations of different sample speech-transcribed texts. The category-level contrastive loss value is determined based on the cosine similarity between the vector representation of the sample speech-transcribed text and the semantic vector of its category, as well as the cosine similarity between the vector representation of the sample speech-transcribed text and the semantic vector of other categories. Then, based on the text-level contrastive loss value and the category-level contrastive loss value, the loss value of the current iterative model is determined.
[0105] Next, backpropagation is performed based on the loss value of the current iteration model to update the current iteration model. After the update is complete, clustering is performed again to obtain a new current clustering result. Based on the new current clustering result, the current iteration model is trained and its parameters are updated, and finally the trained text clustering model is obtained.
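The clustering step and the semantic vector determination step of one training round can be sketched with a small NumPy K-means. Several points are assumptions made for illustration only: the semantic vector of each category is taken to be the mean of its member vectors, the initial centres are chosen by a farthest-first rule so the example is deterministic, and `toy_encode` is a hypothetical stand-in for the BERT-based encoder. The loss computation and the backpropagation update between rounds are omitted.

```python
import numpy as np

def kmeans(vectors, k, num_iters=20):
    """Plain Lloyd's K-means with deterministic farthest-first initialisation."""
    centers = [vectors[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(vectors - c, axis=1) for c in centers], axis=0)
        centers.append(vectors[np.argmax(dist)])
    centers = np.array(centers, dtype=float)

    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(num_iters):
        # Assign every text vector to its nearest centre.
        distances = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centre as the mean of its members (empty clusters keep their centre).
        for c in range(k):
            members = vectors[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels, centers

def toy_encode(text):
    """Hypothetical encoder: counts of 'a' and 'b' stand in for a learned text vector."""
    return np.array([text.count("a"), text.count("b")], dtype=float)

def clustering_round(encode, texts, k):
    """Clustering step plus vector determination step for the current encoder."""
    vectors = np.stack([encode(t) for t in texts])
    labels, semantic_vectors = kmeans(vectors, k)
    return vectors, labels, semantic_vectors

texts = ["aaa", "aab", "bbb", "bba"]
vectors, labels, semantic_vectors = clustering_round(toy_encode, texts, k=2)
```

In a full training loop, the labels and semantic vectors returned here would feed the category-level loss, the encoder would be updated, and `clustering_round` would be called again with the updated encoder until the convergence condition is met.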
[0106] The speech-transcribed text clustering device provided by the present invention is described below. The device described below and the speech-transcribed text clustering method described above may be read with reference to each other.
[0107] Based on any of the above embodiments, the present invention also provides a speech-transcribed text clustering device. As shown in Figure 3, the device includes:
[0108] Extraction unit 310 is used to extract vector representations of each speech transcription text;
[0109] Clustering unit 320 is used to input the vector representation of each speech transcribed text into the text clustering model to obtain the clustering result of each speech transcribed text output by the text clustering model;
[0110] The text clustering model is trained based on the vector representations of multiple sample speech transcribed texts and the clustering results of each sample speech transcribed text. The training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech transcribed text, maximize the distance between the vector representations of different sample speech transcribed texts, minimize the distance between the vector representation of a sample speech transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech transcribed text and the semantic vectors of other categories.
[0111] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 4, the electronic device may include a processor 410, a memory 420, a communication interface 430, and a communication bus 440. The processor 410, memory 420, and communication interface 430 communicate with each other via the communication bus 440. The processor 410 can call logical instructions in the memory 420 to execute a speech-transcribed text clustering method. This method includes: extracting vector representations of each speech-transcribed text; inputting the vector representations of each speech-transcribed text into a text clustering model to obtain the clustering results of each speech-transcribed text output by the text clustering model; the text clustering model is trained based on the vector representations of multiple sample speech-transcribed texts and the clustering results of each sample speech-transcribed text. The training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed texts, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories.
[0112] Furthermore, the logical instructions in the aforementioned memory 420 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0113] On the other hand, the present invention also provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, when the program instructions are executed by a computer, the computer is able to execute the speech-transcribed text clustering method provided by the above methods, the method including: extracting vector representations of each speech-transcribed text; inputting the vector representations of each speech-transcribed text into a text clustering model to obtain the clustering results of each speech-transcribed text output by the text clustering model; the text clustering model is trained based on the vector representations of multiple sample speech-transcribed texts and the clustering results of each sample speech-transcribed text, the training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed texts, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of other categories.
[0114] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the aforementioned speech-transcribed text clustering methods. The method includes: extracting vector representations of each speech-transcribed text; inputting the vector representations of each speech-transcribed text into a text clustering model to obtain the clustering results of each speech-transcribed text output by the text clustering model; the text clustering model is trained based on the vector representations of multiple sample speech-transcribed texts and the clustering results of each sample speech-transcribed text, and the training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed texts, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of other categories.
[0115] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0116] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0117] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A speech-transcribed text clustering method, characterized in that it includes: extracting a vector representation of each speech-transcribed text, the vector representation of each speech-transcribed text being used to characterize the semantic information of that speech-transcribed text; inputting the vector representation of each speech-transcribed text into the text clustering model to obtain the clustering result of each speech-transcribed text output by the text clustering model; wherein the text clustering model is trained based on the vector representations of multiple sample speech-transcribed texts and the clustering results of each sample speech-transcribed text, and the training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed texts, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories.
The text clustering model is trained based on the following steps: Clustering steps: Based on the current iteration model of the text clustering model, extract the sample vector representation of each sample speech transcribed text, and perform text clustering based on each sample vector representation to obtain the current clustering result of each sample speech transcribed text; Vector determination steps: Based on the sample vector representations of the speech-transcribed texts of each sample in the same category in the current clustering results, determine the semantic vectors of each category; Training steps: Based on the distance between vector representations of the same sample speech transcribed text, the distance between vector representations of different sample speech transcribed text, the distance between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the distance between the vector representation of the sample speech transcribed text and the semantic vector of other categories, determine the loss value of the current iterative model, and update the parameters of the current iterative model based on the loss value. Iterative steps: Using the current iterative model with updated parameters as the current iterative model in the clustering step, the clustering step, the vector determination step, and the training step are executed sequentially until the convergence condition is met, and the text clustering model is obtained. 
The extraction of the vector representation of each speech-transcribed text includes: encoding each speech-transcribed text to obtain a set of character encoding vectors for each speech-transcribed text; and averaging the vectors in the character encoding vector set to obtain the vector representation of each speech-transcribed text; the vector representation of each speech-transcribed text is determined based on the following formula: [formula]; where S_i represents the vector representation of each speech-transcribed text, n represents the number of characters in each speech-transcribed text, e_cls represents the encoding vector of the start character of each speech-transcribed text, e_sep represents the encoding vector of the end character of each speech-transcribed text, and e_j represents a character encoding vector of each speech-transcribed text.
2. The speech-transcribed text clustering method according to claim 1, characterized in that determining the loss value of the current iterative model based on the distance between vector representations of the same sample speech-transcribed text, the distance between vector representations of different sample speech-transcribed texts, the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories includes: determining the text-level contrastive loss value based on the cosine similarity between the vector representations of the same sample speech-transcribed text and the cosine similarity between the vector representations of different sample speech-transcribed texts; determining the category-level contrastive loss value based on the cosine similarity between the vector representation of the sample speech-transcribed text and the semantic vector of its category, as well as the cosine similarity between the vector representation of the sample speech-transcribed text and the semantic vectors of other categories; and determining the loss value of the current iterative model based on the text-level contrastive loss value and the category-level contrastive loss value.
3. The speech-transcribed text clustering method according to claim 2, characterized in that the text-level contrastive loss value is determined based on the following formula: [formula]; where the symbols of the formula represent, respectively, the contrastive loss value at the text level, the cosine similarity between the vector representations of the same sample speech-transcribed text, the cosine similarity between the vector representations of different sample speech-transcribed texts, the scaling factor of the cosine value, and the number of samples in a training batch; the contrastive loss value at the category level is determined based on the following formulas: [formulas]; where the loss is the contrastive loss value at the category level, sim(s_i, e_c) represents the cosine similarity between the vector representation s_i of the sample speech-transcribed text and the semantic vector e_c of its category, sim(s_i, e_j) represents the cosine similarity between the vector representation s_i of the sample speech-transcribed text and the semantic vector e_j of another category, n_c represents the number of sample speech-transcribed texts in the category corresponding to semantic vector e_c, n_j represents the number of sample speech-transcribed texts in the category corresponding to semantic vector e_j, and α is the smoothing coefficient.
4. The speech-transcribed text clustering method according to any one of claims 1 to 3, characterized in that, The distance between the vector representations of the same sample speech transcribed texts is determined based on the following steps: Data augmentation is performed on the speech transcribed text of each sample to obtain the augmented text of each sample speech transcribed text, and the vector representation of each augmented text is extracted; Based on the vector representation of each sample speech transcribed text and the corresponding vector representation of each enhanced text, the distance between the vector representations of the same sample speech transcribed text is determined.
5. A speech-transcribed text clustering device, characterized in that it includes: an extraction unit, used to extract a vector representation of each speech-transcribed text, the vector representation of each speech-transcribed text being used to characterize the semantic information of that speech-transcribed text; and a clustering unit, used to input the vector representation of each speech-transcribed text into the text clustering model to obtain the clustering result of each speech-transcribed text output by the text clustering model; wherein the text clustering model is trained based on the vector representations of multiple sample speech-transcribed texts and the clustering results of each sample speech-transcribed text, and the training of the text clustering model aims to minimize the distance between the vector representations of the same sample speech-transcribed text, maximize the distance between the vector representations of different sample speech-transcribed texts, minimize the distance between the vector representation of a sample speech-transcribed text and the semantic vector of its category, and maximize the distance between the vector representation of a sample speech-transcribed text and the semantic vectors of other categories.
The text clustering model is trained based on the following steps: Clustering steps: Based on the current iteration model of the text clustering model, extract the sample vector representation of each sample speech transcribed text, and perform text clustering based on each sample vector representation to obtain the current clustering result of each sample speech transcribed text; Vector determination steps: Based on the sample vector representations of the speech-transcribed texts of each sample in the same category in the current clustering results, determine the semantic vectors of each category; Training steps: Based on the distance between vector representations of the same sample speech transcribed text, the distance between vector representations of different sample speech transcribed text, the distance between the vector representation of the sample speech transcribed text and the semantic vector of its category, and the distance between the vector representation of the sample speech transcribed text and the semantic vector of other categories, determine the loss value of the current iterative model, and update the parameters of the current iterative model based on the loss value. Iterative steps: Using the current iterative model with updated parameters as the current iterative model in the clustering step, the clustering step, the vector determination step, and the training step are executed sequentially until the convergence condition is met, and the text clustering model is obtained. 
The extraction of the vector representation of each speech-transcribed text includes: encoding each speech-transcribed text to obtain a set of character encoding vectors for each speech-transcribed text; and averaging the vectors in the character encoding vector set to obtain the vector representation of each speech-transcribed text; the vector representation of each speech-transcribed text is determined based on the following formula: [formula]; where S_i represents the vector representation of each speech-transcribed text, n represents the number of characters in each speech-transcribed text, e_cls represents the encoding vector of the start character of each speech-transcribed text, e_sep represents the encoding vector of the end character of each speech-transcribed text, and e_j represents a character encoding vector of each speech-transcribed text.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the speech-transcribed text clustering method as described in any one of claims 1 to 4.
7. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the speech-transcribed text clustering method as described in any one of claims 1 to 4.