A topic modeling method, device, electronic equipment and computer readable storage medium

By combining acoustic and semantic topic modeling methods, and utilizing feature extraction models and density clustering algorithms, the problem of inaccurate topic modeling in existing technologies is solved, achieving more accurate document classification and target topic determination.

CN117216012B · Active · Publication Date: 2026-06-26 · CHINA UNICOM (GUANGDONG) IND INTERNET CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA UNICOM (GUANGDONG) IND INTERNET CO LTD
Filing Date: 2023-09-08
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing topic modeling techniques lack acoustic and semantic understanding of documents, so their modeling results often fail to reflect a document's true topic content.

Method used

By combining acoustic and semantic understanding, feature vectors of speech and text data are extracted using a feature extraction model and then fused. A density clustering algorithm is used to cluster documents, and word frequency analysis is combined to determine the target topic.

Benefits of technology

It improves the accuracy of document classification and the quality and interpretability of target topics, making the identified target topics closer to the true topics of the documents and enhancing the accuracy of topic representation.


Figure CN117216012B_ABST

Abstract

Embodiments of the present application relate to a topic modeling method and device, electronic equipment and computer readable storage medium. The method comprises: obtaining a plurality of document data; extracting features of voice data and text data included in first document data by a feature extraction model to obtain acoustic feature vectors and text feature vectors, and fusing the acoustic feature vectors and the text feature vectors to obtain a first acoustic semantic vector corresponding to the first document data; the first document data is any document data; clustering the plurality of document data according to the first acoustic semantic vectors corresponding to the plurality of document data to obtain a plurality of document categories; and determining a target topic corresponding to each document category according to the document data included in each document category. The topic modeling method, device, electronic equipment and computer readable storage medium can accurately determine the target topic of each document category by combining acoustic and semantic understanding.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to a topic modeling method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] Companies in the internet and telecommunications sectors generate massive amounts of document data daily during production and user interactions. Due to the sheer volume of data, high-value information is easily buried under a flood of useless data, making it difficult to acquire and effectively utilize. Topic modeling techniques can uncover commonalities from these massive documents, structurally categorize similar documents, and extract corresponding topics. Furthermore, topic modeling can indirectly reflect the changing trends of massive amounts of text data over time, aiding in the analysis of data patterns and internal relationships. Therefore, how to more accurately extract document topics has become a current research hotspot.

Summary of the Invention

[0003] This application discloses a topic modeling method, apparatus, electronic device, and computer-readable storage medium that can accurately determine the target topic for each document category by combining acoustic and semantic understanding.

[0004] In a first aspect, embodiments of this application disclose a topic modeling method, including:

[0005] Acquire multiple document data, the document data including text data and corresponding voice data;

[0006] Extract features from the speech data and text data included in the first document data using a feature extraction model to obtain acoustic feature vectors and text feature vectors, and fuse the acoustic feature vectors and text feature vectors to obtain the first acoustic semantic vector corresponding to the first document data; the first document data can be any of the document data mentioned above.

[0007] Based on the first acoustic semantic vectors corresponding to the multiple document data respectively, the multiple document data are clustered to obtain multiple document categories;

[0008] Based on the document data contained in each document category, determine the target topic corresponding to each document category.

[0009] As an optional implementation, in a first aspect of this application, the step of clustering the plurality of document data according to the first acoustic semantic vectors corresponding to the plurality of document data respectively to obtain a plurality of document categories includes:

[0010] The first acoustic semantic vectors corresponding to the multiple document data are mapped from the first vector space to the second vector space to obtain the second acoustic semantic vectors corresponding to the multiple document data; the dimension of the first vector space is larger than that of the second vector space.

[0011] The second acoustic semantic vectors corresponding to the multiple document data are clustered according to the density clustering algorithm to obtain multiple document categories, and the multiple document categories correspond one-to-one with the multiple clusters obtained by clustering.

[0012] As an optional implementation, in a first aspect of the embodiments of this application, determining the target topic corresponding to each document category based on the document data contained in each document category includes:

[0013] All document data contained in the first document category are treated as a single long document; the first document category can be any of the document categories described above.

[0014] The target topic corresponding to the first document category is determined based on the word frequency of each word contained in the long document corresponding to the first document category.

[0015] As an optional implementation, in a first aspect of the embodiments of this application, after determining the target topic corresponding to each of the document categories, the method further includes:

[0016] Calculate the similarity between target topics corresponding to any two current document categories;

[0017] Two document categories with a similarity greater than a similarity threshold are merged, and the target topics corresponding to the merged document categories are redefined to obtain updated document categories and target topics corresponding to each document category.

[0018] The updated document categories are used as the new current document categories. The step of calculating the similarity between target topics corresponding to any two current document categories is repeated until the similarity between target topics corresponding to any two current document categories is no greater than the similarity threshold.
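
The merge loop in [0016] to [0018] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not say how topic similarity is measured or how a merged topic is redefined, so the Jaccard overlap of topic words, the `threshold` value, and recomputing the topic as the most frequent words are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_words(docs, k=3):
    """Return the k most frequent words across a category's tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def jaccard(topic_a, topic_b):
    """Assumed similarity measure: overlap of the two topic word sets."""
    a, b = set(topic_a), set(topic_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_similar_categories(categories, threshold=0.4, k=3):
    """Merge categories until no pair of topics is more similar than threshold."""
    categories = [list(docs) for docs in categories]
    while True:
        # Redefine the topic of every current category.
        topics = [top_words(docs, k) for docs in categories]
        pair = next(
            ((i, j) for i, j in combinations(range(len(categories)), 2)
             if jaccard(topics[i], topics[j]) > threshold),
            None,
        )
        if pair is None:
            return categories, topics
        i, j = pair
        categories[i] = categories[i] + categories[j]  # merge category j into i
        del categories[j]
```

Each pass removes one category, so the loop always terminates; it stops as soon as every remaining pair of topics has a similarity no greater than the threshold.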

[0019] As an optional implementation, in a first aspect of the embodiments of this application, the training process of the feature extraction model includes:

[0020] Load the model parameters obtained from pre-training using a sample corpus set to construct a pre-trained feature extraction model;

[0021] Acquire multiple sample speech data and the corresponding sample text data for each of the sample speech data;

[0022] The pre-trained feature extraction model calculates a first loss based on the sample speech data, calculates a second loss based on the sample speech data and the corresponding sample text data, and determines a target loss based on the first loss and the second loss corresponding to the sample speech data. The parameters of the pre-trained feature extraction model are adjusted according to the gradient descent direction of the target loss until the model convergence condition is met, thus obtaining the trained feature extraction model.

[0023] As an optional implementation, in a first aspect of the embodiments of this application, the step of calculating a first loss based on the sample speech data and calculating a second loss based on the sample speech data and the corresponding sample text data includes:

[0024] The speech recognition module of the pre-trained feature extraction model performs speech recognition on the sample speech data to obtain the transcription labels corresponding to the sample speech data;

[0025] Calculate the first loss based on the sample speech data and the corresponding transcription labels;

[0026] The sample speech data and the corresponding sample text data are fused to obtain fused data;

[0027] The fused data is subjected to feature extraction using the pre-trained feature extraction model to obtain a feature vector of the fused data, and a second loss is calculated based on the feature vector.

[0028] As an optional implementation, in a first aspect of the embodiments of this application, determining the target loss based on the first loss and the corresponding second loss of the sample speech data includes:

[0029] The second loss is multiplied by the maximum path length to obtain the first value; the maximum path length is the maximum path length when the speech recognition module generates the transcription label corresponding to the sample speech data;

[0030] The target loss is obtained by weighted summation of the first value and the first loss.
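
Written out, the two steps in [0029] and [0030] amount to the expression below. The weight symbols \lambda_1 and \lambda_2 are notation assumed here for illustration; the source states only that the sum is weighted and gives no values.

```latex
\mathcal{L}_{\text{target}} = \lambda_1 \, \mathcal{L}_1 + \lambda_2 \, \bigl( T_{\max} \cdot \mathcal{L}_2 \bigr)
```

Here \mathcal{L}_1 is the first loss, \mathcal{L}_2 is the second loss, and T_{\max} is the maximum path length when the speech recognition module generates the transcription label, so T_{\max} \cdot \mathcal{L}_2 is the "first value" of [0029].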

[0031] In a second aspect, embodiments of this application disclose a topic modeling apparatus, including:

[0032] The data acquisition module is used to acquire multiple document data, including text data and voice data corresponding to the text data;

[0033] The feature extraction module is used to extract features from the speech data and text data included in the first document data using a feature extraction model, to obtain acoustic feature vectors and text feature vectors, and to fuse the acoustic feature vectors and text feature vectors to obtain a first acoustic semantic vector corresponding to the first document data; the first document data is any of the document data mentioned above.

[0034] The clustering module is used to cluster the multiple document data according to the first acoustic semantic vectors corresponding to the multiple document data respectively, so as to obtain multiple document categories;

[0035] The topic determination module is used to determine the target topic corresponding to each document category based on the document data contained in each document category.

[0036] In a third aspect, embodiments of this application disclose an electronic device, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to implement the method described in any of the above embodiments.

[0037] In a fourth aspect, embodiments of this application disclose a computer-readable storage medium that stores a computer program which, when executed by a processor, implements the method described in any of the above embodiments.

[0038] The topic modeling method, apparatus, electronic device, and computer-readable storage medium disclosed in this application acquire multiple document data, the document data including text data and corresponding speech data. A feature extraction model extracts features from the speech data and text data included in the first document data to obtain acoustic feature vectors and text feature vectors, which are then fused to obtain a first acoustic semantic vector corresponding to the first document data; the first document data can be any of the document data. The multiple document data are clustered according to their respective first acoustic semantic vectors to obtain multiple document categories, and the target topic corresponding to each document category is determined from the document data that the category contains. In the embodiments of this application, the feature extraction model extracts both acoustic and text features from the document data. Clustering on the first acoustic semantic vector, which fuses the two, mines and combines the acoustic and semantic understanding in the documents, so document classification becomes more accurate and the quality and interpretability of the determined target topics improve. The target topics are therefore closer to the true topics of the documents, and the target topic of each document category is represented more accurately.

Attached Figure Description

[0039] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0040] Figure 1 is a flowchart of a topic modeling method in one embodiment;

[0041] Figure 2 is a flowchart of a topic modeling method in another embodiment;

[0042] Figure 3 is a flowchart of topic optimization in one embodiment;

[0043] Figure 4 is a flowchart of the training process of a feature extraction model in one embodiment;

[0044] Figure 5 is a flowchart of generating the target loss in one embodiment;

[0045] Figure 6 is a block diagram of a topic modeling apparatus in one embodiment;

[0046] Figure 7 is a structural block diagram of an electronic device in one embodiment.

Detailed Implementation

[0047] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0048] It should be noted that the terms "comprising" and "having," and any variations thereof, in the embodiments and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units listed, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.

[0049] It is understood that the terms "first," "second," etc., used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first loss may be referred to as a second loss, and similarly, a second loss may be referred to as a first loss. Both the first loss and the second loss are loss values, but they are not the same loss values.

[0050] In related technologies, commonly used topic modeling techniques generally employ probabilistic statistical methods, where the probability index of a word's appearance in a document is used as its importance weight, and the top few words with higher importance weights are taken as the document's topic. This type of statistical method lacks an understanding of the document at the acoustic and semantic levels, struggles to handle issues such as polysemy or tonal variations, and generates low-value topic information, making it difficult for the modeling results to accurately reflect the document's true topic content.

[0051] This application discloses a topic modeling method, apparatus, electronic device, and computer-readable storage medium that can mine and combine acoustic and semantic understanding in documents, making document classification more accurate and improving the quality and interpretability of the identified target topics, and accurately determining the target topics of each document category.

[0052] As shown in Figure 1, in one embodiment, a topic modeling method is provided, which can be applied to electronic devices, including but not limited to mobile phones, smart wearable devices, tablets, and PCs (Personal Computers). The method may include the following steps:

[0053] Step 110: Acquire multiple document data.

[0054] The document data includes text data and corresponding voice data. Optionally, the document data can be call documents, voice dialogue documents, etc., including text data and voice data, and the text data and voice data have a corresponding relationship.

[0055] Topic modeling is a document modeling technique in the field of natural language processing used for merging similar document content and extracting features. It is an unsupervised machine learning method that essentially aims to structure text data for subsequent tasks such as data comparison, querying, and information mining.

[0056] In some embodiments, a read function can be used to read multiple document data from memory, including text data and corresponding audio data. For large amounts of document data, preprocessing is required before use to convert the document data into data that can be processed by a computer.

[0057] In some embodiments, preprocessing of document data may include, but is not limited to, text cleaning, word segmentation, vectorization, and speech conversion. Text cleaning may include filtering out excessively short or invalid text data, or sorting text data in ascending or descending order of length. Text cleaning can be achieved by reading a dictionary of the user's common words and stop words and using it to remove special symbols and stop words from the text data. Removing stop words can improve the quality of the text data. The dictionary can be one found in various online databases, or one built from the user's own usage.
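
As a rough illustration of the text cleaning described above, the sketch below removes special symbols and stop words, drops texts that end up too short, and sorts the rest by length. The stop-word list, the `min_words` cutoff, and the function names are hypothetical; a real system would load its dictionary from a file.

```python
import re

# Hypothetical stop-word list used only for this example.
STOP_WORDS = {"the", "a", "is", "um"}


def clean_text(text, stop_words=STOP_WORDS):
    """Replace special symbols with spaces, then drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [word for word in text.split() if word not in stop_words]


def clean_corpus(texts, min_words=2):
    """Clean every text, drop those that are too short, and sort by length."""
    cleaned = [clean_text(t) for t in texts]
    kept = [words for words in cleaned if len(words) >= min_words]
    return sorted(kept, key=len)
```

For example, `clean_text("Um, the bill is WRONG!!")` returns `["bill", "wrong"]`.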

[0058] In some embodiments, word segmentation of text data may include word segmentation methods based on a vocabulary (or dictionary). Word segmentation methods based on a vocabulary include forward maximum matching method (FMM), backward maximum matching method (BMM), bidirectional scanning method, etc. Word segmentation methods based on statistical models (HMM and n-grams) may also be used, or the Jieba word segmentation tool may be used directly to segment the text data.
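
Of the vocabulary-based methods listed above, forward maximum matching is the simplest to show. The sketch below is a minimal version under the usual assumptions: the vocabulary is a plain set of strings, and any character not covered by the vocabulary becomes a single-character token. The function name and the sample vocabulary are illustrative only.

```python
def forward_max_match(text, vocab, max_len=None):
    """Segment text by greedily taking the longest vocabulary word at each position."""
    if max_len is None:
        max_len = max([len(word) for word in vocab] + [1])
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)  # a single character is the fallback
                i += size
                break
    return tokens
```

With the vocabulary `{"研究", "研究生", "生命", "起源"}`, the input `"研究生命起源"` is segmented as `["研究生", "命", "起源"]`. The greedy left-to-right choice of `研究生` is the kind of error that the backward and bidirectional variants are meant to reduce.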

[0059] In some embodiments, a tokenizer can be used to vectorize the segmented text data, transforming the string-type text data into an input data type supported by the feature extraction model, enabling the text data to be processed on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit). A tokenizer is a typical vectorization tool that converts text data into integer data. Its core task is to convert what is typically considered a word (a single Chinese character or phrase is considered a word) into a positive integer, essentially turning a text into a sequence.
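
The word-to-integer mapping described above can be illustrated with a toy tokenizer. This is a sketch only: real tokenizers shipped with pre-trained models use fixed vocabularies and subword rules, and the class name, the id assignment order, and the handling of unknown words here are assumptions made for the example.

```python
class SimpleTokenizer:
    """Maps each word to a positive integer id; id 0 is left unused for padding."""

    def __init__(self):
        self.word_to_id = {}

    def fit(self, segmented_texts):
        for words in segmented_texts:
            for word in words:
                # Ids start at 1 and are assigned in order of first appearance.
                self.word_to_id.setdefault(word, len(self.word_to_id) + 1)
        return self

    def encode(self, words, unk_id=None):
        # Unknown words share one id just past the end of the vocabulary.
        unk = len(self.word_to_id) + 1 if unk_id is None else unk_id
        return [self.word_to_id.get(word, unk) for word in words]
```

After fitting on `[["bill", "refund"], ["bill", "late"]]`, encoding `["late", "bill", "network"]` gives `[3, 1, 4]`, where 4 is the id assumed for out-of-vocabulary words.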

[0060] For the preprocessing of speech data corresponding to text data, in some embodiments, the acoustic processing module commonly found in neural networks is typically used to process the audio signal corresponding to the speech data, extract spectral features from the audio signal, and enable the feature extraction model to extract the acoustic features therein.

[0061] Step 120: Extract features from the speech data and text data included in the first document data using a feature extraction model to obtain acoustic feature vectors and text feature vectors, and fuse the acoustic feature vectors and text feature vectors to obtain the first acoustic semantic vector corresponding to the first document data; the first document data can be any document data.

[0062] In some embodiments, the feature extraction model can be obtained by training on massive amounts of sample document data, or by optimizing an initial neural network through self-supervised learning. It is capable of extracting features from the input speech and text data, generating corresponding acoustic feature vectors and text feature vectors. The model architecture of the feature extraction model may include, but is not limited to, RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network), and Transformer models.

[0063] The electronic device can input the speech data and text data included in the first document data into a feature extraction model for feature extraction, thereby extracting acoustic feature vectors and text feature vectors corresponding to the first document data. The acoustic feature vector contains the hidden state corresponding to each token (tag) in the speech data, and the text feature vector contains the hidden state corresponding to each token in the text data. The acoustic feature vector and the text feature vector are fused to obtain the first acoustic semantic vector corresponding to the first document data. The first acoustic semantic vector integrates features at both the acoustic and semantic levels.

[0064] In some embodiments, to ensure that all hidden states of the acoustic and text feature vectors are evenly distributed in the semantic space, and that the hidden states of similar documents are close to each other, the feature extraction model can first fuse the acoustic and text feature vectors, then perform average pooling on the fused vector, and finally use the average-pooled vector as the first acoustic semantic vector, so that the first acoustic semantic vector contains the acoustic and semantic information of the entire document. Optionally, average pooling can be performed by averaging the word vectors contained in the fused vector to obtain a mean vector, which is then used as the first acoustic semantic vector. Average pooling takes the information of all word vectors in the feature vector into account and better reflects the average response of the entire feature vector.
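
Average pooling as described above reduces a sequence of per-token vectors to one vector by taking the mean in each dimension. A minimal sketch, assuming the fused vector is available as a list of equal-length lists of floats (the function name is illustrative):

```python
def mean_pool(token_vectors):
    """Average a sequence of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(token_vectors)
    # zip(*...) walks the sequence one dimension at a time.
    return [sum(column) / n for column in zip(*token_vectors)]
```

For three token vectors `[1.0, 2.0]`, `[3.0, 4.0]`, and `[5.0, 6.0]`, the pooled vector is `[3.0, 4.0]`.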

[0065] In some embodiments, acoustic feature vectors and text feature vectors can be fused using a point-by-point addition method. If the acoustic feature vectors and text feature vectors have the same dimension, the vector elements at the same position can be directly added to obtain the first acoustic semantic vector. If the acoustic feature vectors and text feature vectors have different dimensions, they can be transformed into vectors of the same dimension through linear transformation, and then the vector elements at the same position can be added to obtain the first acoustic semantic vector. Optionally, the acoustic feature vectors and text feature vectors can also be fused using vector concatenation. The acoustic feature vectors and text feature vectors can be directly concatenated. Furthermore, based on linear mapping, the concatenated vector can be mapped to the same dimension as the acoustic feature vectors and text feature vectors to obtain the first acoustic semantic vector. It should be noted that the methods for fusing acoustic feature vectors and text feature vectors are not limited to the two methods mentioned above; other fusion methods can also be used, which are not limited here.
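
The two fusion options above can be sketched on plain Python lists. In a trained model the linear transformation would be a learned weight matrix; here `projection` is simply passed in by the caller, and all function names are hypothetical.

```python
def linear_map(vector, weights):
    """Apply a linear transformation; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fuse_by_addition(acoustic, text, projection=None):
    """Point-by-point addition; project the acoustic vector first if sizes differ."""
    if len(acoustic) != len(text):
        if projection is None:
            raise ValueError("a projection is required when dimensions differ")
        acoustic = linear_map(acoustic, projection)
    return [a + t for a, t in zip(acoustic, text)]


def fuse_by_concatenation(acoustic, text, projection=None):
    """Concatenate the vectors, then optionally map the result to a target size."""
    joined = list(acoustic) + list(text)
    return linear_map(joined, projection) if projection is not None else joined
```

Addition keeps the dimension unchanged and treats both modalities symmetrically, while concatenation preserves each modality's values separately and leaves the mixing to the linear map that follows.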

[0066] Step 130: Cluster the multiple document data according to the first acoustic semantic vectors corresponding to the multiple document data to obtain multiple document categories.

[0067] Clustering is the process of dividing a dataset into different classes or clusters based on a specific criterion (such as distance), maximizing the similarity of data objects within the same cluster while maximizing the dissimilarity of data objects in different clusters. Commonly used clustering algorithms include k-means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering points to identify the clustering structure), and hierarchical clustering algorithms.

[0068] The acoustic and textual feature vectors for each document are calculated separately using a feature extraction model. These vectors are then fused to obtain a first acoustic semantic vector for each document. Based on this first acoustic semantic vector, cluster analysis is performed on all documents to obtain multiple document categories. Each document category corresponds one-to-one with a cluster, and each category contains one or more documents. Documents within each category can be considered to be of the same or similar type.

[0069] Step 140: Determine the target topic corresponding to each document category based on the document data contained in each document category.

[0070] By clustering similar document data into the same document category, all document data contained in the same document category can correspond to the same target topic.

[0071] In some embodiments, each document category can be input into an LDA (Latent Dirichlet Allocation) model for topic prediction to obtain the target topic corresponding to each document category. Alternatively, the frequency of each word in the document data contained in each document category can be calculated separately, and the words with the highest frequency can be used as the target topic of the document category.

[0072] In this embodiment, an acoustic feature vector and a text feature vector are extracted using a feature extraction model. Then, a first acoustic semantic vector, which integrates the acoustic and text feature vectors, is used for topic clustering. By mining and combining acoustic and semantic understanding within the document, the quality and interpretability of the topics are improved. The clustering method is then used to obtain the target topic corresponding to the document, making the target topic closer to the document's true topic, thereby improving the accuracy of document topic representation.

[0073] As shown in Figure 2, in one embodiment, a topic modeling method is provided, which can be applied to the above-described electronic device. The method may include the following steps:

[0074] Step 202: Acquire multiple document data.

[0075] The document data includes text data and corresponding speech data.

[0076] Step 204: Extract features from the speech data and text data included in the first document data using a feature extraction model to obtain acoustic feature vectors and text feature vectors, and fuse the acoustic feature vectors and text feature vectors to obtain the first acoustic semantic vector corresponding to the first document data; the first document data can be any document data.

[0077] The descriptions of steps 202 to 204 can be found in the descriptions of steps 110 to 120 in the above embodiments, and will not be repeated here.

[0078] Step 206: Map the first acoustic semantic vectors corresponding to the multiple document data from the first vector space to the second vector space to obtain the second acoustic semantic vectors corresponding to the multiple document data; the dimension of the first vector space is greater than that of the second vector space.

[0079] The first acoustic semantic vector refers to the vector obtained by fusing acoustic feature vectors and text feature vectors in the first vector space; the second acoustic semantic vector refers to the vector obtained by mapping the first acoustic semantic vector from the first vector space to the lower-dimensional second vector space.

[0080] In some embodiments, dimensionality reduction techniques can be used to map the first acoustic semantic vectors corresponding to multiple document data from a higher-dimensional first vector space to a lower-dimensional second vector space, thereby obtaining the second acoustic semantic vectors corresponding to the multiple document data. Commonly used dimensionality reduction techniques include PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and Isomap (Isometric Mapping).
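
To make the mapping from the first vector space to the second concrete, here is a small PCA sketch in pure Python. It is illustrative only: it finds the leading principal components by power iteration, which is one of several ways to compute PCA, and the function name, iteration count, and seed are assumptions. A production system would normally use a numerical library instead.

```python
import math
import random


def pca_reduce(vectors, n_components, iterations=200, seed=0):
    """Project vectors onto their top principal components (power iteration)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    centered = [[v[d] - mean[d] for d in range(dim)] for v in vectors]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(row[i] * row[j] for row in centered) / max(n - 1, 1)
            for j in range(dim)] for i in range(dim)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iterations):
            vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            # Remove directions already found so the next component is orthogonal.
            for comp in components:
                dot = sum(a * b for a, b in zip(vec, comp))
                vec = [a - dot * b for a, b in zip(vec, comp)]
            norm = math.sqrt(sum(a * a for a in vec))
            if norm == 0.0:
                break  # no variance left in the remaining directions
            vec = [a / norm for a in vec]
        components.append(vec)
    # Coordinates of each centered vector in the lower-dimensional space.
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]
```

Calling `pca_reduce(vectors, n_components=2)` on a list of higher-dimensional vectors returns one 2-dimensional vector per input. The sign of each component is arbitrary, so only relative positions in the reduced space are meaningful.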

[0081] In some embodiments, UMAP (Uniform Manifold Approximation and Projection) can be used to analyze the key structure of the first acoustic semantic vector, and then map the first acoustic semantic vector to a low-dimensional vector space while preserving the high-dimensional local features. UMAP is a dimensionality reduction technique that assumes that the available data samples are uniformly distributed in the topological space, and can approximate and map these finite data samples to a low-dimensional space. The main steps are: learning the manifold structure in the high-dimensional space and finding the low-dimensional representation of the manifold. Preserving the high-dimensional local structure while reducing the dimensionality of the first acoustic semantic vector can mitigate the impact of excessively high-dimensional feature vectors on clustering results.

[0082] Step 208: Cluster the second acoustic semantic vectors corresponding to multiple document data according to the density clustering algorithm to obtain multiple document categories, and the multiple document categories correspond one-to-one with the multiple clusters obtained by clustering.

[0083] In the second vector space, the second acoustic semantic vectors, which integrate the acoustic feature vectors and text feature vectors, are clustered with the density clustering algorithm to divide the document data into multiple document categories. Each document category represents a different topic and contains multiple document data, and the multiple document categories correspond one-to-one with the multiple clusters obtained by clustering.

[0084] Density clustering algorithms determine the cluster structure based on the density of sample distribution. They use the number of points in a certain neighborhood as a standard for connectivity and continuously expand the clusters based on this connectivity to obtain the final clustering result. They can handle irregularly shaped classes, making the clustering of multiple document data more flexible.

[0085] DBSCAN is a typical density-based clustering algorithm, and it is unsupervised, meaning that it does not use pre-labeled targets to cluster data points. It can be used in place of popular clustering algorithms such as k-means and hierarchical clustering. DBSCAN does not require the number of clusters to be specified, is robust to outliers (it marks them as noise rather than forcing them into a cluster), and works well on clusters of arbitrary shape and size.

[0086] In some embodiments, using DBSCAN to cluster the second acoustic semantic vectors can discover clusters of arbitrary shape, which increases the flexibility of document clustering and groups document data into document categories more accurately, thereby improving the accuracy of document topic representation.
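
The neighborhood-and-expansion idea behind DBSCAN can be sketched in a few lines of pure Python. This is a teaching version with a quadratic neighbor search, not the implementation used in the application; the parameter names `eps` and `min_pts` follow common usage, and their values would have to be tuned for real second acoustic semantic vectors.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels
```

Points that keep the label -1 are outliers that belong to no document category; every other label identifies one cluster, and therefore one document category.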

[0087] In some specific embodiments, after the second acoustic semantic vectors corresponding to the multiple document data are clustered according to a density clustering algorithm to obtain multiple document categories, a first average distance between the document data within each document category and a second average distance between each document category and its neighboring document categories can be calculated. A silhouette coefficient is then calculated from the first and second average distances to evaluate the clustering performance. Neighboring document categories are the other document categories whose clusters lie near the current document category after clustering.

[0088] The silhouette coefficient can be calculated using formula (1) to evaluate the clustering performance of the clustering results:

[0089] $$\delta = \frac{b - a}{\max(a, b)} \tag{1}$$

[0090] Where δ represents the silhouette coefficient, a represents the first average distance, and b represents the second average distance. The silhouette coefficient takes values in [-1, 1]. The closer the documents of the same document category are to each other, and the farther away they are from documents of different document categories, the higher the silhouette coefficient for that document category.
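As a hedged illustration of this evaluation, the sketch below computes the standard silhouette coefficient, (b - a) / max(a, b), for every clustered point and averages the results. The function name `silhouette`, the Euclidean distance, the use of the nearest other category as the neighboring category, and the convention of skipping points labeled -1 are assumptions, not details taken from this application.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over clustered points (label -1 is skipped)."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two document categories are required")

    def mean_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)

    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)       # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same category
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the members of the nearest other category
            b = min(mean_dist(p, other)
                    for other_lab, other in clusters.items() if other_lab != lab)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical vectors: two compact, well separated categories.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
score = silhouette(points, [0, 0, 0, 1, 1, 1])
```

A score close to 1 indicates compact, well separated categories; a score near 0 or below suggests that the clustering parameters may need adjustment.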

[0091] Step 210: Combine all document data contained in the first document category into a single long document; the first document category can be any document category.

[0092] Density clustering algorithms identify documents belonging to the same document category that share similar second acoustic semantic vectors, meaning they have similar acoustic and textual feature vectors. This suggests that the topics of these documents are likely similar. To facilitate identifying the topics of each document category, multiple documents belonging to the same category are merged into a single long document.

[0093] In some embodiments, all document data contained in the first document category can be sorted according to the sequence length of the document data and concatenated one by one to obtain a long document.

[0094] Step 212: Determine the target topic corresponding to the first document category based on the word frequencies of each word contained in the long document corresponding to the first document category.

[0095] As a specific implementation method, the first document category after clustering is treated as a single long document. Based on the term frequencies of each word contained in the long document corresponding to the first document category, c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) is applied to the long document corresponding to the first document category to obtain the most representative words in the first document category. These most representative words are then used as the target topic corresponding to the first document category. The most representative words refer to words with high frequencies in the first document category but low frequencies in other document categories besides the first document category.

[0096] In some specific embodiments, the word frequency of the i-th word in the long documents corresponding to the first document category is divided by the total word frequency of the i-th word in all long documents corresponding to all document categories to obtain a first quotient. The total number of documents is then divided by the sum of word frequencies, and the logarithm of the quotient is taken to obtain a second quotient. The topic probability of the i-th word in the long documents corresponding to the first document category is obtained by multiplying the first and second quotients. Here, the total number of documents refers to the total number of document data, the sum of word frequencies refers to the sum of word frequencies of the j-th word that is different from the i-th word in the long documents corresponding to the first document category, and the topic probability refers to the likelihood that a word is the target topic of the corresponding document category. After obtaining the topic probabilities of all words in the long documents corresponding to the first document category, the word with the highest topic probability can be taken as the target topic of the first document category.

[0097] Specifically, the topic probability of the i-th word in the long document corresponding to the first document category can be calculated using formula (2):

[0098] $$P_i = \frac{t_i}{w_i} \times \log\frac{m}{\sum_{j}^{n} t_j} \tag{2}$$

[0099] Where $P_i$ represents the topic probability of the i-th word, $t_i$ represents the word frequency of the i-th word in the long document corresponding to the first document category, $w_i$ represents the total word frequency of the i-th word across the long documents corresponding to all document categories, m and n represent the total number of document data and the total number of document categories, respectively, and $t_j$ represents the word frequency of the j-th word, which is different from the i-th word.

[0100] c-TF-IDF is a class-based variant of TF-IDF, a weighting technique commonly used in information retrieval and text mining. The importance of a word increases with its frequency in the current long document and decreases with its frequency across all long documents. The main idea is that if a word appears frequently in one long document but rarely in the others, it has good category-discriminating ability and is suitable for classification.
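The idea behind Steps 210 and 212 can be sketched as follows. Note that this sketch swaps in a commonly used class-based TF-IDF weighting, score = (frequency of the word in the category / number of words in the category) × log(1 + average words per category / frequency of the word over all categories), rather than formula (2) of this application. The function names, whitespace tokenization, and toy documents are likewise assumptions made for illustration.

```python
import math
from collections import Counter

def class_tfidf(categories):
    """categories: {name: [document strings]} -> {name: {word: score}}."""
    # Step 210: merge the documents of each category into one long document,
    # sorted by length (shortest first) before concatenation.
    long_docs = {
        name: " ".join(sorted(docs, key=len)).lower().split()
        for name, docs in categories.items()
    }
    tf = {name: Counter(words) for name, words in long_docs.items()}
    total = Counter()                    # frequency of each word over all categories
    for counts in tf.values():
        total.update(counts)
    avg_len = sum(len(words) for words in long_docs.values()) / len(long_docs)

    scores = {}
    for name, counts in tf.items():
        n_words = sum(counts.values())
        scores[name] = {
            word: (count / n_words) * math.log(1 + avg_len / total[word])
            for word, count in counts.items()
        }
    return scores

def target_topic(scores, name, k=1):
    """Step 212: the k highest scoring words of a category (ties broken alphabetically)."""
    ranked = sorted(scores[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical transcripts grouped into two document categories.
categories = {
    "A": ["network outage", "the network is down", "network latency on the core router"],
    "B": ["the invoice is late", "invoice refund", "billing invoice for the month"],
}
scores = class_tfidf(categories)
```

Here `target_topic(scores, "A")` selects the word that is frequent in category A yet comparatively rare elsewhere, which is the property the paragraph above describes.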

[0101] In this embodiment, a feature extraction model is used to extract features from speech and text data to obtain the acoustic and text feature vectors of a single document. These two feature vectors are then fused to obtain a first acoustic semantic vector. Next, a dimensionality reduction technique compresses the fused first acoustic semantic vector from a high-dimensional space to a low-dimensional space. On this basis, a density clustering algorithm groups similar document data, resulting in a series of clustered document categories. Finally, based on the word frequencies corresponding to each document category, the most representative words in each category are extracted as the target topic of that document category. This solves the problem that traditional topic modeling techniques lack acoustic and semantic understanding and therefore have difficulty representing document topics accurately. By mining and combining the acoustic and semantic understanding contained in documents, document classification becomes more accurate, the quality and interpretability of the determined target topics increase, and the target topics come closer to the true topics of the documents, so that the target topic of each document category is determined accurately and the accuracy of the target topic representation is improved.

[0102] As shown in Figure 3, in one embodiment, a topic optimization method is provided, which can be applied after the topic modeling method described above. The method may include the following steps:

[0103] Step 302: Calculate the similarity between target topics corresponding to any two current document categories.

[0104] In some embodiments, a topic vector can be formed by combining the target topics corresponding to all current document categories. Each element in the topic vector corresponds to the target topic corresponding to each current document category, and the similarity between the elements in the topic vector is calculated.

[0105] In some embodiments, the similarity between the word frequency vectors corresponding to any two current document categories can be calculated based on the word frequency vectors generated when determining the target topic corresponding to each current document category. Each word frequency vector consists of the most frequent words in the corresponding document category, for example the top five or top ten; the specific number of words is not limited here. Methods for calculating the similarity between two word frequency vectors include, but are not limited to, cosine similarity, the Pearson correlation coefficient, the Jaccard similarity coefficient, and log-likelihood similarity.

[0106] Step 304: Merge two document categories with a similarity greater than the similarity threshold, and redetermine the target topic corresponding to the merged document category to obtain updated document categories and target topics corresponding to each document category.

[0107] The similarity threshold refers to the maximum similarity allowed between the target topics of any two current document categories, with a value in the range [0, 1]. This similarity threshold can be a factory-set threshold for the electronic device or a threshold set manually during the device's use. Specifically, the similarity threshold can be 0.4 (40%). That is, if the similarity between the target topics corresponding to two current document categories is greater than 40%, the two document categories are considered similar and are merged; if the similarity is less than or equal to 40%, the two document categories are considered dissimilar and are not merged.

[0108] By comparing the similarity between the target topics corresponding to any two current document categories with the similarity threshold, document categories with similar target topics (i.e., those with similarity greater than the similarity threshold) are merged to obtain multiple new document categories, and the target topics corresponding to the new multiple document categories are recalculated.

[0109] Step 306: Take the updated multiple document categories as the new current document categories, and re-execute the step of calculating the similarity between the target topics corresponding to any two current document categories until the similarity between the target topics corresponding to any two current document categories is no greater than the similarity threshold.

[0110] In some embodiments, the steps of calculating the similarity between target topics corresponding to any two current document categories, merging two document categories with similarity greater than a similarity threshold, and redetermining the target topics corresponding to the merged document categories can be repeated to obtain updated multiple document categories and target topics corresponding to each document category, until the similarity between target topics corresponding to any two current document categories is less than or equal to the similarity threshold.

[0111] In this embodiment, the similarity between the target topics corresponding to any two current document categories is compared with a similarity threshold, and document categories with similar target topics are merged to obtain multiple new document categories. The target topics corresponding to these new document categories are then recalculated. The similarity calculation and the comparison with the similarity threshold are repeated to update the document categories until no two current document categories have similar target topics. Document topic optimization is thus achieved by adjusting document categories based on similarity, making the target topics closer to the true topics of the documents, thereby accurately determining the target topic of each document category and improving the accuracy of the determined target topic representation.
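Steps 302 to 306 can be sketched as a simple loop. This is a minimal illustration, assuming that each category is represented by a word frequency vector of its top-k words, that cosine similarity is the chosen measure, and that the most similar pair is merged first; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_words(docs, k=5):
    """Word frequency vector built from the k most frequent words of a category."""
    counts = Counter(" ".join(docs).lower().split())
    return Counter(dict(counts.most_common(k)))

def merge_similar(categories, threshold=0.4, k=5):
    """categories: list of lists of document strings; returns the merged categories."""
    categories = [list(c) for c in categories]
    while True:
        vectors = [top_words(c, k) for c in categories]      # Step 302
        best = None
        for i in range(len(categories)):
            for j in range(i + 1, len(categories)):
                sim = cosine(vectors[i], vectors[j])
                if sim > threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:                 # no pair exceeds the threshold: stop
            return categories
        _, i, j = best
        merged = categories.pop(j)       # Step 304: merge category j into category i
        categories[i].extend(merged)     # Step 306: loop again with updated categories

# Hypothetical categories: the first two share most of their frequent words.
result = merge_similar([
    ["network outage network"],
    ["network outage router"],
    ["invoice refund invoice"],
])
```

After merging, the target topic of the merged category would be redetermined from its combined long document, for example with the c-TF-IDF procedure of Step 212.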

[0112] As shown in Figure 4, in one embodiment, a training process for a feature extraction model is provided, which can be applied to the topic modeling method described above. The training process may include the following steps:

[0113] Step 402: Load the model parameters obtained by pre-training through the sample corpus set to construct the pre-trained feature extraction model.

[0114] In some embodiments, a pre-trained feature extraction model can be constructed by initializing a neural network based on model parameters trained on a large-scale Chinese corpus. The large-scale Chinese corpus can be downloaded from publicly available online databases such as THUCNews and THUOCL (THU Open Chinese Lexicon). The downloaded sample data is input into the initialized neural network for training and initial adaptive parameter adjustments to obtain the model parameters of the pre-trained feature extraction model. These model parameters are then substituted into the neural network model framework to obtain the pre-trained feature extraction model.

[0115] Step 404: Obtain multiple sample speech data and the corresponding sample text data for each sample speech data.

[0116] Step 406: Using the pre-trained feature extraction model, calculate the first loss based on the sample speech data, calculate the second loss based on the sample speech data and the corresponding sample text data, determine the target loss based on the first loss and the second loss corresponding to the sample speech data, and adjust the parameters of the pre-trained feature extraction model according to the gradient descent direction of the target loss until the model convergence condition is met, thus obtaining the trained feature extraction model.

[0117] In simple terms, Gradient Descent (GD) is a method for finding the minimum of an objective function. It uses gradient information to adjust parameters iteratively, moving along the direction in which the objective decreases, which can make the iterative solution of model parameters more efficient. The adjusted parameters may include the weights of the feature extraction model.

[0118] In some embodiments, the parameters of the pre-trained feature extraction model can be updated according to the gradient descent direction of the target loss to obtain model parameters that satisfy the model convergence condition, and the feature extraction model can be constructed based on the model parameters. Model parameters that satisfy the model convergence condition may include those whose target loss value no longer decreases after two or more consecutive model parameter updates.
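The parameter update and convergence condition described above can be sketched as follows. This is a toy example on a quadratic function, not the training code of the feature extraction model; the learning rate `lr`, the tolerance `tol`, and the `patience` of two consecutive non-improving updates are assumed values chosen for illustration.

```python
def gradient_descent(loss, grad, params, lr=0.1, patience=2, tol=1e-12, max_steps=10_000):
    """Update params along the negative gradient until the loss stops decreasing."""
    best = loss(params)
    stalled = 0                          # consecutive updates without improvement
    for _ in range(max_steps):
        params = [p - lr * g for p, g in zip(params, grad(params))]
        current = loss(params)
        if current < best - tol:
            best, stalled = current, 0
        else:
            stalled += 1
            if stalled >= patience:      # convergence condition is met
                break
    return params, best

# Toy objective standing in for the target loss: minimum at (3, -1).
loss = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2
grad = lambda p: [2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0)]
params, final_loss = gradient_descent(loss, grad, [0.0, 0.0])
```

In practice the gradient of the target loss would be obtained by automatic differentiation in a deep learning framework rather than written by hand.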

[0119] In some embodiments, as shown in Figure 5, the steps of calculating a first loss based on the sample speech data, calculating a second loss based on the sample speech data and the corresponding sample text data, and determining the target loss based on the first loss and the corresponding second loss for the sample speech data may include the following:

[0120] Step 502: The speech recognition module of the pre-trained feature extraction model performs speech recognition on the sample speech data to obtain the transcription labels corresponding to the sample speech data.

[0121] The speech recognition module utilizes Automatic Speech Recognition (ASR) technology, aiming to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. Transcription is one of the fundamental functions of the speech recognition module; essentially, it's statistical pattern recognition that converts audio files into text data.

[0122] In some embodiments, the speech recognition module comprises four main parts: feature extraction, acoustic model, language model, and dictionary and decoding. To extract features more effectively, preprocessing such as filtering and framing of the sample speech data is often required to extract the signal to be analyzed from the original signal. Then, feature extraction converts the sound signal from the time domain to the frequency domain, providing suitable feature vectors for the acoustic model. The acoustic model then calculates the score of each feature vector based on acoustic features. The language model, based on linguistic theories, calculates the probability of possible word sequences corresponding to the sound signal. Finally, using an existing dictionary, the word sequences are decoded to obtain the final possible text representation, which is the transcription label $q_t$ corresponding to the sample speech data. The relationship between the acoustic model and the language model can be expressed by Bayes' theorem.

[0123] Step 504: Calculate the first loss based on the sample speech data and the corresponding transcription labels.

[0124] As a specific implementation method, when the speech recognition module performs speech recognition on the sample speech data, it calculates the probability $P(q_t \mid X)$ of every possible transcription label $q_t$ corresponding to the sample speech data. Based on the path lengths corresponding to all possible transcription labels, the probabilities of the transcription labels corresponding to the same path length are multiplied together, and the products obtained for each path length are then summed to obtain the first loss corresponding to the pre-trained feature extraction model.

[0125] Specifically, the first loss $L_{ASR}$ corresponding to the pre-trained feature extraction model can be calculated using formula (3):

[0126]

[0127] Where Q is the set of all valid CTC paths and $q_t \in Q$, T represents the length of a single path, X and Y represent the sample speech data and its corresponding transcription label sequence, respectively, and B represents the mapping relationship of the CTC path.

[0128] Connectionist Temporal Classification (CTC) is a method that avoids manual alignment of input and output. It is suitable for applications such as speech recognition or OCR (Optical Character Recognition). Given an input sequence X and corresponding label data Y, such as audio files and text files in speech recognition, CTC finds a mapping from X to Y. For a given input sequence X, CTC provides the output distribution of all possible Y.
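To make the roles of the paths, the mapping B, and the per-frame probabilities concrete, the sketch below evaluates the standard CTC objective by brute force: it enumerates every path of length T, keeps those that B maps to the target Y, multiplies the per-frame probabilities along each path, sums the products, and takes the negative logarithm. The negative logarithm follows the standard CTC formulation and is an assumption here, as are the function names, the blank symbol `_`, and the toy probabilities. Exhaustive enumeration is only feasible for very short inputs; practical systems use dynamic programming.

```python
import itertools
import math

BLANK = "_"

def collapse(path):
    """The CTC mapping B: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)

def ctc_loss(frame_probs, target):
    """frame_probs: one {symbol: P(symbol | X)} dict per frame; target: label string."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            # Probability of one path: product of its per-frame probabilities.
            total += math.prod(frame_probs[t][s] for t, s in enumerate(path))
    if total == 0.0:
        raise ValueError("no valid path produces the target")
    return -math.log(total)              # assumed: standard negative log-likelihood

# Hypothetical two-frame output over the symbols {"a", blank}.
frame_probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]
loss_value = ctc_loss(frame_probs, "a")
```

For the target "a", the valid paths are "aa", "a_", and "_a", with probabilities 0.3, 0.3, and 0.2, so the summed probability is 0.8.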

[0129] Step 506: Fuse the sample speech data and the corresponding sample text data to obtain fused data.

[0130] In some embodiments, sample speech data and corresponding sample text data can be fused using a point-by-point addition method; alternatively, vector concatenation can be used. The point-by-point addition method presupposes that the sample speech data and corresponding sample text data have the same dimension. If the dimensions are the same, corresponding elements are added directly. If the dimensions are different, a linear transformation can be used to convert the sample speech data and corresponding sample text data into data of the same dimension before adding corresponding elements. The vector concatenation method concatenates the sample speech data and corresponding sample text data. After concatenation, a linear mapping is typically used to ensure that the fused data shares the same dimension as the sample speech data and corresponding sample text data, enabling the pre-trained feature extraction model to process and extract features.
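The two fusion options can be sketched with plain Python lists. This is an illustrative simplification: the inputs are assumed to be already encoded as numeric vectors, the projection matrices `proj` are hypothetical fixed weights (in a real model they would be learned), and for point-by-point addition only the text vector is projected to the speech dimension.

```python
def linear(vec, weights):
    """Apply a linear map given as a list of rows: out[r] = sum_c weights[r][c] * vec[c]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def fuse_add(speech, text, proj=None):
    """Point-by-point addition; proj maps text to the speech dimension when they differ."""
    if len(speech) != len(text):
        if proj is None:
            raise ValueError("dimensions differ and no projection was given")
        text = linear(text, proj)
    return [s + t for s, t in zip(speech, text)]

def fuse_concat(speech, text, proj):
    """Concatenate the two vectors, then project the result to the target dimension."""
    return linear(speech + text, proj)

# Same dimension: corresponding elements are added directly.
added = fuse_add([1.0, 2.0], [3.0, 4.0])
# Different dimensions: a hypothetical 2x3 matrix maps the text vector to 2 dimensions.
projected = fuse_add([1.0, 2.0], [1.0, 1.0, 1.0], proj=[[1, 0, 0], [0, 1, 1]])
# Concatenation followed by a hypothetical 2x4 linear mapping.
concatenated = fuse_concat([1.0, 2.0], [3.0, 4.0], proj=[[1, 0, 1, 0], [0, 1, 0, 1]])
```

Addition keeps the dimension unchanged but requires aligned dimensions, whereas concatenation preserves both inputs in full and relies on the subsequent linear mapping to restore the expected dimension.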

[0131] Step 508: Extract features from the fused data using a pre-trained feature extraction model to obtain the feature vector of the fused data, and calculate the second loss based on the feature vector.

[0132] The fused data, which combines sample speech data and corresponding sample text data, is used to extract features through a pre-trained feature extraction model. The resulting features are then used to calculate the second loss corresponding to the pre-trained feature extraction model through a loss function.

[0133] In some embodiments, a second loss can be calculated using a loss function based on the prediction results and the topic tags corresponding to manually defined sample speech data. Suitable loss functions include mean squared error loss, cross-entropy loss, and L1 loss.

[0134] In some specific embodiments, to maximize the probability and match the feature distribution of the current fused data, the cross-entropy loss function can be used to calculate the second loss corresponding to the pre-trained feature extraction model. Specifically, the second loss $L_{ce}$ corresponding to the pre-trained feature extraction model can be calculated using the cross-entropy loss function formula (4):

[0135]

[0136] Where $Y_i$ represents the predicted value corresponding to the i-th sample speech data, $X_i$ represents the topic label corresponding to the i-th sample speech data, which is defined manually, and N represents the number of sample speech data.
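A minimal sketch of a cross-entropy computation over N samples is given below. It uses the common multi-class form with one manually defined topic label per sample and averages over N; the exact form of formula (4), including whether it averages, is not recoverable from the text, so this form is an assumption, as are the function name and the toy predictions.

```python
import math

def cross_entropy(predictions, labels):
    """predictions: one {topic: probability} dict per sample; labels: one topic per sample.

    Returns the mean negative log-probability assigned to the labeled topic.
    """
    if len(predictions) != len(labels):
        raise ValueError("one label is required per prediction")
    return -sum(math.log(pred[label])
                for pred, label in zip(predictions, labels)) / len(labels)

# Hypothetical predicted topic distributions for two samples.
predictions = [{"network": 0.5, "billing": 0.5}, {"network": 0.25, "billing": 0.75}]
labels = ["network", "billing"]
second_loss = cross_entropy(predictions, labels)
```

The loss approaches 0 as the predicted probability of the labeled topic approaches 1, which is why minimizing it pushes the predictions toward the manually defined topic labels.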

[0137] Step 510: Determine the target loss based on the first loss and the corresponding second loss of the sample speech data.

[0138] In some embodiments, the second loss can be multiplied by the maximum path length to obtain the first value; the maximum path length is the maximum path length when the speech recognition module generates the transcription label corresponding to the sample speech data; the first value and the first loss are weighted and summed to obtain the target loss.

[0139] As a specific implementation method, the maximum path length can be first taken as a square root, and then the maximum path length after taking the square root can be multiplied by the second loss to obtain a first value. According to the hyperparameters (1-α) and α corresponding to the first loss and the first value respectively, the first loss and the first value are weighted and summed to obtain the target loss corresponding to the pre-trained feature extraction model.

[0140] Specifically, the target loss $L_{total}$ corresponding to the pre-trained feature extraction model can be calculated using formula (5):

[0141] $$L_{total} = (1 - \alpha)\, L_{ASR} + \alpha \sqrt{T_{max}}\, L_{ce} \tag{5}$$

[0142] Where $L_{ASR}$ represents the first loss, $L_{ce}$ represents the second loss, $T_{max}$ represents the maximum path length, and α is a hyperparameter less than 1 that adjusts the relative weights of $L_{ce}$ and $L_{ASR}$.
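The weighted combination described in paragraph [0139] can be written as a short function: the second loss is scaled by the square root of the maximum path length, and the result is combined with the first loss using the weights α and (1-α). The function name, the default value of `alpha`, and the lower bound of 0 in the range check are assumptions.

```python
import math

def target_loss(loss_asr, loss_ce, t_max, alpha=0.5):
    """Weighted sum of the first loss and the length-scaled second loss."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha is expected to be less than 1 (and non-negative here)")
    first_value = math.sqrt(t_max) * loss_ce       # second loss scaled by sqrt(T_max)
    return (1.0 - alpha) * loss_asr + alpha * first_value

# Example: first loss 2.0, second loss 0.5, maximum path length 16, alpha 0.25.
total = target_loss(2.0, 0.5, 16, alpha=0.25)      # 0.75 * 2.0 + 0.25 * 4.0 * 0.5
```

Scaling by the square root of the maximum path length brings the second loss closer to the magnitude of the first loss, which grows with sequence length, so that a single α can balance the two terms.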

[0143] In this embodiment, a pre-trained feature extraction model is trained using a sample corpus, enabling the model to learn a large amount of semantic information. A first loss is calculated using sample speech data, and a second loss is calculated using fused data of the sample speech data and corresponding sample text data. A target loss is determined based on the first and second losses corresponding to the sample speech data. The pre-trained feature extraction model is then fine-tuned according to the gradient descent direction of the target loss, adjusting the model parameters to meet the model convergence condition. Finally, the feature extraction model is constructed based on the model parameters that finally meet the convergence condition. The training of the feature extraction model is achieved by updating the model parameters along the gradient descent direction of the target loss, increasing the efficiency of iterative solution for model parameters. This makes the trained feature extraction model more closely aligned with maximizing probabilities, effectively matching the feature distribution of the data, and mining and combining acoustic and semantic understanding within the document, resulting in more accurate document classification.

[0144] As shown in Figure 6, in one embodiment, a topic modeling apparatus 600 is provided, which can be applied to the above-described electronic device. The topic modeling apparatus 600 may include a data acquisition module 610, a feature extraction module 620, a clustering module 630, and a topic determination module 640.

[0145] The data acquisition module 610 is used to acquire multiple document data, including text data and corresponding voice data.

[0146] The feature extraction module 620 is used to extract features from the speech data and text data included in the first document data through a feature extraction model to obtain acoustic feature vectors and text feature vectors, and to fuse the acoustic feature vectors and text feature vectors to obtain the first acoustic semantic vector corresponding to the first document data; the first document data is any document data.

[0147] The clustering module 630 is used to cluster the multiple document data according to the first acoustic semantic vectors respectively corresponding to the multiple document data, so as to obtain multiple document categories.

[0148] The topic determination module 640 is used to determine the target topic corresponding to each document category based on the document data contained in each document category.

[0149] In some embodiments, the clustering module 630 is further configured to map the first acoustic semantic vectors corresponding to the multiple document data from the first vector space to the second vector space to obtain the second acoustic semantic vectors corresponding to the multiple document data respectively; the dimension of the first vector space is greater than that of the second vector space; and is further configured to cluster the second acoustic semantic vectors corresponding to the multiple document data respectively according to the density clustering algorithm to obtain multiple document categories, and the multiple document categories correspond one-to-one with the multiple clusters obtained by clustering.

[0150] As an optional implementation, the topic determination module 640 is further configured to treat all document data contained in the first document category as a long document; the first document category is any document category; and determine the target topic corresponding to the first document category based on the word frequency of each word contained in the long document corresponding to the first document category.

[0151] In some embodiments, the topic modeling apparatus 600 further includes a topic optimization module.

[0152] The topic optimization module is used to calculate the similarity between target topics corresponding to any two current document categories; merge two document categories with similarity greater than the similarity threshold, and redetermine the target topics corresponding to the merged document categories to obtain updated document categories and their corresponding target topics; use the updated document categories as the new current document categories, and re-execute the step of calculating the similarity between target topics corresponding to any two current document categories until the similarity between target topics corresponding to any two current document categories is no greater than the similarity threshold.

[0153] Optionally, the topic modeling apparatus 600 also includes a model training module.

[0154] The model training module is used to load the model parameters obtained by pre-training through a sample corpus to construct a pre-trained feature extraction model; acquire multiple sample speech data and corresponding sample text data for each sample speech data; calculate the first loss based on the sample speech data and the second loss based on the sample speech data and the corresponding sample text data through the pre-trained feature extraction model; determine the target loss based on the first loss and the second loss corresponding to the sample speech data; and adjust the parameters of the pre-trained feature extraction model according to the gradient descent direction of the target loss until the model convergence condition is met, thus obtaining the trained feature extraction model.

[0155] As an optional implementation, the model training module is also used to perform speech recognition on the sample speech data through the speech recognition module of the pre-trained feature extraction model to obtain the transcription labels corresponding to the sample speech data; calculate the first loss based on the sample speech data and the corresponding transcription labels; fuse the sample speech data and the corresponding sample text data to obtain fused data; extract features from the fused data through the pre-trained feature extraction model to obtain the feature vector of the fused data, and calculate the second loss based on the feature vector.

[0156] As an optional implementation, the model training module is also used to multiply the second loss by the maximum path length to obtain the first value; the maximum path length is the maximum path length when the speech recognition module generates the transcription label corresponding to the sample speech data; the first value and the first loss are weighted and summed to obtain the target loss.

[0157] In this embodiment, the acoustic and textual features of the document data can be extracted by the feature extraction model, and the document data is clustered by the first acoustic semantic vector that integrates the acoustic and textual features. By mining and combining the acoustic and semantic understanding in the document, the document classification is more accurate, and the quality and interpretability of the determined target topic are improved, making the target topic closer to the real topic of the document. The target topic of each document category is accurately determined, and the accuracy of the determined target topic representation is improved.

[0158] Figure 7 is a structural block diagram of an electronic device in one embodiment. The electronic device can be a mobile phone, tablet computer, smart wearable device, etc. As shown in Figure 7, the electronic device 700 may include one or more of the following components: a processor 710 and a memory 720 coupled to the processor 710, wherein the memory 720 may store one or more computer programs, which may be configured to implement the methods described in the above embodiments when executed by one or more processors 710.

[0159] The processor 710 may include one or more processing cores. The processor 710 connects to various parts within the electronic device 700 using various interfaces and lines, and performs various functions and processes data of the electronic device 700 by running or executing instructions, programs, code sets, or instruction sets stored in the memory 720, and by calling data stored in the memory 720. Optionally, the processor 710 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 710 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 710 and may be implemented separately using a communication chip.

[0160] The memory 720 may include random access memory (RAM) or read-only memory (ROM). The memory 720 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 720 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the various method embodiments described above. The data storage area may also store data created by the electronic device 700 during use.

[0161] Understandably, the electronic device 700 may include more or fewer structural elements than those shown in the block diagram above, such as a power supply, input buttons, a camera, a speaker, a screen, an RF (Radio Frequency) circuit, a Wi-Fi (Wireless Fidelity) module, a Bluetooth module, sensors, etc., which are not limited herein.

[0162] This application discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the methods described in the above embodiments.

[0163] This application discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program can be executed by a processor to implement the methods described in the above embodiments.

[0164] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), etc.

[0165] Any references to memory, storage, databases, or other media used herein may include non-volatile and / or volatile memory. Suitable non-volatile memory may include ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which is used as an external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and Direct Rambus DRAM (DRDRAM).

[0166] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Those skilled in the art should also recognize that the embodiments described in the specification are optional embodiments, and the actions and modules involved are not necessarily essential to this application.

[0167] In the various embodiments of this application, it should be understood that the sequence number of each process does not necessarily imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0168] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0169] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0170] If the aforementioned integrated units are implemented as software functional units and sold or used as independent products, they can be stored in a computer-accessible memory. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions that cause a computer device (which can be a personal computer, server, or network device, and specifically a processor in the computer device) to execute some or all of the steps of the methods described in the various embodiments of this application.

[0171] The foregoing has provided a detailed description of the topic modeling method, device, electronic equipment, and computer-readable storage medium disclosed in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation of this application. The descriptions of the embodiments above are intended only to help in understanding the method and its core ideas. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, the specific implementation and the scope of application may vary. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A topic modeling method, characterized in that the method comprises: acquiring multiple document data, the document data including text data and speech data corresponding to the text data; extracting features from the speech data and text data included in first document data by a feature extraction model to obtain acoustic feature vectors and text feature vectors, and fusing the acoustic feature vectors and the text feature vectors to obtain a first acoustic semantic vector corresponding to the first document data, the first document data being any one of the multiple document data; clustering the multiple document data based on the first acoustic semantic vectors respectively corresponding to the multiple document data to obtain multiple document categories; and determining a target topic corresponding to each document category based on the document data contained in that document category; wherein the training process of the feature extraction model comprises: loading model parameters obtained from pre-training on a sample corpus set to construct a pre-trained feature extraction model; acquiring multiple sample speech data and the sample text data corresponding to each sample speech data; performing speech recognition on the sample speech data by the speech recognition module of the pre-trained feature extraction model to obtain transcription labels corresponding to the sample speech data; calculating a first loss based on the sample speech data and the corresponding transcription labels; fusing the sample speech data and the corresponding sample text data to obtain fused data; performing feature extraction on the fused data using the pre-trained feature extraction model to obtain a feature vector of the fused data, and calculating a second loss based on the feature vector; multiplying the second loss by the square root of a maximum path length to obtain a first value, the maximum path length being the maximum path length when the speech recognition module generates the transcription labels corresponding to the sample speech data; performing a weighted summation of the first value and the first loss to obtain a target loss; and adjusting the parameters of the pre-trained feature extraction model along the gradient descent direction of the target loss until a model convergence condition is met, to obtain the trained feature extraction model.

2. The method according to claim 1, characterized in that the step of clustering the multiple document data based on the first acoustic semantic vectors respectively corresponding to the multiple document data to obtain multiple document categories comprises: mapping the first acoustic semantic vectors corresponding to the multiple document data from a first vector space to a second vector space to obtain second acoustic semantic vectors corresponding to the multiple document data, the dimension of the first vector space being larger than that of the second vector space; and clustering the second acoustic semantic vectors corresponding to the multiple document data according to a density clustering algorithm to obtain multiple document categories, the multiple document categories corresponding one-to-one with the multiple clusters obtained by the clustering.

3. The method according to claim 1, characterized in that the step of determining the target topic corresponding to each document category based on the document data contained in each document category comprises: treating all document data contained in a first document category as a single long document, the first document category being any one of the multiple document categories; and determining the target topic corresponding to the first document category based on the word frequency of each word contained in the long document corresponding to the first document category.

4. The method according to claim 3, characterized in that, after determining the target topic corresponding to each document category, the method further comprises: calculating the similarity between the target topics corresponding to any two current document categories; merging two document categories whose similarity is greater than a similarity threshold, and re-determining the target topic corresponding to the merged document category, to obtain updated document categories and the target topic corresponding to each updated document category; and taking the updated document categories as the new current document categories and repeating the step of calculating the similarity between the target topics corresponding to any two current document categories, until the similarity between the target topics corresponding to any two current document categories is no greater than the similarity threshold.

5. A topic modeling device, characterized in that the device comprises: a data acquisition module, used to acquire multiple document data, the document data including text data and speech data corresponding to the text data; a feature extraction module, used to extract features from the speech data and text data included in first document data using a feature extraction model to obtain acoustic feature vectors and text feature vectors, and to fuse the acoustic feature vectors and the text feature vectors to obtain a first acoustic semantic vector corresponding to the first document data, the first document data being any one of the multiple document data; a clustering module, used to cluster the multiple document data based on the first acoustic semantic vectors respectively corresponding to the multiple document data to obtain multiple document categories; a topic determination module, used to determine the target topic corresponding to each document category based on the document data contained in that document category; and a model training module, used to: load model parameters obtained from pre-training on a sample corpus set to construct a pre-trained feature extraction model; acquire multiple sample speech data and the sample text data corresponding to each sample speech data; perform speech recognition on the sample speech data using the speech recognition module of the pre-trained feature extraction model to obtain transcription labels corresponding to the sample speech data; calculate a first loss based on the sample speech data and the corresponding transcription labels; fuse the sample speech data and the corresponding sample text data to obtain fused data; extract features from the fused data using the pre-trained feature extraction model to obtain a feature vector of the fused data, and calculate a second loss based on the feature vector; multiply the second loss by the square root of a maximum path length to obtain a first value, the maximum path length being the maximum path length when the speech recognition module generates the transcription labels corresponding to the sample speech data; perform a weighted summation of the first value and the first loss to obtain a target loss; and adjust the parameters of the pre-trained feature extraction model along the gradient descent direction of the target loss until a model convergence condition is met, to obtain the trained feature extraction model.

6. An electronic device, characterized in that the electronic device comprises a memory and a processor, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the method as described in any one of claims 1 to 4.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1 to 4.
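
As an illustration only, the loss combination recited in claim 1 (second loss scaled by the square root of the maximum path length, then weighted and summed with the first loss) can be sketched as follows. The function name and the weights `w_first` and `w_second` are hypothetical; the claims do not specify the weight values or how each individual loss is computed.

```python
import math


def target_loss(first_loss, second_loss, max_path_length,
                w_first=0.5, w_second=0.5):
    """Combine the two training losses as described in claim 1.

    w_first and w_second are assumed placeholder weights, not values
    taken from the patent.
    """
    if max_path_length < 0:
        raise ValueError("max_path_length must be non-negative")
    # "First value": second loss multiplied by sqrt(maximum path length).
    first_value = second_loss * math.sqrt(max_path_length)
    # Weighted summation of the first value and the first loss.
    return w_first * first_loss + w_second * first_value


# Example: first loss 2.0, second loss 0.5, maximum path length 16.
loss = target_loss(2.0, 0.5, 16)
print(loss)
```

In an actual training loop this scalar would be the quantity whose gradient drives the parameter update; the sketch covers only the arithmetic of the combination.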
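
The two steps of claim 2 (mapping vectors to a lower-dimensional space, then density clustering) can be sketched roughly as below. This is a minimal sketch under stated assumptions: the claim does not name a specific reduction technique or density clustering algorithm, so a fixed linear projection and a plain DBSCAN-style procedure are swapped in here, and the names `project`, `dbscan`, `eps`, and `min_samples` are illustrative.

```python
import math


def project(vectors, matrix):
    """Map each vector to a lower-dimensional space via y = M x.

    The projection matrix is an assumed stand-in for whatever
    dimensionality reduction an implementation would use.
    """
    return [[sum(m * x for m, x in zip(row, vec)) for row in matrix]
            for vec in vectors]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Neighborhood of each point (the point itself is included).
    neighbors = [[j for j in range(n)
                  if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Toy "first acoustic semantic vectors" in 3-D, reduced to 2-D.
vectors = [[0, 0, 5], [0.1, 0, 1], [0, 0.1, 9],
           [5, 5, 0], [5.1, 5, 3], [5, 5.1, 2],
           [20, 20, 0]]
matrix = [[1, 0, 0], [0, 1, 0]]  # keeps the first two coordinates
reduced = project(vectors, matrix)
labels = dbscan(reduced, eps=0.5, min_samples=2)
print(labels)
```

Each non-negative label corresponds to one document category, matching the one-to-one relation between clusters and categories stated in the claim; how points labelled as noise are handled is left open here.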
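
The topic determination of claim 3 (treat a category as one long document, then rank words by frequency) can be illustrated with a short sketch. The tokenizer, the stopword set, and the `top_k` cut-off are assumptions made for the example; the claim only states that word frequency is used, and Chinese text would need a word segmenter rather than the simple regular expression used here.

```python
import re
from collections import Counter


def topic_words(category_docs, top_k=3, stopwords=frozenset()):
    """Return the top_k most frequent words of one document category."""
    # All documents of the category are treated as a single long document.
    long_document = " ".join(category_docs)
    tokens = re.findall(r"[a-z0-9]+", long_document.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(top_k)]


docs = ["The bill is wrong", "Refund the bill", "Bill refund please"]
topic = topic_words(docs, top_k=2, stopwords={"the", "is", "please"})
print(topic)
```

Raw counts are used for simplicity; a weighting scheme that discounts words common to every category would be a natural refinement, but the claim text shown here does not specify one.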
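
The iterative merging of claim 4 can be sketched as a loop that recomputes topics after every merge and stops once no pair of categories exceeds the threshold. The similarity measure is not specified in the claim; Jaccard similarity over top-word sets is assumed here purely for illustration, and `top_words`, `jaccard`, and `merge_similar` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def top_words(docs, top_k=2):
    """Topic of a category: set of its top_k most frequent words."""
    counts = Counter(" ".join(docs).lower().split())
    return {word for word, _ in counts.most_common(top_k)}


def jaccard(a, b):
    """Assumed similarity measure between two topics (sets of words)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(categories, threshold, top_k=2):
    """Merge categories until no two topics exceed the threshold."""
    categories = [list(c) for c in categories]
    while True:
        # Re-determine the topic of every current category.
        topics = [top_words(c, top_k) for c in categories]
        pair = next(((i, j)
                     for i, j in combinations(range(len(categories)), 2)
                     if jaccard(topics[i], topics[j]) > threshold), None)
        if pair is None:
            return categories, topics
        i, j = pair  # i < j, so removing j leaves index i valid
        merged_docs = categories.pop(j)
        categories[i].extend(merged_docs)


categories = [
    ["bill refund bill", "refund bill"],
    ["refund bill refund"],
    ["network outage network"],
]
merged, topics = merge_similar(categories, threshold=0.5)
print(len(merged), topics)
```

Only one pair is merged per pass so that topics are always recomputed before the next comparison, which mirrors the repeat-until condition in the claim.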