Dual-target self-supervised medical question text clustering method and system with multi-feature fusion
By employing a dual-objective self-supervised method that integrates multiple features, and combining word frequency and lexical semantic information with BiGRU and self-attention mechanisms, the problem of feature sparsity and non-end-to-end representation of medical problem texts is solved, achieving more efficient text clustering results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF MEDICAL INFORMATION CHINESE ACAD OF MEDICAL SCI
- Filing Date
- 2023-05-15
- Publication Date
- 2026-06-19
AI Technical Summary
Existing text clustering methods suffer from sparse features and non-end-to-end feature representation in medical problem texts, making it difficult to effectively utilize text structure dependencies and semantic information.
A dual-objective self-supervised method with multi-feature fusion is adopted. By extracting word frequency information and lexical semantic information, and combining BiGRU model and self-attention mechanism, cross-document topic information objective function and clustering objective function are constructed to achieve end-to-end text clustering.
It improves the feature representation capabilities of medical issue texts, enhances clustering accuracy and network generalization ability, and achieves more user-friendly text clustering results.
Figure CN116543406B_ABST
Description
Technical Field
[0001] This invention relates to the field of text processing technology, and in particular to a multi-feature fusion dual-objective self-supervised method and system for clustering medical question texts.
Background Technology
[0002] Short text clustering is a challenging and important task in many practical applications. However, many bag-of-words-based short text clustering methods are often limited by the sparsity of text representations, while neural network-based word embedding methods cannot capture the document structure dependencies in text corpora. Furthermore, traditional text clustering algorithms separate text representation from the clustering process. Self-supervised learning, a special form of unsupervised learning, is designed for unlabeled data. By using the clustering results as the target for self-supervised training to guide deep networks to learn better representations, and through joint training of deep networks and the clustering process, the learned high-quality text representations help improve the performance of clustering algorithms, increase clustering accuracy and network generalization ability, and change the non-end-to-end nature of feature representation and clustering algorithms in traditional clustering frameworks.
[0003] Medical question-and-answer data is unstructured text characterized by short length and non-standard terminology. Features extracted by a single model often fail to fully represent the text content. It is therefore necessary to combine other models to construct fused features, enriching the semantic information of the feature vectors for better text representation. How to integrate more information into the question text while constructing an end-to-end text clustering model, specifically a multi-feature fusion-based, dual-objective, self-supervised medical question text clustering model, has thus become a pressing technical problem.
Summary of the Invention
[0004] To address the aforementioned problems, the purpose of this invention is to provide a multi-feature fusion-based bi-objective self-supervised medical problem text clustering method and system, which can effectively improve the feature representation capability of problem texts.
[0005] To achieve the above objectives, the present invention adopts the following technical solution: a multi-feature fusion dual-objective self-supervised medical question text clustering method, which includes: extracting word frequency information and lexical semantic information from medical question text data sources, fusing word frequency information and lexical semantic information to obtain weighted word vectors; using the weighted word vectors as input to a BiGRU model, performing deep learning and introducing a self-attention mechanism, calculating the weight of each word in the medical question text, and obtaining the semantic relationship between words; constructing a cross-document topic information objective function based on the semantic relationship between words, and fusing the clustering objective function into the network loss function of the learned representation, constructing a joint loss function with the cross-document topic information objective function, and fusing the text representation and clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering.
[0006] Furthermore, word frequency information and lexical semantic information are fused to obtain weighted word vectors, including:
[0007] The lexical semantic features produced by the fast text classifier fastText are weighted by the term frequency-inverse document frequency (TF-IDF) to reduce the influence of non-keyword features, improve topic discrimination ability, and obtain the weight features of words in medical question texts.
[0008] The weighted word vectors are obtained by multiplying the semantic features of the words with the corresponding TF-IDF values.
[0009] Furthermore, before extracting word frequency and semantic information from the medical question text data source, the process also includes steps to process the medical question text data:
[0010] The medical question text data is cleaned to remove meaningless noise data, and misused or incorrect punctuation marks are corrected.
[0011] After data cleaning, the medical question text data is segmented into words, and the categories are then labeled.
[0012] Furthermore, weighted word vectors are used as input to the BiGRU model for deep learning, and a self-attention mechanism is introduced, including:
[0013] BiGRU is used as the nonlinear function f1 to obtain information about long-term dependent neurons. Each GRU unit is equipped with two sigmoid gates: a reset gate and an update gate. The dependency relationship between adjacent words in a single question text is obtained through a bidirectional structure.
[0014] By introducing a self-attention mechanism on top of BiGRU, different aspects of a sentence can be extracted into multiple vector representations; more information can be obtained without any additional input, and the hidden representations at all times can be directly accessed.
[0015] Furthermore, the multifaceted mechanisms of self-attention include:
[0016]
[0017]
[0018] In the formula, W1 and W2 are linear layers with learnable parameters, tanh is used as an activation layer to introduce a nonlinear transformation, T is the transpose of the matrix, H is the output feature of the GRU, and Z represents the output feature of the linear layer; A is the self-attention matrix, exp represents the exponential function, and each value in A represents the correlation between the feature value and other feature values; by multiplying by the attention matrix A and the input H, the intermediate features enhanced by the self-attention mechanism are obtained. The output features are then calculated through a linear projection layer.
[0019] Furthermore, the objective function for cross-document topic information is:
[0020]
[0021]
[0022] In the formula, L_topic is the cross-document topic information objective function, computed as a KL divergence; d_ij ∈ D represents the predicted probability that sample i belongs to topic j, and f_ij ∈ F represents the probability that sample i belongs to topic j under F, the document-topic distribution of the text calculated by the topic model LDA; the subscripts i and j denote sample i and topic j, respectively. D is a probability distribution, D ∈ R^{1×r}, where r is the length of the document-topic distribution vector; ReLU is the rectified linear unit activation function, and W1 and W2 are linear layers with learnable parameters.
[0023] Furthermore, the joint loss function is:
[0024] L_train = αL_cluster + L_topic
[0025] where L_train is the joint loss function, L_cluster is the clustering objective function, L_topic is the cross-document topic information objective function, and α is a hyperparameter.
[0026] A multi-feature fusion dual-objective self-supervised medical question text clustering system includes: a first processing module, which extracts word frequency information and lexical semantic information from the medical question text data source, and fuses the word frequency information and lexical semantic information to obtain weighted word vectors; a second processing module, which uses the weighted word vectors as input to a BiGRU model, performs deep learning and introduces a self-attention mechanism, calculates the weight of each word in the medical question text, and obtains the semantic relationship between words; and a third processing module, which constructs a cross-document topic information objective function based on the semantic relationship between words, and integrates the clustering objective function into the network loss function of the learned representation, constructs a joint loss function with the cross-document topic information objective function, and integrates the text representation and clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering.
[0027] A computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.
[0028] A computing device includes: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
[0029] The present invention has the following advantages due to the adoption of the above technical solutions:
[0030] This invention learns and integrates cross-document topic features to construct an end-to-end text clustering model. It establishes two objective functions, integrates them into a unified clustering framework, and learns a representation that is conducive to text clustering. This addresses the separation of text representation from the clustering process in current text clustering algorithms, as well as the inherent limitations of medical question texts, namely their short length and sparse features.
Attached Figure Description
[0031] Figure 1 is a flowchart of the dual-objective self-supervised medical question text clustering method with multi-feature fusion in an embodiment of the present invention;
[0032] Figure 2 is a visualization of the medical question text clustering results based on the traditional spatial vector model in an embodiment of the present invention;
[0033] Figure 3 is a visualization of the medical question text clustering results obtained by combining the bag-of-words model and word embedding methods in an embodiment of the present invention;
[0034] Figure 4 is a visualization of the medical question text clustering results obtained by extracting high-order features and reducing dimensionality through an autoencoder in an embodiment of the present invention;
[0035] Figure 5 is a visualization of the medical question text clustering results of MF-DSC in an embodiment of the present invention;
[0036] Figure 6 is a schematic diagram comparing the NMI values of different models on different datasets in an embodiment of the present invention.
Detailed Implementation
[0037] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the described embodiments of the present invention are within the scope of protection of the present invention.
[0038] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0039] To address the issues of current text clustering algorithms separating text representation from the clustering process, and the limited number of characters and sparse features in medical question texts, this invention provides a dual-target self-supervised clustering method and system for medical question texts using multi-feature fusion. This invention improves the feature representation capability of question texts, changing the non-end-to-end nature of feature representation and clustering algorithms in traditional clustering frameworks. It employs a dual-target self-supervised clustering algorithm (MF-DSC) with multi-feature fusion. First, word weights are obtained based on word frequency information, and word vectors are generated from word semantic information. The word frequency and semantic information are fused to obtain a weighted word vector, which is used as the overall input of the model for deep learning. A self-attention mechanism is introduced to calculate the weight of each word in the question text, i.e., the mutual influence between words. To learn and fuse cross-document topic features while constructing an end-to-end text clustering model, two objective functions are constructed and fused into a unified clustering framework to learn a friendly representation conducive to text clustering.
[0040] In one embodiment of the present invention, a multi-feature fusion-based dual-objective self-supervised medical question text clustering method is provided. As shown in Figure 1, the method includes the following steps:
[0041] 1) Extract word frequency information and lexical semantic information from medical problem text data sources, and fuse word frequency information and lexical semantic information to obtain weighted word vectors;
[0042] 2) The weighted word vectors are used as input to the BiGRU model for deep learning and a self-attention mechanism is introduced to calculate the weight of each word in the medical question text and obtain the semantic relationship between words.
[0043] 3) Construct a cross-document topic information objective function based on the semantic relationship between words, and integrate the clustering objective function into the network loss function of the learning representation. Construct a joint loss function with the cross-document topic information objective function, and integrate the text representation and clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering.
[0044] In this embodiment, the text representation is input into the BiGRU model for deep learning of high-quality semantic information, and a self-attention mechanism is introduced to obtain the correlation between words in the question text. To construct an end-to-end text clustering model and obtain a more user-friendly clustering representation, the text representation and clustering results are integrated into a unified framework. The clustering objective function is fused into the network loss function for learning the representation. Simultaneously, to ensure the text focuses on full document information, a cross-document topic information objective function is introduced. These are integrated into a unified framework to construct a joint loss function, achieving dual-objective self-supervision.
[0045] In step 1) above, word frequency features are used to obtain word weights, word semantic features are used to generate word vectors, and the fusion of word frequency information and word semantic information yields weighted word vectors.
[0046] In this embodiment, the specific steps for fusing word frequency information and lexical semantic information to obtain weighted word vectors include the following:
[0047] 1.1.1) The lexical semantic features produced by the fast text classifier fastText are weighted by the term frequency-inverse document frequency (TF-IDF) to reduce the influence of non-keyword features, improve topic discrimination ability, and obtain the weight features of words in medical question texts.
[0048] 1.1.2) Multiply the semantic features of the words by the corresponding TF-IDF values to obtain weighted word vectors.
[0049] In step 1) above, a medical Q&A community is used as the data source, and Python is used to obtain question titles from the medical Q&A community as the research object. Because the questions in the Q&A community have problems such as semantic ambiguity and unscientific and non-standard expressions, this invention requires data preprocessing. Therefore, before extracting word frequency information and lexical semantic information from the medical question text data source, the invention also includes a step of processing the medical question text data:
[0050] 1.2.1) Clean the medical problem text data to remove meaningless noise data, and correct the problems of misuse or incorrect use of punctuation marks;
[0051] Specifically, minor adjustments are made to address a few inaccurate or non-standard expressions, a medical thesaurus for liver cancer is constructed, and professional terminology is standardized, for example by replacing non-standard variants of a term with the standard form "liver cancer".
[0052] 1.2.2) The medical problem text data after data cleaning is segmented into words, and then the categories are labeled.
[0053] In this embodiment, pkuseg is used for word segmentation, supporting domain-specific word segmentation. The clustering effect is evaluated by clearly defining the question text categories. A category system is determined through existing technologies and expert consultation, resulting in a clear structure and prominent domain characteristics. Based on existing classifications of medical and health information and considering the actual situation of medical questions in medical Q&A communities, medical questions are divided into 12 major categories, as shown in Table 1 below. The categories are labeled and reviewed according to the medical question category classification system.
[0054] Table 1. Classification System of Medical Problems
[0055]
[0056] In this embodiment, the fast text classifier fastText has an input layer, a hidden layer, and an output layer. It is an improvement on skip-gram that decomposes each word in the input context into character n-grams, which capture the internal order of characters within a word. For example, with n = 3 the word "primary liver cancer" is decomposed into five trigrams, and the word vector of "primary liver cancer" can be represented by the superposition of the five decomposed subword vectors. The lexical semantic features generated by the fastText model therefore contain semantic and partial word order information. The question texts in medical Q&A communities contain few words, and the contextual semantic and word order information of the feature words is significantly lacking. Using the fastText model to extract lexical semantic features, the lexical semantic features of the medical question text D are represented as follows:
[0057]
[0058] where a_k is the vector of the word w_k in the question text D, and S is the vector dimension.
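The subword decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the fastText implementation: the boundary markers `<` and `>`, the helper names, and the toy 2-dimensional vectors are assumptions, and real fastText also hashes n-grams, uses a range of n values, and includes the whole word as a feature.

```python
def char_ngrams(word, n=3, bow="<", eow=">"):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"{bow}{word}{eow}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def subword_vector(word, ngram_vectors, dim, n=3):
    """Sum the vectors of all known n-grams of `word` (zeros if none are known)."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


# "<liver>" has 7 characters, so it yields 5 trigrams.
grams = char_ngrams("liver")
toy_vectors = {g: [1.0, 0.5] for g in grams}  # hypothetical 2-d subword vectors
vec = subword_vector("liver", toy_vectors, dim=2)
```

Summing the subword vectors is what lets a rare or misspelled medical term still receive a usable vector, since it shares n-grams with better-known words.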
[0059] The semantic weights of words in fastText are weighted using the TF-IDF (Term Frequency-Inverse Text Frequency Index) to reduce the influence of non-keyword features, thereby improving topic discrimination. The weighted features W of words in the question text D are:
[0060] W = [tfidf_1, tfidf_2, ..., tfidf_k]
[0061] where tfidf_k is the TF-IDF value of the word w_k in the question text D; the higher the TF-IDF value, the greater the word's importance.
[0062] The weighted lexical semantic features FW of the question text D are obtained by multiplying the lexical semantic features by the corresponding TF-IDF values.
[0063]
[0064] where FW is a real-valued matrix.
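The fusion step, scaling each word vector by its TF-IDF weight, can be sketched as follows. This is an illustrative sketch under stated assumptions: the unsmoothed IDF formula log(N / df), the function names, and the toy embeddings are not taken from the patent, which does not specify the TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


def weighted_word_vectors(doc, weights, embeddings):
    """Scale each word's embedding by its TF-IDF weight, preserving word order."""
    return [[weights[w] * x for x in embeddings[w]] for w in doc]


# Toy corpus and hypothetical 2-d embeddings standing in for fastText output.
docs = [["liver", "cancer", "symptoms"], ["liver", "diet"]]
embeddings = {
    "liver": [0.2, 0.4], "cancer": [0.6, 0.1],
    "symptoms": [0.3, 0.3], "diet": [0.5, 0.9],
}
weights = tfidf_weights(docs)
FW = weighted_word_vectors(docs[0], weights[0], embeddings)  # k x S matrix
```

In this toy corpus "liver" occurs in every document, so its weight is zero and its vector is suppressed, which is the intended effect of down-weighting non-keyword features.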
[0065] In step 2) above, the weighted word vectors are used as input to the BiGRU model for deep learning, and a self-attention mechanism is introduced, including the following steps:
[0066] 2.1) BiGRU is used as the nonlinear function f1 to obtain information about long-term dependent neurons. Each GRU unit is equipped with two Sigmoid gates: a reset gate and an update gate. The dependency relationship between adjacent words in a single question text is obtained through a bidirectional structure.
[0067] 2.2) A self-attention mechanism is introduced on the basis of BiGRU, which allows different aspects of a sentence to be extracted into multiple vector representations; more information can be obtained without any other additional input, and the hidden representations at all times can be directly accessed.
[0068] In this embodiment, the encoder is essentially an RNN that encodes the input sequence into feature representations. For word sequence prediction, any question text sequence is given as input, where k is the number of words in the question text and S is the dimension of the pre-trained word embeddings, which is also the topic dimension. The encoder then learns the mapping from x_t to h_t at time step t through f1, where h_t is the hidden state of the encoder at time t, m is the size of the hidden state, and f1 is a nonlinear function. This invention uses BiGRU as f1 to capture long-term dependency information. GRU units are used because they accumulate information over time intervals, which helps overcome the vanishing gradient problem and better capture long-term correlations in the sequence. Each GRU unit has two sigmoid gates, a reset gate r_t and an update gate z_t. The GRU unit is updated as follows:
[0069] r_t = σ(W_r [h_{t-1}, x_t])
[0070] z_t = σ(W_z [h_{t-1}, x_t])
[0071]
[0072]
[0073] y_t = σ(W_o h_t)
[0074] where [h_{t-1}, x_t] is the concatenation of the previous hidden state h_{t-1} and the current input x_t; W_o is the output weight, and y_t is the output at time t after the activation function is applied; h̃_t is the candidate hidden state at time t; W denotes the parameters to be learned, and σ is the sigmoid function. A bidirectional structure is used to obtain the dependencies between adjacent words in a single question text.
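For reference, the full set of GRU updates in one commonly used formulation is given below. The equations for paragraphs [0071] and [0072] are not reproduced in this text, so the candidate-state and hidden-state lines are an assumption based on the standard GRU rather than the patent's own formulas; the weight name W_h̃ and the direction of the z_t interpolation (some references swap z_t and 1 − z_t) are likewise assumptions.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r\,[h_{t-1}, x_t]\right) \\
z_t &= \sigma\left(W_z\,[h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W_{\tilde{h}}\,[r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
y_t &= \sigma\left(W_o\,h_t\right)
\end{aligned}
```

Here ⊙ denotes element-wise multiplication: the reset gate controls how much of the previous state enters the candidate, and the update gate interpolates between the previous state and the candidate.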
[0075]
[0076]
[0077]
[0078] Finally, the forward and backward hidden states are concatenated horizontally to obtain the hidden state h_t, where u is the number of hidden units in each unidirectional GRU. For simplicity, all h_t are denoted together as H, where k is the number of words in the input question text.
[0079] In this embodiment, the self-attention mechanism is a variant of the attention mechanism. Unlike the attention mechanism, its encoder and decoder parts are the same text. It calculates the weight of each word in the text based on the word distribution itself, i.e., the mutual influence between words. This reduces reliance on external information and is better at capturing the internal correlations of data or features. It assigns different weights to each feature, strengthens complementary feature information, and weakens conflicting parts. Most existing technologies use the final hidden state of an RNN or RNN hidden states through max (or average) pooling to create simple vector representations. However, semantics are relatively difficult to achieve across all time steps in a recurrent neural network model, and some lexical semantics are not essential. Unlike existing methods, this invention introduces a self-attention mechanism that allows different aspects of a sentence to be extracted into multiple vector representations. Specifically, in this embodiment, a self-attention mechanism is introduced on top of BiGRU in the proposed sentence embedding model. This allows for the acquisition of more information without additional input, and because it can directly access hidden representations across all time steps, it alleviates some of the long-term memory burden on the encoder.
[0080] Specifically, the multifaceted mechanisms of self-attention are:
[0081] Z = W2·tanh(W1·H^T)
[0082]
[0083]
[0084] In the formula, W1 and W2 are linear layers with learnable parameters, tanh is used as an activation layer to introduce a nonlinear transformation, T is the transpose of the matrix, H is the output feature of the GRU, and Z represents the output feature of the linear layer. The self-attention matrix A is calculated using the Softmax function, where exp represents an exponential function, and each value in A represents the correlation between the feature value and other feature values. The intermediate features enhanced by the self-attention mechanism are obtained by multiplying the attention matrix A and the input H. The output features are then calculated through a linear projection layer.
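The attention computation described above can be sketched in plain Python. This is a minimal sketch, assuming that the softmax is taken over the k word positions in each row of Z and that the enhanced features are the product A·H; the matrix sizes, the toy weights, and the helper names are illustrative assumptions, not values from the patent.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(H, W1, W2):
    """Compute Z = W2 tanh(W1 H^T), A = softmax(Z), and return (A, A H)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(H))]
    Z = matmul(W2, hidden)
    A = softmax_rows(Z)
    return A, matmul(A, H)


# Toy sizes (assumptions): k = 3 words, 2u = 2 GRU features, 2 attention aspects.
H = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # k x 2u BiGRU outputs
W1 = [[0.5, -0.5], [0.25, 0.75]]          # d_a x 2u
W2 = [[1.0, 0.0], [0.0, 1.0]]             # aspects x d_a
A, enhanced = self_attention(H, W1, W2)
```

Each row of A is one "aspect": a distribution over the words of the question, so multiplying by H yields one weighted summary of the sentence per aspect.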
[0085] Subsequently, the output features are calculated using a linear projection layer, as follows:
[0086]
[0087] In the formula, W1 and W2 are linear layers with learnable parameters, and tanh is used as the activation function to introduce a nonlinear transformation. M ∈ R^{1×r} is the final text representation vector, where r is the vector dimension.
[0088] In step 3) above, in order to construct an end-to-end text clustering model while introducing cross-document topic information to obtain a more clustering-friendly representation, a cross-document topic information objective function L_topic is introduced. At the same time, the clustering objective function L_cluster is integrated into the network loss function for learning the representation, and a joint loss function is constructed, thus fusing text representation and clustering results into a unified framework.
[0089] To further learn cross-document-topic information, the document-topic distribution F is used as the self-supervised training objective. Under the assumption that there is a one-to-one mapping relationship between document clusters and topics, the autoencoder BiGRU can learn a relatively clustering-friendly representation that integrates cross-document information during the pre-training stage. However, it is necessary to obtain a prediction vector with the same dimension as the corresponding document-topic vector. Two fully connected layers are used as the decoder.
[0090]
[0091] D ∈ R^{1×r}, where r is the length of the document-topic distribution vector. ReLU is the rectified linear unit activation function, which sets negative values in the input to zero, while W1 and W2 are linear layers with learnable parameters used to reduce the dimensionality of the text representation vector M to match the document-topic vector. The result d_ij ∈ D represents the probability that sample i belongs to topic j, and D is a probability distribution.
[0092] Specifically, the cross-document topic information objective function L_topic is:
[0093]
[0094] In the formula, L_topic is the cross-document topic information objective function, computed as a KL divergence; d_ij ∈ D represents the probability that sample i belongs to topic j, where D is a probability distribution; f_ij ∈ F represents the probability that sample i belongs to topic j, where F is the document-topic distribution of the text calculated by the topic model LDA; and the subscripts i and j denote sample i and topic j, respectively.
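A KL-divergence topic loss of this kind can be sketched as follows. The formula in paragraph [0093] is not reproduced in this text, so the direction KL(F || D), the averaging over the batch, and the epsilon guard are assumptions made for illustration; the values of F and D are toy data.

```python
import math


def kl_divergence(f_row, d_row, eps=1e-12):
    """KL(f || d) for one sample; terms with f == 0 contribute nothing."""
    return sum(f * math.log(f / max(d, eps)) for f, d in zip(f_row, d_row) if f > 0)


def topic_loss(F, D):
    """Average KL(F_i || D_i) over all samples in the batch."""
    return sum(kl_divergence(f, d) for f, d in zip(F, D)) / len(F)


F = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]  # LDA document-topic targets (toy values)
D = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]  # network predictions (toy values)
loss = topic_loss(F, D)
```

The loss is zero only when the predicted distribution matches the LDA target for every sample, which is what pushes the text representation to carry cross-document topic information.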
[0095] In step 3) above, the joint loss function is:
[0096] L_train = αL_cluster + L_topic
[0097] where L_train is the joint loss function, L_cluster is the clustering objective function, L_topic is the cross-document topic information objective function, and α is a hyperparameter. During the clustering process, L_train is continuously optimized to update the model parameters. It is worth noting that in the early stages of iteration the features extracted by the model are not ideal, so an excessively large α may cause L_cluster to have side effects. A simple ramp-up strategy is therefore used to adjust α, making it gradually increase with training iterations, as shown in the following equation:
[0098]
[0099] where α_max = 0.3, t is the number of iterations, and T′ is the fixed maximum number of iterations.
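One way to realize such a ramp-up schedule is sketched below. The equation in paragraph [0098] is not reproduced in this text, so the linear form α = α_max · t / T′ is an assumption chosen only because it is the simplest schedule consistent with the description; the function names are hypothetical.

```python
def ramp_alpha(t, t_max, alpha_max=0.3):
    """Linearly increase alpha from 0 to alpha_max over t_max iterations."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    progress = min(max(t, 0), t_max) / t_max
    return alpha_max * progress


def joint_loss(l_cluster, l_topic, alpha):
    """L_train = alpha * L_cluster + L_topic."""
    return alpha * l_cluster + l_topic


# Early iterations give the clustering term little weight; later ones give more.
early = joint_loss(l_cluster=2.0, l_topic=1.0, alpha=ramp_alpha(0, 100))
late = joint_loss(l_cluster=2.0, l_topic=1.0, alpha=ramp_alpha(100, 100))
```

Starting at zero means the network is first shaped by the topic objective alone, and the clustering term only gains influence once the features are more reliable.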
[0100] The clustering objective function L_cluster is defined as follows. To construct an end-to-end text clustering model and provide a clustering-friendly representation for the autoencoder BiGRU, a batch-based clustering loss is proposed to improve the clustering performance of the output features. This loss is expressed as follows:
[0101]
[0102] where m is the number of samples in each batch, K is the number of cluster centers in each batch (i.e., the number of clusters, 12), c denotes the cluster centers in each batch, and x denotes the sample features in each batch. In each training iteration, the K-Means algorithm is used to cluster the features within the batch to obtain the cluster centers c, and L_cluster then drives the features within the batch to converge toward the cluster centers c.
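The batch-level procedure can be sketched as follows. The formula in paragraph [0101] is not reproduced in this text, so the loss used here, the mean squared distance from each feature to its nearest center, is an assumption; the deterministic seeding of K-Means with the first k points and the toy batch are also illustrative choices. The sketch only computes the loss value; in a PyTorch training loop the gradient with respect to the features would come from automatic differentiation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; the first k points seed the centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def cluster_loss(points, centers):
    """Mean squared distance from each feature to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


batch = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]  # toy batch features
centers = kmeans(batch, k=2)
loss = cluster_loss(batch, centers)
```

Minimizing this quantity with respect to the features pulls each sample toward its assigned center, which is the converging behavior the paragraph above describes.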
[0103] Example: This example further illustrates the present invention and provides corresponding experimental evidence.
[0104] Experimental Environment and Evaluation Metrics: The programming language used in the experiment was Python 3.8, the deep learning framework was PyTorch 1.9.0, and the experiments ran on a workstation running Ubuntu 16.04. This workstation had 8 NVIDIA TITAN RTX 24GB GPUs, 256GB of RAM, and 80... Gold 6248 CPU @ 2.50GHz. The experiment used the AdamW optimizer with a learning rate of 1e-3 and a linear decay strategy, and trained for 32 epochs with a batch size of 256.
[0105] The BiGRU fully connected network is set to 2 layers, with 50 and 100 units respectively. The generated sentence representation vector dimension is set to 14. The initial parameters are generated by random uniform distribution and updated using stochastic gradient descent (SGD).
[0106] This embodiment uses Clustering Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and F-score as evaluation metrics. ACC evaluates the accuracy of the predicted classes compared to the true classes. NMI and F-score assess how closely the predicted classes resemble the true classes. ARI measures the agreement between the two label assignments. These are all commonly used metrics for evaluating clustering performance and will not be described in detail here.
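As an illustration of one of these metrics, NMI can be computed from label counts alone. This is a sketch, not the evaluation code used in the experiments: the normalization by the arithmetic mean of the two entropies is an assumption (other normalizations exist), and the label lists are toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(true_labels, pred_labels):
    """Mutual information between two label assignments of equal length."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    return sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )


def nmi(true_labels, pred_labels):
    """Mutual information normalized by the mean of the two entropies."""
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both assignments are a single cluster
    return mutual_information(true_labels, pred_labels) / ((h_true + h_pred) / 2)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # cluster ids are permuted
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on how samples are grouped, it is unaffected by the arbitrary numbering of clusters, which is why it suits the evaluation of unsupervised clustering.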
[0107] Comparative experiments: This embodiment includes 6 comparative experiments, as follows:
[0108] Comparison Method M1: A clustering framework based on the K-Means clustering algorithm, which uses a spatial vector model and a word frequency-inverse text frequency index feature weighting scheme to represent text for clustering. This method serves as the baseline method, and for ease of description later, it is named M1.
[0109] Comparison Method M2: A clustering framework based on the K-Means clustering algorithm, using the BERT model for text representation. For ease of description later, this clustering method is named M2.
[0110] Comparison Method M3: The TAE model optimizes text representation by combining the BoW method with neural sentence embedding, and proposes a self-supervised method that uses the document-topic information distribution as the target. It is named M3.
[0111] Comparison Method M4: SIF-Auto is an algorithm that uses SIF word vectors for text representation, then uses an autoencoder network for feature extraction, and uses the clustering distribution as an auxiliary target distribution for self-supervised clustering. It is named M4.
[0112] Comparison Method M5: BERT_AE_K-Means uses the pre-trained model BERT to extract the semantic representation of the text, extracts features through an autoencoder, and uses clustering targets as auxiliary target distributions for self-supervision. It is named M5.
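The baseline methods M1 and M2 above both end in a K-Means step applied to fixed text vectors. As a non-limiting illustration, a minimal K-Means can be sketched as follows. The deterministic initialisation (evenly spaced input points) and the Euclidean distance are assumptions made so the sketch is reproducible; the source does not describe the initialisation or distance used, and the toy 2-D points merely stand in for TF-IDF or BERT vectors.

```python
def kmeans(points, k, iters=100):
    """Plain K-Means with Euclidean distance and a deterministic start."""
    n = len(points)
    # Assumed initialisation: pick k evenly spaced input points as centres.
    centers = [list(points[i * n // k]) for i in range(k)]
    labels = [-1] * n
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy vectors standing in for document representations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

Because the representation is fixed before clustering starts, nothing learned during clustering can flow back into the text vectors, which is the non-end-to-end limitation the invention addresses.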
[0113] Comparative Experiment Results Analysis: Based on the above comparative experiment design, five comparative experiments were conducted on the constructed medical problem dataset. The clustering effect of each model is shown in Table 2.
[0114] Table 2 Clustering Results of Different Models
[0115]
[0116] The MF-DSC text clustering method outperforms the other models, with NMI, ARI, ACC, and F1 scores of 0.4346, 0.4934, 0.8649, and 0.5737, respectively. Compared with text representation clustering based on the traditional spatial vector model (M1) and the pre-trained model (M2), MF-DSC performs better across all metrics. When processing large-scale texts, spatial vector models increase the dimensionality of the text vectors and neglect semantic relationships between words. Furthermore, the high-dimensional data generated by the pre-trained model BERT fails to adequately represent the semantic information of the text. Specifically, of the pre-trained model M2 (Pre-BERT) and BERT, the former is pre-trained with a knowledge masking strategy, while the latter is not. The pre-trained Pre-BERT significantly improves clustering performance: it enhances text representation capabilities, reduces the difference in feature distribution between the pre-training corpus and the target-domain corpus, and obtains dynamic word vectors that better reflect the semantic context. M3, M4, and M5 are models proposed in recent years that perform well on short text clustering tasks. Compared with comparison method M3, MF-DSC improves on all four metrics (NMI, ARI, ACC, and F1), with an NMI improvement of 3.44 percentage points. Both MF-DSC and M3 use cross-document topic information as a self-supervised objective; MF-DSC further integrates the clustering objective function into the clustering framework to obtain a more user-friendly clustering representation. Compared with comparison methods M4 and M5, MF-DSC shows significant improvements in NMI, ARI, ACC, and F1. M4's SIF takes a weighted average of the word vectors of all words and then removes the common component. Because the question texts are short and their features sparse, this causes more information loss, making M4 the weakest model in the experimental results. M5 uses an autoencoder for feature extraction and dimensionality reduction on top of the pre-trained model, achieving better results than M2's Pre-BERT.
[0117] Ablation Experiment Analysis: To verify the roles of word frequency features, semantic features, and the two auxiliary objective functions in the overall algorithm, this invention conducted an ablation experiment on MF-DSC. Firstly, regarding feature ablation, removing the word frequency features reduced the NMI, ARI, ACC, and F1 scores by 0.0331, 0.0111, 0.0017, and 0.0117, respectively; removing the semantic features reduced the NMI, ARI, ACC, and F1 scores by 0.0146, 0.0212, 0.0079, and 0.0164, respectively. Removing either word frequency or semantic features lowers the clustering scores. Word frequency features primarily affect the similarity between the predicted clusters and the true clusters, while semantic features primarily affect the degree of agreement between the two data distributions. Regarding auxiliary objective ablation, without $L_{\text{topic}}$ the NMI, ARI, ACC, and F1 scores decreased by 0.0393, 0.0168, 0.0031, and 0.0152, respectively; without $L_{\text{cluster}}$ the NMI, ARI, ACC, and F1 scores decreased by 0.0128, 0.0256, 0.0087, and 0.0203, respectively. Removing either $L_{\text{topic}}$ or $L_{\text{cluster}}$ therefore affects the text clustering results. Of the two, $L_{\text{topic}}$, through which the deep network learns cross-document information, has the greater impact on the similarity between the clustered categories and the true categories, as shown in Table 3.
[0118] Table 3 Model Ablation Experiment
[0119]
[0120] Model generalization analysis: The performance improvement of the model on the medical problem dataset has been verified, demonstrating the effectiveness of the dual-objective self-supervised text clustering method based on multi-feature fusion. To further support these conclusions, experiments were conducted on six text datasets to verify the reliability of the method.
[0121] Generalization experiments were conducted on six text datasets: the Twitter US Airline Sentiment dataset, the 20Newsgroups dataset, the online shopping 10cats dataset, the BDCI2018 dataset, the Illness-dataset, and the AGNews dataset. The training set size, test set size, number of classes, and other details of the six datasets are shown in Table 4.
[0122] Twitter US Airline Sentiment: This dataset contains a record of comments about US airlines on Twitter and is often used for sentiment recognition tasks. It includes the tweet ID, the sentiment of the tweet (i.e., three categories: neutral, negative, and positive), the reason for the negative tweet, the airline name, and the tweet text.
[0123] 20Newsgroups: This dataset is a collection of approximately 20,000 English newsgroup documents, with 20 different news categories, each with roughly the same number of documents.
[0124] Online shopping 10cats: This dataset comes from product review data from e-commerce websites and includes 10 categories: books, tablets, mobile phones, fruits, shampoo, water heaters, Mengniu milk, clothing, computers, and hotels.
[0125] BDCI2018: This dataset comes from user reviews on car forums and includes 10 categories: power, price, interior, configuration, safety, appearance, handling, fuel consumption, space, and comfort.
[0126] Illness-dataset: This dataset consists of 22,660 English tweets collected in 2018 and 2019, covering four categories: Alzheimer's disease, Parkinson's disease, cancer, and diabetes.
[0127] AGNews: This dataset was collected by the academic news search engine ComeToMyHead, with over 2,000 news data sources. The dataset contains 120,000 training samples and 7,600 test samples. There are four labels, but their specific meanings are not provided. The text includes news titles and body text; this invention only uses the news titles.
[0128] Table 4 shows the details of the training set size, test set size, number of categories, average text length, maximum length, minimum length, and language for each dataset.
[0129] Table 4. Information on each text dataset
[0130]
[0131] Table 5 shows the normalized mutual information (NMI), adjusted Rand index (ARI), clustering accuracy (ACC), and F1 score of the three comparison methods and the text clustering method MF-DSC of this invention on the six public datasets. As can be seen from Table 5, except on the Illness-dataset, where MF-DSC is slightly inferior to the baseline model in terms of ACC, the MF-DSC text clustering method outperforms the other models on the other datasets.
[0132] Table 5. Model generalization experiment results
[0133]
[0134] For example, the NMI values of the different models on the different datasets are shown in Figure 6. Among the four models, MF-DSC performed best on all six datasets. On the Online Shopping 10cats dataset it achieved its highest NMI, 0.4319, followed by 20Newsgroups and BDCI2018, and then AGNews. The NMI was lower on Twitter US Airline Sentiment and Illness-dataset. Online Shopping 10cats is a Chinese product review dataset, and MF-DSC showed good generalization performance on this dataset. 20Newsgroups consists of 20 categories of long English text, and BDCI2018 consists of 10 categories of short Chinese text; their NMI performance is roughly the same. The lower NMI on Twitter US Airline Sentiment, Illness-dataset, and AGNews may be related to the relatively small number of categories in these datasets, which leads to unclear category boundaries and to a document potentially belonging to different categories.
[0135] Based on comparative experiments, the multi-feature fusion dual-objective self-supervised text clustering model MF-DSC integrates word frequency features, lexical semantic features, and cross-document features to obtain deep semantics through deep learning. An end-to-end clustering model jointly optimized by cross-document topic objectives and clustering objectives is constructed to obtain a more user-friendly clustering representation. It has a significant advantage in clustering performance. At the same time, generalization experiments were conducted on six datasets, and the model has strong generalization ability.
[0136] In summary, the multi-feature fusion dual-objective self-supervised question text clustering model MF-DSC of this invention uses deep learning to obtain deep semantic relationships and introduces cross-document topic information and clustering objectives to construct an end-to-end question text clustering method. The effectiveness of the model is validated using a medical question text dataset, outperforming the baseline model and four other comparative models in terms of NMI, ARI, ACC, and F-score. Ablation experiments are used to verify the roles of word frequency features, semantic features, and the two auxiliary objective functions in the overall algorithm. Experiments on six public datasets are conducted to verify the model's generalization ability.
[0137] In one embodiment of the present invention, a dual-objective self-supervised medical problem text clustering system with multi-feature fusion is provided, comprising:
[0138] The first processing module extracts word frequency information and lexical semantic information from the medical problem text data source, and fuses the word frequency information and lexical semantic information to obtain a weighted word vector.
[0139] The second processing module takes the weighted word vectors as input to the BiGRU model, performs deep learning and introduces a self-attention mechanism to calculate the weight of each word in the medical question text and obtain the semantic relationship between words.
[0140] The third processing module constructs a cross-document topic information objective function based on the semantic relationships between words, and integrates the clustering objective function into the network loss function of the learned representation. It constructs a joint loss function with the cross-document topic information objective function, and integrates the text representation and clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering.
[0141] In the above embodiments, the weighted word vector is obtained by fusing word frequency information and lexical semantic information, including:
[0142] The lexical semantic weights of the fast text classifier fastText are weighted by the term frequency-inverse text frequency index (TF-IDF) to reduce the influence of non-keyword features, improve topic discrimination ability, and obtain the weight features of words in medical question texts.
[0143] The weighted word vectors are obtained by multiplying the semantic features of the words by the corresponding TF-IDF values.
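The fusion step described above can be sketched, purely for illustration, as scaling each word's embedding by that word's TF-IDF weight in the document. The sketch assumes the unsmoothed form tf * log(N / df), so a word occurring in every document receives weight zero; the source does not state which TF-IDF variant is used. The two-dimensional `embeddings` table is a hypothetical stand-in for fastText vectors.

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights for already tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

def weighted_vectors(doc, doc_weights, embeddings):
    """Scale each word's embedding by its TF-IDF weight in this document."""
    return [[doc_weights[t] * x for x in embeddings[t]] for t in doc]

docs = [
    ["what", "causes", "diabetes"],
    ["what", "treats", "diabetes"],
    ["what", "causes", "migraine"],
]
# Hypothetical toy embeddings; real fastText vectors have far more dimensions.
embeddings = {
    "what": [0.5, -0.2], "causes": [0.1, 0.9], "treats": [0.3, 0.7],
    "diabetes": [-0.8, 0.4], "migraine": [-0.6, -0.5],
}
weights = tfidf(docs)
vectors = weighted_vectors(docs[0], weights[0], embeddings)
```

Here the question word "what" appears in every document and is scaled to zero, which illustrates how the weighting reduces the influence of non-keyword features.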
[0144] In the above embodiments, before extracting word frequency information and lexical semantic information from the medical problem text data source, a step of processing the medical problem text data is also included:
[0145] The medical problem text data was cleaned to remove meaningless noise data, and the problems of misuse and incorrect use of punctuation marks were corrected.
[0146] After data cleaning, the medical question text data is segmented into words, and then categorized.
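As a non-limiting illustration of the cleaning and segmentation steps above, the following sketch removes some typical noise, collapses repeated punctuation, and tokenizes the result. What counts as "meaningless noise" is an assumption here (HTML tags and URLs), and the regular-expression tokenizer is only a stand-in: Chinese medical question text would need a dedicated word segmenter, which the source does not name. The categorization step is not shown because the source does not describe it.

```python
import re

_NOISE = re.compile(r"<[^>]+>|https?://\S+")   # assumed noise: HTML tags and URLs
_PUNCT_RUN = re.compile(r"([?!.,;:])\1+")      # the same punctuation mark repeated

def clean(text):
    """Remove assumed noise and normalise punctuation and whitespace."""
    text = _NOISE.sub(" ", text)
    text = _PUNCT_RUN.sub(r"\1", text)   # "???" becomes "?"
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def segment(text):
    """Stand-in tokenizer: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

raw = "What causes diabetes???  <br> see http://example.com now!!"
cleaned = clean(raw)
tokens = segment(cleaned)
```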
[0147] In the above embodiments, weighted word vectors are used as input to the BiGRU model for deep learning, and a self-attention mechanism is introduced, including:
[0148] BiGRU is used as the nonlinear function f1 to obtain information about long-term dependent neurons. Each GRU unit is equipped with two sigmoid gates: a reset gate and an update gate. The dependency relationship between adjacent words in a single question text is obtained through a bidirectional structure.
[0149] By introducing a self-attention mechanism on top of BiGRU, different aspects of a sentence can be extracted into multiple vector representations; more information can be obtained without any additional input, and the hidden representations at all times can be directly accessed.
[0150] Among these, the various mechanisms of self-attention are:
[0151]
[0152]
[0153] In the formula, W1 and W2 are linear layers with learnable parameters, tanh is used as an activation layer to introduce a nonlinear transformation, T is the transpose of the matrix, H is the output feature of the GRU, and Z represents the output feature of the linear layer; A is the self-attention matrix, exp represents the exponential function, and each value in A represents the correlation between the feature value and other feature values; by multiplying by the attention matrix A and the input H, the intermediate features enhanced by the self-attention mechanism are obtained. The output features are then calculated through a linear projection layer.
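The formulas themselves are not reproduced in this text, so the following is only a sketch of one form consistent with the symbol definitions above: Z = W2 tanh(W1 H^T), A obtained by applying an exponential normalisation (softmax) to each row of Z, and the enhanced features given by the product A H. That exact form, the matrix shapes, and all function names are assumptions; the final linear projection layer is omitted.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H, W1, W2):
    """H: n x d GRU outputs; W1: d_a x d; W2: r x d_a (assumed shapes)."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W1, transpose(H))]
    Z = matmul(W2, hidden)             # r x n scores, one row per attention "aspect"
    A = [softmax(row) for row in Z]    # each row is a distribution over the n words
    M = matmul(A, H)                   # r x d attention-enhanced features
    return A, M

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 words, 2 features each
W1 = [[0.1, -0.2], [0.3, 0.4]]
W2 = [[0.5, -0.5], [0.2, 0.1]]             # r = 2 aspects
A, M = self_attention(H, W1, W2)
```

Each row of `A` weights all positions of `H` at once, which is what lets the mechanism access the hidden representations at all time steps directly.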
[0154] In the above embodiments, the objective function for cross-document topic information is:
[0155]
[0156]
[0157] In the formula, $L_{\text{topic}}$ is the cross-document topic information objective function; KL is the KL divergence; $d_{ij} \in D$ represents the probability that sample i belongs to topic j; $f_{ij} \in F$ represents the probability that sample i belongs to topic j, where F is the document-topic distribution of the text calculated by the topic model LDA, and the subscripts i and j represent sample i and topic j, respectively; D is a probability distribution, $D \in \mathbb{R}^{1 \times r}$, where r is the length of the document-topic distribution vector; ReLU is the rectified linear unit activation function; and W1 and W2 are linear layers with learnable parameters.
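Because the formulas are not reproduced in this text, the following sketch shows only one plausible reading of the definitions above: the predicted distribution D is taken as softmax(W2 ReLU(W1 h)) for a text representation vector h, and the objective as the KL divergence from the LDA distribution F to D, averaged over samples. The use of softmax, the direction of the divergence, the averaging, and every name in the code are assumptions.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def decode_topics(h, W1, W2):
    """Assumed decoder: two linear layers with ReLU, then softmax over r topics."""
    return softmax(matvec(W2, relu(matvec(W1, h))))

def kl_divergence(f, d, eps=1e-12):
    """KL(f || d) for two discrete distributions of equal length."""
    return sum(fj * math.log((fj + eps) / (dj + eps)) for fj, dj in zip(f, d) if fj > 0)

def topic_loss(F, D):
    """Mean KL divergence between LDA targets F and predictions D."""
    return sum(kl_divergence(f, d) for f, d in zip(F, D)) / len(F)

# Toy example: one text vector, r = 3 topics.
h = [0.5, -1.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = decode_topics(h, W1, W2)
loss = topic_loss([[0.6, 0.1, 0.3]], [d])
```

The LDA distribution acts as a fixed target computed across the whole corpus, so minimising this term pushes cross-document topic information into the learned text representation.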
[0158] In the above embodiments, the joint loss function is:
[0159] $L_{\text{train}} = \alpha L_{\text{cluster}} + L_{\text{topic}}$
[0160] where $L_{\text{train}}$ is the joint loss function, $L_{\text{cluster}}$ is the clustering objective function, $L_{\text{topic}}$ is the cross-document topic information objective function, and $\alpha$ is a hyperparameter.
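As a non-limiting illustration of how the joint loss combines the two objectives, the sketch below computes a clustering term and adds it to a topic term with weight α. This passage does not define the clustering objective, so the sketch assumes a common self-supervised form: soft assignments from a Student's t kernel, a sharpened target distribution derived from them, and the KL divergence between the two. That form, the value of `alpha`, the placeholder topic loss, and all names are assumptions.

```python
import math

def soft_assign(z, centers):
    """Soft assignment of embedding z to each cluster centre (Student's t kernel)."""
    q = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, c))) for c in centers]
    total = sum(q)
    return [x / total for x in q]

def target_distribution(Q):
    """Sharpened targets: p_ij proportional to q_ij^2 / (soft size of cluster j)."""
    freq = [sum(col) for col in zip(*Q)]
    P = []
    for q in Q:
        w = [qj ** 2 / fj for qj, fj in zip(q, freq)]
        total = sum(w)
        P.append([x / total for x in w])
    return P

def kl(P, Q):
    """Sum over samples of KL(p_i || q_i)."""
    return sum(p * math.log(p / q) for pr, qr in zip(P, Q) for p, q in zip(pr, qr) if p > 0)

def joint_loss(l_cluster, l_topic, alpha):
    """L_train = alpha * L_cluster + L_topic, as given in the text."""
    return alpha * l_cluster + l_topic

Z = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]]   # toy text representation vectors
centers = [[0.0, 0.0], [3.0, 3.0]]         # toy cluster centres
Q = [soft_assign(z, centers) for z in Z]
P = target_distribution(Q)
l_cluster = kl(P, Q)
l_train = joint_loss(l_cluster, l_topic=0.25, alpha=0.1)  # placeholder values
```

Because both terms are functions of the same text representation, a single backward pass updates the encoder for clustering and for topic prediction together, which is what makes the framework end-to-end.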
[0161] The system provided in this embodiment is used to execute the above-described method embodiments. For specific processes and details, please refer to the above embodiments, which will not be repeated here.
[0162] In one embodiment of the present invention, a computing device structure is provided. This computing device can be a terminal, which may include: a processor, a communication interface, memory, a display screen, and an input device. The processor, communication interface, and memory communicate with each other via a communication bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and a computer program. When executed by the processor, the computer program implements a multi-feature fusion-based dual-objective self-supervised medical problem text clustering method. The internal memory provides an environment for the operation of the operating system and computer program in the non-volatile storage medium. The communication interface is used for wired or wireless communication with external terminals. Wireless communication can be achieved through Wi-Fi, a management network, NFC (Near Field Communication), or other technologies. The display screen can be a liquid crystal display or an e-ink display. The input device can be a touch layer covering the display screen, or buttons, a trackball, or a touchpad mounted on the casing of the computing device, or an external keyboard, touchpad, or mouse. The processor can call logical instructions stored in the memory.
[0163] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0164] In one embodiment of the present invention, a computer program product is provided, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, and when the program instructions are executed by a computer, the computer is able to perform the methods provided in the above-described method embodiments.
[0165] In one embodiment of the present invention, a non-transitory computer-readable storage medium is provided, which stores server instructions that cause a computer to perform the methods provided in the above embodiments.
[0166] The computer-readable storage medium provided in the above embodiments has a similar implementation principle and technical effect to the above method embodiments, and will not be described again here.
[0167] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0168] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0169] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0170] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multi-feature fusion dual-objective self-supervised medical problem text clustering method, characterized in that it comprises: extracting word frequency information and semantic information from a medical question text data source, and fusing the word frequency information and semantic information to obtain weighted word vectors, which includes: using the word frequency-inverse text frequency index (TF-IDF) to weight the semantic information of the fast text classifier fastText, reducing the influence of non-keyword features, improving topic discrimination ability, and obtaining the weight features of words in medical question texts; and multiplying the semantic features of words by the corresponding TF-IDF values to obtain the weighted word vectors; using the weighted word vectors as input to the BiGRU model, performing deep learning and introducing a self-attention mechanism to calculate the weight of each word in the medical question text, obtain the semantic relationships between words, and output the text representation vector of each medical question text, clustering being performed based on the text representation vectors; and, based on the text representation vectors, constructing a cross-document topic information objective function and a clustering objective function, integrating the clustering objective function into the network loss function of the learned representation, constructing a joint loss function together with the cross-document topic information objective function, and integrating the text representation and the clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering;
The objective function for cross-document topic information is: In the formula, $L_{\text{topic}}$ is the cross-document topic information objective function; KL is the KL divergence; $d_{ij} \in D$ indicates the probability that sample i belongs to topic j; $f_{ij} \in F$ represents the probability that sample i belongs to topic j, where F is the document-topic distribution of the text calculated by the topic model LDA, and the subscripts i and j represent sample i and topic j, respectively; D is the probability distribution calculated by the decoder, $D \in \mathbb{R}^{1 \times r}$, where r is the length of the document-topic distribution vector; ReLU is the rectified linear unit activation function; W1 and W2 are linear layers with learnable parameters; and the text is represented as a vector. The joint loss function is: $L_{\text{train}} = \alpha L_{\text{cluster}} + L_{\text{topic}}$, where $L_{\text{train}}$ is the joint loss function, $L_{\text{cluster}}$ is the clustering objective function, $L_{\text{topic}}$ is the cross-document topic information objective function, and $\alpha$ is a hyperparameter.
2. The multi-feature fusion dual-objective self-supervised medical problem text clustering method as described in claim 1, characterized in that, before extracting word frequency and semantic information from the medical question text data source, the method further includes a step of processing the medical question text data: cleaning the medical problem text data to remove meaningless noise data and to correct misused and incorrect punctuation marks; and, after data cleaning, segmenting the medical question text data into words and then categorizing it.
3. The multi-feature fusion dual-objective self-supervised medical problem text clustering method as described in claim 1, characterized in that using the weighted word vectors as input to the BiGRU model for deep learning and introducing a self-attention mechanism includes: using BiGRU as the nonlinear function f1 to obtain information about long-term dependent neurons, each GRU unit being equipped with two sigmoid gates, a reset gate and an update gate, and the dependencies between adjacent words in a single question text being obtained through a bidirectional structure; and introducing a self-attention mechanism on top of BiGRU, so that different aspects of a sentence can be extracted into multiple vector representations, more information can be obtained without any additional input, and the hidden representations at all times can be directly accessed.
4. The multi-feature fusion dual-objective self-supervised medical problem text clustering method as described in claim 3, characterized in that the multifaceted mechanisms of self-attention are: In the formula, W1 and W2 are linear layers with learnable parameters; tanh is used as the activation layer to introduce a nonlinear transformation; T is the transpose of the matrix; H is the output feature of the GRU, and Z represents the output feature of the linear layer; A is the self-attention matrix, and exp represents the exponential function; each value in A represents the correlation between the feature value and other feature values; the intermediate features enhanced by the self-attention mechanism are obtained by multiplying the attention matrix A by the input H; and the output features are then calculated through a linear projection layer.
5. A multi-feature fusion dual-objective self-supervised medical problem text clustering system, used to implement the multi-feature fusion dual-objective self-supervised medical problem text clustering method as described in any one of claims 1 to 4, characterized in that it comprises: a first processing module, which extracts word frequency information and lexical semantic information from the medical problem text data source, and fuses the word frequency information and lexical semantic information to obtain a weighted word vector; a second processing module, which takes the weighted word vectors as input to the BiGRU model, performs deep learning and introduces a self-attention mechanism to calculate the weight of each word in the medical question text and obtain the semantic relationship between words; and a third processing module, which constructs a cross-document topic information objective function based on the semantic relationships between words, integrates the clustering objective function into the network loss function of the learned representation, constructs a joint loss function together with the cross-document topic information objective function, and integrates the text representation and clustering results into a unified framework to achieve end-to-end dual-objective self-supervised text clustering.
6. A computer-readable storage medium for storing one or more programs, characterized in that the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform any of the methods described in claims 1 to 4.
7. A computing device, characterized in that it comprises: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described in claims 1 to 4.