Text disease classification system based on semantic label feedback and symptom association penalty mechanism

CN122240835A · Pending · Publication Date: 2026-06-19 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SOUTH CHINA UNIV OF TECH
Filing Date: 2026-03-03
Publication Date: 2026-06-19

IPC: G06F16/353; G06F40/30; G16H10/60; G06N3/045; G06N3/0442

AI Technical Summary

⚠Technical Problem

Existing technologies for disease identification in traditional Chinese medicine texts suffer from insufficient utilization of semantic information in disease labels, lack of medical association constraints between diseases and symptoms, and unbalanced distribution of data categories, resulting in insufficient identification accuracy and medical rationality.

⚗Method used

A text-based disease classification system based on semantic label feedback and symptom association penalty mechanism is adopted. Multi-layer symptom semantic features are extracted through a pre-trained language model, joint semantic features are constructed, and a disease-symptom association matrix is introduced for penalty constraints to achieve two-stage optimization training.

🎯Benefits of technology

The system improves the model's ability to understand semantic differences and co-occurrence relationships among diseases, enhances the medical rationality of disease identification results and the ability to identify low-frequency disease categories, and significantly improves identification accuracy and stability.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism. The system includes a first classification module and a second classification module. The first classification module incorporates a pre-trained language model to extract multi-layered symptom semantic features from medical text data and outputs candidate disease labels based on these features. The second classification module constructs joint semantic features based on the multi-layered symptom semantic features and the candidate disease labels, and outputs the disease category distribution probability based on these joint semantic features using a disease recognition model. This invention effectively enhances the semantic correlation between symptom text and disease labels, alleviating the label imbalance problem while improving the accuracy and medical interpretability of text-based disease recognition.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and more specifically to a text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism.

Background Technology

[0002] Text-based disease classification and recognition is a key foundational task in intelligent medical text analysis. Its goal is to automatically identify and determine the disease categories involved in unstructured medical text, with significant application value in clinical auxiliary diagnosis, medical knowledge graph construction, and intelligent consultation systems. Especially in the field of Traditional Chinese Medicine, disease category recognition tasks have even greater complexity and practical significance.

[0003] Currently, TCM texts come from a wide range of sources, including electronic medical records, medical case records, ancient books and online consultation texts. Their language forms combine the characteristics of modern Chinese and classical Chinese, and generally have features such as obscure expression, abstract semantics, and discrete symptom descriptions. The same disease is often expressed through a combination of multiple symptoms, and there is no one-to-one correspondence between symptoms and diseases, making it difficult to identify diseases in TCM texts through simple keyword matching or rule-based methods.

[0004] In current technologies, text-based disease identification methods mainly rely on deep learning-based text classification models, such as convolutional neural networks, recurrent neural networks, and attention-based Transformer models. These models typically encode the entire text into a semantic vector and output disease classification prediction results directly through a linear classifier.

[0005] While the above methods have achieved some success in general medical texts, they still have significant shortcomings in the context of traditional Chinese medicine (TCM) texts. On the one hand, when faced with complex texts with incomplete symptom descriptions, vague expressions, or multiple co-occurring diseases, the model mainly relies on semantic similarity for discrimination, which can easily produce predictions that are semantically acceptable but logically unreasonable. On the other hand, TCM disease data usually exhibits a significant long-tail distribution, and the traditional cross-entropy loss function tends to favor high-frequency disease categories during training, further weakening the model's ability to identify low-frequency diseases.

[0006] Furthermore, existing methods mostly employ single-stage optimization strategies, meaning they train the model using only a single loss function without differentiating the learning focus of the model at different training stages. They also generally treat disease labels as independent discrete categories, lacking modeling of the semantic information of the labels and their inherent medical relationships. Even though some studies introduce class weights or asymmetric loss functions to alleviate the sample imbalance problem, their improvements mainly remain at the level of adjusting weights based on sample size.

[0007] Therefore, overcoming the above-mentioned shortcomings remains a key technical problem that urgently needs to be solved in the current process of disease category identification in traditional Chinese medicine texts.

Summary of the Invention

[0008] In view of the above problems, the present invention provides a text-based disease classification system based on semantic label feedback and symptom association penalty mechanism, which aims to overcome the problems of insufficient utilization of semantic information of disease labels, lack of medical association constraints between diseases and symptoms, and unbalanced distribution of data categories in the prior art, which lead to insufficient accuracy and medical rationality of text-based disease identification.

[0009] To achieve the above objectives, the present invention adopts the following technical solution:

[0010] In a first aspect, embodiments of the present invention provide a text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism, comprising: a first classification module, which has a built-in pre-trained language model and is used to extract multi-layer symptom semantic features from medical text data and output candidate disease labels based on the multi-layer symptom semantic features; and a second classification module, which is used to construct joint semantic features based on the multi-layer symptom semantic features and the candidate disease labels, and to output the disease category distribution probability through a disease identification model according to the joint semantic features.

[0011] Preferably, the first classification module includes a data preprocessing unit for acquiring medical text data and performing preprocessing, the preprocessing including: segmenting and encoding the medical text data to obtain a standard text input sequence; performing synonym rewriting or expression replacement on the standard text input sequence using a generative large language model to obtain enhanced text; calculating the cosine similarity between the standard text input sequence and the enhanced text; and, when the cosine similarity falls within a preset threshold range, retaining the enhanced text corresponding to that cosine similarity.
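The similarity-based screening step described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the character-count `embed` function is a stand-in for whatever embedding layer is actually used, and the bounds `low` and `high` are hypothetical values for the "preset threshold range". Reading the range as having an upper bound (to discard near-duplicates) is also an assumption.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: character counts (an assumption, not the patent's embedding layer)."""
    return Counter(text.replace(" ", ""))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_augmented(original, candidates, low=0.80, high=0.98):
    """Keep rewritten candidates whose similarity to the original lies in [low, high].

    low/high are hypothetical thresholds chosen only for illustration.
    """
    reference = embed(original)
    kept = []
    for candidate in candidates:
        similarity = cosine_similarity(reference, embed(candidate))
        if low <= similarity <= high:
            kept.append((candidate, similarity))
    return kept

# Toy strings stand in for symptom texts and their LLM rewrites.
kept = filter_augmented("aabbccdd", ["aabbccdd", "aabbccde", "wxyz"])
```

In this toy run the identical string is dropped as a duplicate, the unrelated string is dropped as semantically inconsistent, and only the close variant is retained.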

[0012] Preferably, in the first classification module, the pre-trained language model uses a weighted cross-entropy loss function as the first-stage classification loss function; the category loss weights are calculated using the following formula:

[0013] In the formula, $w_i$ denotes the first-stage loss weight corresponding to the $i$-th disease category; $i$ is the disease category index in the first stage of prediction; $C$ is the total number of disease categories; and $n_i$ is the number of samples corresponding to the $i$-th disease category.
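For illustration only, the sketch below computes per-category loss weights from the per-category sample counts, assuming an inverse-frequency form that is rescaled so the weights sum to the number of categories. This is a common choice for long-tailed data and is not necessarily the formula of this application; the function name is hypothetical.

```python
def inverse_frequency_weights(sample_counts):
    """Return one weight per category, inversely proportional to its sample count.

    Assumed form (not taken from the source): w_i = C * (1 / n_i) / sum_k (1 / n_k),
    so that the weights sum to C and the overall loss scale stays stable.
    """
    if any(n <= 0 for n in sample_counts):
        raise ValueError("every category needs at least one sample")
    num_categories = len(sample_counts)
    inverse = [1.0 / n for n in sample_counts]
    total_inverse = sum(inverse)
    return [num_categories * v / total_inverse for v in inverse]

# Two frequent categories and one rarer category.
weights = inverse_frequency_weights([100, 100, 50])
```

Under this assumed form the rarest category receives the largest weight, while the normalization keeps the average weight at 1.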

[0014] Preferably, constructing the joint semantic features of the multi-layer symptom semantic features and the candidate disease labels includes: obtaining the semantic embedding vector of each candidate disease label through a semantic label encoder, and projecting the label semantic embedding vectors and the multi-layer symptom semantic features into the same semantic space; constructing a bidirectional cross-attention structure, in which the multi-layer symptom semantic features serve as query vectors and the label semantic embedding vectors serve as keys and values to obtain label-enhanced features, and the label semantic embedding vectors serve as query vectors and the multi-layer symptom semantic features serve as keys and values to obtain text-enhanced features; and fusing the label-enhanced features and the text-enhanced features using learnable weights to obtain the joint semantic features.

[0015] Preferably, the semantic label encoder keeps its parameters frozen during training to ensure the stability of the label semantic embedding vector.

[0016] Preferably, a disease-symptom association penalty mechanism is introduced during the training of the second classification module, including: constructing a disease-symptom association matrix; calculating the KL divergence between the disease category distribution probability output by the disease identification model and the disease distribution corresponding to the symptoms in the disease-symptom association matrix; and using the KL divergence as the disease-symptom association penalty term.

[0017] Preferably, the disease-symptom association matrix is constructed based on medical knowledge graphs, electronic medical records, clinical statistics, and external medical knowledge statistics, wherein the matrix rows correspond to disease categories and the columns correspond to symptom elements.

[0018] Preferably, the disease-symptom association penalty term is represented as follows:

[0019]

[0020] In the formula, $L_{KL}$ denotes the KL divergence penalty term; $C$ is the total number of disease categories; $i$ is the disease category index predicted by the second classification module; $q_i$ is the prior probability that the medical symptom text corresponds to the $i$-th disease category; $p_i$ is the probability, predicted by the second classification module, that the medical text belongs to the $i$-th disease category; $\varepsilon$ is a smoothing constant that prevents the denominator from being zero; $j$ is the symptom category index; $M$ is the total number of symptoms; $A_{ij}$ is the association strength between the $i$-th disease category and the $j$-th symptom; and $s_j$ indicates whether the $j$-th symptom appears in the current sample.

[0021] In a second aspect, embodiments of the present invention provide a text-based disease classification method based on semantic label feedback and a symptom association penalty mechanism. This method applies the text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism as described in any of the preceding claims, and includes the following steps: in the first stage, multi-layer symptom semantic features are extracted from medical text data through a pre-trained language model, and candidate disease labels are output based on the multi-layer symptom semantic features; in the second stage, joint semantic features are constructed based on the multi-layer symptom semantic features and the candidate disease labels, and the disease category distribution probability is output through the disease identification model according to the joint semantic features.

[0022] This invention provides a text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism. Compared with existing technologies, its advantages include: (1) By introducing a semantic label feedback mechanism, this invention extends disease labels from discrete categories to learnable semantic representations, realizes joint modeling of symptom text semantics and disease label semantics, and effectively improves the model's ability to understand semantic differences and co-occurrence relationships between diseases. (2) By using a penalty constraint mechanism based on the medical association between diseases and symptoms, medical prior knowledge is explicitly introduced into the model optimization process, which effectively reduces the number of prediction results that are semantically reasonable but medically inconsistent, and improves the medical rationality of disease identification results. (3) The two-stage disease identification and two-stage optimization training strategy proposed in this invention can alleviate the adverse effects of the long-tail distribution of TCM text disease data while ensuring the overall classification accuracy, and significantly improve the model's ability to identify low-frequency disease categories and its training stability.

Description of the Drawings

[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0024] Figure 1 is a diagram of the pre-trained backbone model architecture provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the overall structure of the text disease classification model provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the semantic label feedback module provided in an embodiment of the present invention;

[0025] Figure 4 is a schematic diagram illustrating the construction process of the disease-symptom association penalty matrix provided in an embodiment of the present invention; Figure 5 is a schematic diagram of the disease-symptom association penalty mechanism and two-stage optimization training provided in an embodiment of the present invention.

Detailed Implementation

[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0027] Existing technologies fail to effectively constrain and guide model predictions with objectively existing medical knowledge about the association between diseases and symptoms, and lack consistency constraints between the model's predicted distribution and actual disease-symptom matching relationships. To address these shortcomings, this invention discloses a text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism, comprising: a first classification module, which has a built-in pre-trained language model and is used to extract multi-layer symptom semantic features from medical text data and output candidate disease labels based on the multi-layer symptom semantic features; and a second classification module, which is used to construct joint semantic features based on the multi-layer symptom semantic features and the candidate disease labels, and to output the disease category distribution probability through a disease identification model according to the joint semantic features.

[0028] The following is a description through specific embodiments.

[0029] In one embodiment, a medical text disease recognition dataset is first constructed. This dataset originates from public medical text datasets and real TCM medical case data from partner institutions, comprising four sub-datasets, each with approximately 40,000 samples and about 120 disease categories. Each sample corresponds to a symptom description text and is uniquely labeled with a disease category, making it a single-label text disease recognition task. The average length of the symptom text is approximately 128 Chinese characters.

[0030] Then, as shown in Figure 1, the original TCM text is cleaned, segmented, and standardized using a text segmentation and encoding preprocessing module. For disease categories with insufficient sample sizes, a generative post-screening data augmentation strategy based on semantic consistency is introduced to expand the number of low-frequency disease samples, thereby alleviating the uneven disease category distribution in the dataset and obtaining standardized symptom text vector representations. Some implementation schemes specifically include: cleaning the original symptom text by removing irrelevant symbols and excessive stop words, and then using a tokenizer based on a pre-trained language model to perform sub-word-level segmentation and encoding to obtain standardized text input sequence representations. To alleviate insufficient sample sizes and uneven category distribution for some disease categories, this embodiment further uses a generative large language model interface to perform synonym rewriting or symptom description replacement on the original symptom text to generate candidate augmented text. Subsequently, independent word vector embedding layers are used to vectorize the original text and the augmented text respectively, and the cosine similarity between the two is calculated. The augmented text is retained as a valid augmented sample only when the similarity falls within a preset threshold range, thereby expanding the number of low-frequency disease category samples while ensuring semantic consistency.

[0031] In one embodiment, the first classification module uses a pre-trained language model as the backbone network to extract multi-layer symptom semantic features from medical text data and output candidate disease labels based on the multi-layer symptom semantic features.

[0032] In this embodiment, the pre-trained language model includes a multi-layer semantic encoder for extracting multi-level semantic features. The semantic encoder consists of multiple stacked sequence modeling units, which perform contextual modeling of the symptom text and extract symptom text features at different semantic levels. Specifically, the symptom text vector representation is input into the pre-trained language model, and the semantic features of its last three hidden layers are extracted. These multi-layer features are then fused using a multi-layer semantic feature fusion module to obtain a symptom text semantic feature representation containing different levels of semantic abstraction. The fused symptom text semantic features are then passed through a fully connected classification layer to output prediction scores for all disease categories, which are normalized using a Sigmoid function to obtain the disease confidence distribution. Based on the confidence ranking, the five disease labels with the highest scores (top-5) are selected as the candidate disease set. This provides a coarse-grained screening of possible disease categories for the second-stage disease discrimination, reducing the complexity of the subsequent discrimination space while ensuring recall.
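The candidate selection at the end of the first stage can be sketched as below, assuming the classification layer has already produced one raw score per disease category. The function names and the placeholder label names are hypothetical; only the Sigmoid normalization and the top-5 selection come from the description above.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def select_candidates(scores, label_names, k=5):
    """Turn raw scores into confidences and return the k most confident labels."""
    confidences = [sigmoid(s) for s in scores]
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return [(label_names[i], confidences[i]) for i in ranked[:k]]

# Seven hypothetical disease categories with made-up raw scores.
names = ["d0", "d1", "d2", "d3", "d4", "d5", "d6"]
candidates = select_candidates([2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.0], names)
```

Because the Sigmoid is applied to each score independently, the confidences do not sum to one; the ranking alone determines the candidate set.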

[0033] Furthermore, in a preferred embodiment, to mitigate the impact of imbalanced disease category distribution on model training, the pre-trained language model in the first classification module employs a weighted binary cross-entropy loss function as the first-stage classification loss function. This enhances the model's ability to identify low-frequency disease categories and to recall the most likely disease categories. In addition, the category weights $w_i$ are normalized before the loss is calculated to keep the overall loss scale stable, thereby avoiding adverse effects on the model parameter update process.

[0034] In this implementation plan, the classification loss function for the first stage is as follows:

[0035]

[0036] In the formula, $C$ is the total number of disease categories; $i$ is the disease category index in the first stage of prediction; $w_i$ is the first-stage loss weight corresponding to the $i$-th disease category; $y_i$ is the true label of the sample for the $i$-th disease category; $p_i$ is the probability, predicted by the pre-trained language model, that the medical text belongs to the $i$-th disease category; and $n_i$ is the number of samples corresponding to the $i$-th disease category.
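A weighted binary cross-entropy of the kind named in [0033] is commonly written as shown below. This is a sketch under the assumption of the standard per-sample form, not a reproduction of the formula in the application: $L_1$ is an illustrative name for the first-stage loss, $C$ is the number of disease categories, $w_i$ is the weight of category $i$, $y_i \in \{0, 1\}$ is the true label, and $p_i$ is the Sigmoid-normalized prediction. Any averaging over a batch is omitted.

```latex
L_1 = -\sum_{i=1}^{C} w_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```

With a single-label sample, only one $y_i$ equals 1, so the first term rewards confidence in the true category while the second term penalizes confidence in every other category, each scaled by its category weight.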

[0037] In another embodiment, the second classification module is used to construct the joint semantic features of the multi-layer symptom semantic features and the candidate disease labels, and output the disease category distribution probability through the disease identification module based on the joint semantic features.

[0038] In this embodiment, as shown in Figure 2, a semantic label feedback module is introduced to explicitly model the association between the semantics of the symptom text and the semantics of the disease labels. Specifically, the semantic embedding vectors corresponding to the candidate disease labels are used as feedback information and mapped to a shared semantic space along with the semantic features of the symptom text. A bidirectional cross-attention mechanism then strengthens the interaction between symptom text semantics and disease label semantics and generates a joint semantic representation for disease discrimination. By modeling the consistency between symptom text semantics and disease label semantics at a fine granularity, this embodiment can screen candidate diseases accurately and output the final disease identification, thereby improving the model's discrimination ability in scenarios with multiple co-occurring diseases and semantically ambiguous text.

[0039] In some implementations, the candidate disease label names generated in the first stage are input into a separate text encoder to obtain the corresponding disease label semantic embedding vectors, with an embedding dimension of 768. The text encoder maintains parameter freeze during training to ensure the stability of the disease label semantic representation. Subsequently, two learnable linear mapping matrices are used to project the multi-layer semantic features of the symptom text and the disease label semantic embeddings onto the same shared semantic space. The projected symptom text features and label embedding features are then normalized to ensure consistent numerical scales across different feature types.

[0040] Within this shared semantic space, the semantic interaction enhancement module explicitly models the semantic association between text content and disease labels. As shown in Figure 3, this includes the following. On the one hand, the symptom text semantic features are used as query vectors $Q$, and the disease label semantics are used as key vectors $K$ and value vectors $V$; multi-head attention is used to calculate the semantic attention weights of text to labels, thereby generating label-enhanced features;

[0041] On the other hand, the disease label semantics are used as query vectors $Q$, and the symptom text semantic features are used as key vectors $K$ and value vectors $V$; multi-head attention is used to calculate the semantic attention weights of labels to text, thereby generating text-enhanced features;

[0042] Through the aforementioned bidirectional cross-attention interaction, enhanced symptom text features and enhanced disease label features are obtained respectively. Before they are fused using learnable weights, the text-enhanced features and the label-enhanced features are each combined with their corresponding original features through a residual connection followed by normalization:

[0043]

[0044] Then, the text-enhanced features and the label-enhanced features are integrated into a joint semantic representation through learnable weighted fusion: the residual-connected text-enhanced features and label-enhanced features are pooled to a uniform dimension and concatenated along the feature dimension, and the result is input into a learnable gating network that computes the fusion weights used to generate the joint semantic representation.
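The bidirectional cross-attention and gated fusion described above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that are not stated in the source: a single attention head without learned query/key/value projections, mean pooling, and a scalar Sigmoid gate with hypothetical parameters `gate_w` and `gate_b`. Each attention output is joined to the features that supplied its queries, because those are the only ones with a matching shape.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def residual_norm(original, enhanced):
    """Add the attention output to the original features, then normalize each row."""
    return [layer_norm([o + e for o, e in zip(orow, erow)])
            for orow, erow in zip(original, enhanced)]

def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def joint_representation(text_feats, label_feats, gate_w, gate_b=0.0):
    # Text queries attend over label keys/values: label-enhanced features.
    label_enhanced = residual_norm(text_feats, attend(text_feats, label_feats, label_feats))
    # Label queries attend over text keys/values: text-enhanced features.
    text_enhanced = residual_norm(label_feats, attend(label_feats, text_feats, text_feats))
    a = mean_pool(label_enhanced)
    b = mean_pool(text_enhanced)
    # Scalar gate computed from the concatenated pooled features (assumed form).
    z = sum(w * x for w, x in zip(gate_w, a + b)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-z))
    joint = [gate * x + (1.0 - gate) * y for x, y in zip(a, b)]
    return joint, gate

# Three text positions and two candidate labels in a 4-dimensional shared space.
text_feats = [[0.2, -0.1, 0.4, 0.0], [0.5, 0.3, -0.2, 0.1], [-0.3, 0.2, 0.1, 0.6]]
label_feats = [[0.1, 0.0, 0.3, -0.2], [0.4, -0.4, 0.2, 0.2]]
joint, gate = joint_representation(text_feats, label_feats, gate_w=[0.0] * 8)
```

With all gate parameters set to zero the gate evaluates to 0.5, so the joint representation is the plain average of the two pooled feature vectors; trained gate parameters would shift that balance per sample.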

[0045] Furthermore, in the disease identification process of the second classification module, this embodiment only performs semantic interaction and feature fusion operations under the constraints of the candidate disease set generated by the first classification module. However, the final classification head still covers all disease categories to ensure that different samples are discriminated within a unified category space. Specifically, the joint semantic representation is input into the disease identification model, all disease categories are scored, and the final single disease prediction result is output. Through the candidate disease constraint mechanism, the interference of irrelevant disease categories on the discrimination process can be effectively reduced, while avoiding the problem of category semantic inconsistency caused by dynamic changes in the classification space.

[0046] In one embodiment, to improve the medical plausibility of the model's predictions and mitigate the impact of class imbalance on model training, a disease-symptom association penalty mechanism is introduced during the training of the second classification module. Specifically, during model training, a disease-symptom association matrix is first constructed based on a medical knowledge graph, electronic medical records, clinical statistics, and external medical knowledge statistics. The matrix rows correspond to disease categories, and the columns correspond to symptom elements. The association strength is then normalized. As detailed in Figure 4, the construction includes identifying symptom entities in the prior medical knowledge data using a symptom entity recognition model, and then performing semantic matching and standardization against a standard symptom database; further, a disease-symptom co-occurrence matrix is constructed through a disease-symptom statistics module, and the disease-symptom association penalty matrix is obtained through a penalty weight mapping module.

[0047] Then, for the disease category probability distribution output by the disease identification model in the second classification module, the KL divergence between it and the disease distribution corresponding to the symptom elements in the input symptom text is calculated. This divergence is used as a disease-symptom association penalty term to guide the model's prediction results to conform to medical co-occurrence rules. It should be noted that in this embodiment, the disease-symptom association penalty mechanism only takes effect during the training phase and is not calculated during the inference phase.

[0048] Furthermore, in some specific implementation schemes, the difference between the disease distribution predicted by the model and the disease-symptom matching distribution is transformed into a penalty term, which is then combined with a weighted single-label cross-entropy loss function to form a joint optimization objective, represented as follows:

[0049]

[0050]

[0051] In the formula, $i$ is the disease category index predicted by the second classification module; $C$ is the total number of disease categories; $w_i$ is the loss weight corresponding to the $i$-th disease category; $y_i$ is the true label of the sample for the $i$-th disease category; $p_i$ is the probability, predicted by the second classification module, that the medical text belongs to the $i$-th disease category; and $q_i$ is the prior probability that the medical symptom text corresponds to the $i$-th disease category.
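One form of the joint objective that is consistent with the description in [0048] is sketched below, assuming a standard weighted single-label cross-entropy. The names $L_{CE}$, $L_2$, and $L_{KL}$ (the disease-symptom association penalty term) and the balancing coefficient $\lambda$ are illustrative and do not come from the source; $\lambda = 1$ gives a plain sum. $w_i$, $y_i$, and $p_i$ are the category weight, the true label, and the second-stage predicted probability of category $i$ among $C$ categories.

```latex
L_{CE} = -\sum_{i=1}^{C} w_i \, y_i \log p_i
\qquad
L_2 = L_{CE} + \lambda \, L_{KL}
```

For a single-label sample the sum in $L_{CE}$ reduces to the weighted negative log-probability of the true category, and the penalty term is what ties the rest of the predicted distribution to the symptom-derived prior.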

[0052] In the calculation formula of $w_i$, $n_i$ is the number of samples corresponding to the $i$-th disease category.

[0053] In the calculation formula of $q_i$, $\varepsilon$ is a smoothing constant that prevents the denominator from being zero; $j$ is the symptom category index; $M$ is the total number of symptoms; $A_{ij}$ is the association strength between the $i$-th disease category and the $j$-th symptom; and $s_j$ indicates whether the $j$-th symptom appears in the current sample.

[0054] The denominator term traverses all disease categories and all symptom categories to normalize the prior probabilities, thereby ensuring that the $q_i$ form a valid probability distribution, such that $\sum_{i=1}^{C} q_i = 1$.
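The symptom-derived prior and the KL penalty can be sketched as follows. The prior follows the normalization described above: each disease's association strengths are summed over the symptoms present in the sample and divided by the same sum taken over all diseases, plus a smoothing constant. The direction of the divergence (prior relative to prediction) and the placement of the smoothing constant inside the logarithm are assumptions, and the function names and the toy matrix are hypothetical.

```python
import math

def symptom_prior(assoc, present, eps=1e-8):
    """Prior over diseases: per-disease association mass on present symptoms, normalized.

    assoc[i][j] is the association strength between disease i and symptom j;
    present[j] is 1 if symptom j appears in the sample, else 0.
    """
    raw = [sum(a * s for a, s in zip(row, present)) for row in assoc]
    denom = sum(raw) + eps  # traverses all diseases and all symptoms
    return [r / denom for r in raw]

def kl_penalty(prior, predicted, eps=1e-8):
    """KL(prior || predicted); the direction is an assumption, not taken from the source."""
    return sum(q * math.log((q + eps) / (p + eps))
               for q, p in zip(prior, predicted) if q > 0)

# Three diseases (rows) by three symptoms (columns); symptoms 0 and 1 are present.
assoc = [[0.8, 0.1, 0.0],
         [0.2, 0.6, 0.0],
         [0.0, 0.3, 1.0]]
prior = symptom_prior(assoc, present=[1, 1, 0])
matched = kl_penalty(prior, prior)
mismatched = kl_penalty(prior, [0.05, 0.05, 0.90])
```

A prediction that matches the prior incurs no penalty, while a prediction that concentrates on a disease weakly associated with the observed symptoms is penalized. As stated in [0047], this term applies only during training.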

[0055] As a preferred implementation, the total loss of the second classification module and the loss of the first classification module are weighted and fused, and the model parameters are optimized jointly via backpropagation, as shown in Figure 5. The total loss function is as follows:

$$L_{total} = \lambda_1 L_1 + \lambda_2 L_2$$

[0056] where $L_1$ is the loss of the first classification module, $L_2$ is the total loss of the second classification module, and $\lambda_1$ and $\lambda_2$ are weight parameters that can be dynamically adjusted during training.

[0057] Based on the same inventive concept, embodiments of the present invention also provide a text-based disease classification method based on semantic label feedback and a symptom association penalty mechanism, the steps of which include: in the first stage, multi-layer symptom semantic features are extracted from medical text data through a pre-trained language model, and candidate disease labels are output based on the multi-layer symptom semantic features; in the second stage, joint semantic features are constructed based on the multi-layer symptom semantic features and the candidate disease labels, and the disease category distribution probability is output through the disease identification model according to the joint semantic features.

[0058] This invention uses a medical text dataset and expands the original symptom text through a generative post-screening data augmentation strategy. The original data and augmented data are then input into a pre-trained language model to extract multi-level symptom semantic features. In the first stage, disease categories are coarsely classified to obtain multiple candidate disease labels. Further, based on the semantic embedding information of the candidate disease labels, a bidirectional semantic augmentation mechanism driven by semantic label feedback is introduced to construct a joint semantic representation of the candidate disease label semantics and symptom text features. Under the constraint of a disease-symptom association penalty mechanism, the second stage of optimization training is carried out so that the model output results maintain semantic consistency while conforming to medical logic, thereby completing the text disease recognition task.
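The post-screening step of the augmentation strategy can be sketched as a cosine-similarity filter over text embeddings. This is only an illustration: the embedding vectors are assumed to come from some text encoder that is not shown, and the threshold range of 0.80 to 0.98, like the function names, is hypothetical rather than a value given in this document.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_augmented(original_vec: Sequence[float],
                     candidates: Sequence[Tuple[str, Sequence[float]]],
                     low: float = 0.80,
                     high: float = 0.98) -> List[str]:
    """Keep generated texts whose similarity to the original lies in [low, high].

    ASSUMPTION: the bounds are illustrative. The lower bound rejects rewrites
    that drift in meaning; the upper bound rejects near-verbatim copies.
    """
    return [text for text, vec in candidates
            if low <= cosine_similarity(original_vec, vec) <= high]


# Toy 2-dimensional "embeddings" standing in for real encoder outputs.
original = [1.0, 0.0]
generated = [
    ("close paraphrase", [0.9, 0.3]),   # similarity about 0.95: kept
    ("verbatim copy", [1.0, 0.0]),      # similarity 1.0: rejected as duplicate
    ("unrelated text", [0.0, 1.0]),     # similarity 0.0: rejected as off-topic
]
kept = filter_augmented(original, generated)
```

Using a range instead of a single lower threshold keeps the augmented set both faithful to the original symptoms and varied in wording.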

[0059] Since the execution process of each step in this method is consistent with the execution principle of each module in the aforementioned text disease classification system based on semantic label feedback and symptom association penalty mechanism, it will not be repeated here.

[0060] In the experiments of a specific embodiment, the deep learning framework PyTorch was used. All experiments were run on a server equipped with four NVIDIA RTX 3090 Ti GPUs, an Intel i9-10850K CPU, and 64 GB of memory, running Ubuntu 18.04. Model training employed the AdamW optimizer with an initial learning rate of 0.00002 for a total of 30 training epochs, together with an early stopping strategy to prevent overfitting. During the model inference phase, only the symptom text to be identified is input into the trained model. Semantic features of the symptom text are extracted by the pre-trained language model and, based on the candidate disease labels generated in the first stage, the final disease identification result is output end-to-end through the semantic label feedback module and the second-stage discrimination module.
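The early stopping strategy is not specified further, so the following is a minimal sketch of one common form: training halts once the validation loss has failed to improve for a fixed number of consecutive epochs. The patience of 3, the class name `EarlyStopping`, and the loss values are assumptions; only the 30-epoch cap comes from the text.

```python
from typing import Sequence

MAX_EPOCHS = 30  # total training epochs stated in the embodiment


class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience      # ASSUMPTION: not given in the source
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses: Sequence[float], patience: int = 3) -> int:
    """Return how many epochs execute before stopping (at most MAX_EPOCHS)."""
    stopper = EarlyStopping(patience=patience)
    # In a real PyTorch loop, each iteration would train for one epoch with
    # torch.optim.AdamW(model.parameters(), lr=2e-5) and then evaluate.
    for epoch, loss in enumerate(val_losses[:MAX_EPOCHS], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), MAX_EPOCHS)


# Hypothetical validation losses: the best value occurs at epoch 3.
stopped_at = epochs_run([0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50])
```

In this toy run, training stops at epoch 6, after three epochs without improvement, so the later value of 0.50 is never reached; a larger patience trades extra compute for a lower risk of stopping too early.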

[0061] Compared to existing methods based on pre-trained language models, this invention demonstrates superior classification performance in TCM text disease identification tasks, improving overall accuracy from 82.19% to 84.86% and macro-average F1 score from 51.01% to 58.13%. Experimental results show that the method of this invention has a positive effect on improving the accuracy of text disease identification and the stability of model prediction. The above results are only used to illustrate the feasibility and effectiveness of the technical solution of this invention.
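The gap between overall accuracy and macro-average F1 is typical of imbalanced category distributions, because macro-averaging gives every disease category equal weight regardless of its sample count. The sketch below computes both metrics on toy labels; the data are hypothetical and are not the experimental data behind the figures above.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-category F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the category is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced case: a model that always predicts the frequent category "A".
true_labels = ["A"] * 8 + ["B"] * 2
pred_labels = ["A"] * 10
acc = accuracy(true_labels, pred_labels)   # 0.8
mf1 = macro_f1(true_labels, pred_labels)   # (16/18 + 0) / 2 = 4/9
```

Here accuracy is 0.8 while macro F1 is only about 0.44, because the rare category contributes an F1 of zero. A method that improves recognition of rare categories therefore tends to raise macro F1 more than it raises accuracy.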

[0062] Furthermore, the text-based disease classification method described in this invention can be executed by a computer program and deployed on a server or terminal device to achieve automated disease identification of medical symptom texts. The above embodiments are merely specific illustrations of the technical solution of this invention and are not intended to limit the scope of protection of this invention. Equivalent transformations made by those skilled in the art to the model structure, parameter settings, or implementation methods without departing from the core idea of this invention should be considered to fall within the scope of protection of the claims of this invention.

[0063] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For identical or similar parts, the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

[0064] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text-based disease classification system based on semantic label feedback and a symptom association penalty mechanism, characterized in that it comprises: a first classification module with a built-in pre-trained language model, configured to extract multi-layer symptom semantic features from medical text data and to output candidate disease labels based on the multi-layer symptom semantic features; and a second classification module, configured to construct joint semantic features based on the multi-layer symptom semantic features and the candidate disease labels, and to output the disease category distribution probability through a disease identification model according to the joint semantic features.

2. The text-based disease classification system according to claim 1, characterized in that the first classification module includes a data preprocessing unit configured to acquire medical text data and perform preprocessing, the preprocessing including: segmenting and encoding the medical text data to obtain a standard text input sequence; rewriting the standard text input sequence with a generative large language model, by synonym rewriting or expression replacement, to obtain enhanced text; calculating the cosine similarity between the standard text input sequence and the enhanced text; and, when the cosine similarity falls within a preset threshold range, retaining the enhanced text corresponding to that cosine similarity.

3. The text-based disease classification system according to claim 1, characterized in that, in the first classification module, the pre-trained language model uses a weighted cross-entropy loss function as the first-stage classification loss function, wherein the category loss weights are calculated using the following formula: In the formula, $w_i$ is the loss weight corresponding to the $i$-th disease category in the first stage; $i$ is the disease category index in the first-stage prediction; $C$ is the total number of disease categories; and $n_i$ is the number of samples corresponding to the $i$-th disease category.

4. The text-based disease classification system according to claim 1, characterized in that constructing the joint semantic features from the multi-layer symptom semantic features and the candidate disease labels includes: obtaining the semantic embedding vectors of the candidate disease labels through a semantic label encoder, and projecting the label semantic embedding vectors and the multi-layer symptom semantic features into the same semantic space; constructing a bidirectional cross-attention structure, in which the multi-layer symptom semantic features serve as query vectors and the label semantic embedding vectors serve as keys and values to obtain label-enhanced features, and the label semantic embedding vectors serve as query vectors and the multi-layer symptom semantic features serve as keys and values to obtain text-enhanced features; and fusing the label-enhanced features and the text-enhanced features with learnable weights to obtain the joint semantic features.

5. The text-based disease classification system according to claim 4, characterized in that the semantic label encoder keeps its parameters frozen during training to ensure the stability of the label semantic embedding vectors.

6. The text-based disease classification system according to claim 1, characterized in that the second classification module introduces a disease-symptom association penalty mechanism during training, including: constructing a disease-symptom association matrix; calculating the KL divergence between the disease category distribution probability output by the disease identification model and the disease distribution corresponding to the symptoms in the disease-symptom association matrix; and using the KL divergence as the disease-symptom association penalty term.

7. The text-based disease classification system according to claim 6, characterized in that the disease-symptom association matrix is constructed based on medical knowledge graphs, electronic medical records, clinical statistics, and external medical knowledge statistics, wherein the rows of the matrix correspond to disease categories and the columns correspond to symptom elements.

8. The text-based disease classification system according to claim 6, characterized in that the disease-symptom association penalty term is represented as follows: In the formula, $L_{KL}$ is the KL divergence penalty term; $C$ is the total number of disease categories; $i$ is the disease category index predicted by the second classification module; $q_i$ is the prior probability that the medical symptom text corresponds to the $i$-th disease category; $p_i$ is the probability, predicted by the second classification module, that the medical text belongs to the $i$-th disease category; $\epsilon$ is a smoothing constant that prevents the denominator from being zero; $j$ is the symptom category index (the traversal variable); $S$ is the total number of symptoms; $A_{ij}$ is the strength of the association between the $i$-th target disease category and the $j$-th symptom; and $s_j$ indicates whether the $j$-th symptom appears in the current sample.

9. A text-based disease classification method based on semantic label feedback and a symptom association penalty mechanism, characterized in that the method uses the text-based disease classification system according to any one of claims 1-8 and includes the following steps: in the first stage, extracting multi-layer symptom semantic features from medical text data through a pre-trained language model, and outputting candidate disease labels based on the multi-layer symptom semantic features; and in the second stage, constructing joint semantic features based on the multi-layer symptom semantic features and the candidate disease labels, and outputting the disease category distribution probability through the disease identification model according to the joint semantic features.