Medical knowledge base semantic matching method and device for positive and negative sample imbalance

The method combines BGE-encoder vectorization with DBSCAN deduplication, Annoy approximate nearest neighbor retrieval, and large language model filtering, together with a focus loss function constrained by topic-intent consistency and a distillation loss function. This addresses the imbalance of positive and negative samples in intelligent medical dialogue systems, improving the semantic matching accuracy of the medical knowledge base and the generalization ability of the model.

CN120994772B · Active · Publication Date: 2026-06-19 · XIAMEN KUAISHANGTONG TECH CORP LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XIAMEN KUAISHANGTONG TECH CORP LTD
Filing Date: 2025-07-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In intelligent medical dialogue systems, the query hit rate of medical knowledge base and the accuracy of robot interaction are affected by the imbalance of positive and negative samples. Existing technologies are unable to effectively solve the problems of data scarcity and distribution imbalance, resulting in difficulties in model training, insufficient gradient correction capabilities, high misjudgment rate of extreme samples, blurred category boundaries, and insufficient ability to extract key features.

Method used

A multi-level screening process is adopted, which combines BGE encoder with DBSCAN deduplication, Annoy approximate nearest neighbor algorithm matching and large language model filtering. The ratio of positive and negative samples is adjusted. By dynamically assigning weights based on the consistency of topic and intent, a focus loss function and distillation loss function based on topic-intent consistency constraints are constructed to perform knowledge distillation and improve the semantic matching ability of the model.

Benefits of technology

It significantly improves the ratio of similar to dissimilar samples in the training data, alleviates the problem of imbalanced training data, amplifies the loss contribution of low-consistency sample pairs, forces the model to focus on implicit entities and demand types specific to the medical field, and improves the accuracy of semantic matching.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and apparatus for semantic matching of medical knowledge bases in the face of imbalanced positive and negative samples. The method includes: acquiring search terms and customer questions from medical scenario dialogues and combining them with matching questions from the medical knowledge base to construct training data; extracting the topic and intent of sample pairs from the training data and obtaining a topic-intent consistency classification result; constructing a focus loss function based on topic-intent consistency constraints according to the similarity labels of sample pairs in the training data, the topic-intent consistency classification result, and a first similarity vector; constructing a distillation loss function based on a student model and a teacher model; constructing a total loss function based on the focus loss function and the distillation loss function; using the total loss function to complete knowledge distillation from the teacher model to the student model to obtain a trained student model; and using the trained student model for semantic matching. This invention solves the problem of low prediction accuracy caused by the imbalanced distribution of medical dialogue data.

Description

Technical Field

[0001] This invention relates to the field of intelligent medical dialogue, specifically to a semantic matching method and apparatus for medical knowledge bases with imbalanced positive and negative samples.

Background Technology

[0002] In the field of intelligent medical dialogue systems, the query hit rate of medical question-and-answer knowledge bases and the accuracy of robot interactions highly depend on the semantic matching model's precise understanding of medical professional data. However, current technical solutions have significant shortcomings in the following aspects:

[0003] 1. The scarcity and uneven distribution of medical dialogue data

[0004] In intelligent consultation scenarios within the healthcare field, patient symptom descriptions are highly diverse, and the distribution of symptom-diagnosis category data in the medical knowledge base is extremely uneven. Two problems follow. First, data acquisition is challenging: accurate question-and-answer pairs conforming to clinical standards account for only 10%-15% of the overall dataset, while redundant and irrelevant data (such as combinations of symptoms with non-corresponding treatment plans) accounts for over 80%. Second, annotation costs are high: in existing data extraction processes, annotators must label massive numbers of invalid negative samples (such as irrelevant pairings like "diabetes" and "fracture treatment plan"), and this workload accounts for over 60% of the total annotation cost, leading to significant resource waste.

[0005] 2. The Dilemma of Imbalanced Data-Driven Model Training

[0006] (1) Model convergence defects and local optimal solutions

[0007] With imbalanced data in medical scenarios, traditional similarity models (such as the two-stream Siamese network or the interactive cross-encoder architecture) suffer from insufficient gradient correction capability: the gradient contribution of minority class samples is excessively diluted during training, producing a large number of flat regions on the loss surface, so that training eventually gets stuck in local optima.

[0008] (2) Predictive accuracy defects of FocalLoss

[0009] Although FocalLoss can improve model convergence by dynamically adjusting the weights of easy and difficult samples (when the γ parameter is ≥2), it presents the following technical challenges in medical scenarios:

[0010] 1) Increased false positive rate in extreme samples: The predicted similarity score for obviously dissimilar query pairs (such as “diabetes symptoms” and “fracture treatment plan”) is still higher than the threshold (such as 0.4), causing the F1-score on the test set to drop below 0.6;

[0011] 2) Category boundary blurring: FocalLoss lacks targeted optimization for semantic gaps specific to the medical field (such as the error rate of synonym mapping between "chest pain" and "angina pectoris" > 30%), resulting in insufficient key feature extraction capabilities.

Summary of the Invention

[0012] The purpose of this application is to propose a semantic matching method and apparatus for medical knowledge bases with imbalanced positive and negative samples, addressing the aforementioned technical problems.

[0013] In a first aspect, the present invention provides a semantic matching method for medical knowledge bases with imbalanced positive and negative samples, comprising the following steps:

[0014] Search terms and customer questions from medical scenario dialogues are acquired to form a query set. This query set is combined with matching questions from a medical knowledge base to construct candidate sample pairs. The candidate sample pairs are filtered and labeled, and the ratio of positive to negative sample pairs is adjusted to obtain training data. The topic and intent of each sample pair in the training data are extracted, and the sample pairs are classified according to their topics and intents to obtain topic-intent consistency classification results;

[0015] The sample pairs from the training data are input into the student model to obtain the first projection vector and the first similarity vector, and into the teacher model to obtain the second projection vector and the second similarity vector. A focus loss function based on topic-intent consistency constraints is constructed from the similarity labels of the sample pairs in the training data, the topic-intent consistency classification results, and the first similarity vector. A distillation loss function is constructed from the output of the student model and the output of the teacher model. The total loss function is constructed from the focus loss function and the distillation loss function, and is used to complete the knowledge distillation from the teacher model to the student model, yielding the trained student model.

[0016] The system acquires the customer question to be matched and performs a preliminary search on it using a medical knowledge base, resulting in several preliminary matching questions. The customer question to be matched and each preliminary matching question are then input into a trained student model, which outputs the corresponding similarity vector. Based on the similarity vectors of all the preliminary matching questions, the system determines the matching question with the highest semantic similarity to the customer question to be matched and outputs the corresponding response statement.
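As a rough illustration of this matching step, the sketch below scores each preliminary candidate and returns the answer attached to the best one. The names `match_question`, `toy_search`, and `toy_score`, the toy knowledge base, and the placeholder answers are all hypothetical stand-ins: the text does not specify how the preliminary search or the trained student model are invoked. The sketch also assumes that index 0 of the similarity vector holds the probability of similarity.

```python
def match_question(question, kb, preliminary_search, score_pair):
    """Return (best_matching_question, answer, p_similar), or None if no candidates."""
    candidates = preliminary_search(question, kb)
    if not candidates:
        return None
    scored = []
    for cand in candidates:
        # score_pair stands in for the trained student model and returns
        # a 2-dimensional similarity vector (p_similar, p_dissimilar).
        p_similar, _p_dissimilar = score_pair(question, cand)
        scored.append((p_similar, cand))
    best_p, best_q = max(scored, key=lambda item: item[0])
    return best_q, kb[best_q], best_p


# Toy knowledge base: matching question -> placeholder response statement.
KB = {
    "What are the symptoms of diabetes?": "answer-1",
    "How is a fracture treated?": "answer-2",
}


def toy_search(question, kb):
    # Hypothetical preliminary search: return every knowledge-base question.
    return list(kb)


def toy_score(question, candidate):
    # Hypothetical scorer: word-overlap ratio in place of a neural model.
    q = set(question.lower().replace("?", "").split())
    c = set(candidate.lower().replace("?", "").split())
    overlap = len(q & c) / len(q | c)
    return (overlap, 1.0 - overlap)


result = match_question("diabetes symptoms", KB, toy_search, toy_score)
```

In a real system the scorer would be the distilled student model; only the ranking logic is meant to carry over.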

[0017] As a preferred approach, search terms and customer questions from medical scenario dialogues are acquired and used to form a query set. These are then combined with matching questions from a medical knowledge base to construct candidate sample pairs. These candidate sample pairs are then filtered and labeled, and the ratio of positive to negative sample pairs is adjusted to obtain training data. Specifically, this includes:

[0018] The BGE encoder is used to vectorize each customer question in the query set to obtain the corresponding text vector. Based on the text vector, the DBSCAN clustering algorithm is used to deduplicate all customer questions in the query set to obtain the deduplicated query set.

[0019] The medical knowledge base was manually cleaned and then vectorized using a BGE encoder to construct a vector library.

[0020] The approximate nearest neighbor algorithm is used to retrieve the top N matching questions in the vector library that have the highest semantic similarity to the query samples in the deduplicated query set, and form a candidate sample pair set with the corresponding query samples;

[0021] A prompt for judging semantic fit is constructed. A large language model, guided by this prompt, judges the semantic fit of each candidate sample pair in the candidate sample pair set and outputs a label for each candidate sample pair. All candidate sample pairs labeled similar and a portion of the candidate sample pairs labeled dissimilar are retained.

[0022] Then, semantic deduplication is performed on all candidate sample pairs with similar labels and some candidate sample pairs with dissimilar labels using the BGE encoder and DBSCAN clustering algorithm, followed by manual annotation to obtain the corresponding similarity labels.
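The candidate-pair retrieval among the steps above can be sketched as follows. This is a minimal illustration that uses exact cosine-similarity search in place of an approximate nearest neighbor index, and hand-written toy vectors in place of BGE embeddings; the function names and the choice of cosine similarity are assumptions, not details taken from the text.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_n_candidates(query_vec, kb_vectors, n=5):
    """Indices of the n knowledge-base vectors most similar to query_vec."""
    ranked = sorted(
        range(len(kb_vectors)),
        key=lambda i: cosine_similarity(query_vec, kb_vectors[i]),
        reverse=True,
    )
    return ranked[:n]


def build_candidate_pairs(queries, query_vecs, kb_questions, kb_vectors, n=5):
    """Pair every query with its top-n matching questions."""
    pairs = []
    for query, vec in zip(queries, query_vecs):
        for i in top_n_candidates(vec, kb_vectors, n):
            pairs.append((query, kb_questions[i]))
    return pairs


# Toy data: three knowledge-base questions and one query.
kb_questions = ["kb-q0", "kb-q1", "kb-q2"]
kb_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = build_candidate_pairs(["query-a"], [[1.0, 0.1]], kb_questions, kb_vectors, n=2)
```

An approximate index trades a small amount of recall for much faster lookups on large vector libraries; the exact search here only shows what is being approximated.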

[0023] As a preferred option, the focus loss function based on topic-intent consistency constraints is expressed as:

[0024] $\mathrm{loss}_{\text{focal-consistency}} = -\alpha (1-p)^{\gamma_1} \, \beta (1-C)^{\gamma_2} \log(p)$;

[0025] Where $\mathrm{loss}_{\text{focal-consistency}}$ represents the focus loss function based on topic-intent consistency constraints; $\alpha$ and $\beta$ represent the first balance factor and the second smoothing factor, respectively; $\gamma_1$ and $\gamma_2$ represent the first modulation factor and the second adjustment factor, respectively; $p$ represents the probability of the corresponding class selected from the first similarity vector according to the similarity label of the sample pair in the training data; and $C$ represents the topic-intent consistency coefficient determined from the topic-intent consistency classification results, expressed as:

[0026]
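The loss expression in [0024] can be evaluated directly once p and C are known. The sketch below is a minimal illustration under two assumptions: the consistency coefficient C is passed in as a number in [0, 1], because its defining formula is not reproduced in this text, and the default values of α, β, γ1, and γ2 are purely illustrative.

```python
import math


def focal_consistency_loss(p, c, alpha=0.25, beta=1.0, gamma1=2.0, gamma2=1.0):
    """-alpha * (1 - p)**gamma1 * beta * (1 - c)**gamma2 * log(p).

    p: probability the student assigns to the labeled class, in (0, 1].
    c: topic-intent consistency coefficient, assumed to lie in [0, 1].
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must be in [0, 1]")
    return -alpha * (1.0 - p) ** gamma1 * beta * (1.0 - c) ** gamma2 * math.log(p)


# Same prediction confidence, different consistency coefficients.
low_consistency = focal_consistency_loss(p=0.6, c=0.2)
high_consistency = focal_consistency_loss(p=0.6, c=0.8)
```

With γ2 = 1, the pair with C = 0.2 contributes four times the loss of the pair with C = 0.8, which is the amplification of low-consistency pairs that the expression is built to produce. Note that, as written, the expression gives zero loss when C = 1.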

[0027] Preferably, sample pairs from the training data are input into the student model to obtain a first projection vector and a first similarity vector, and sample pairs from the training data are input into the teacher model to obtain a second projection vector and a second similarity vector, specifically including:

[0028] The student model includes a first embedding layer and a first feature extraction layer connected in sequence, as well as a first feature projection layer and a first similarity prediction module set in parallel. Sample pairs from the training data are input into the student model, first passing through the first embedding layer to extract the corresponding first embedding vector. The first embedding vector is then input into the first feature extraction layer to obtain a first feature vector. The first feature vector is then input into the first feature projection layer and the first similarity prediction module to obtain a first projection vector and a first similarity vector, as shown in the following equation:

[0029] $h_{s\text{-}proj} = \mathrm{project\_linear}(h_s)$;

[0030] $p_{ss} = \mathrm{softmax}(\mathrm{mlp}(h_s))$;

[0031] Where $h_s$ represents the first feature vector; $h_{s\text{-}proj}$ represents the first projection vector; project_linear represents the linear layer; $p_{ss}$ represents the first similarity vector, which has a dimension of 2 and whose components are the probabilities of similarity and dissimilarity, respectively; mlp represents a multilayer perceptron; softmax represents the softmax function; and sigmoid represents the sigmoid activation function;
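The two parallel heads defined above can be sketched with plain lists standing in for tensors. The single hidden layer, the ReLU activation, the tiny dimensions, and the hand-picked weights are all assumptions made for illustration; the text only states that the projection head is a linear layer and that the prediction head is an MLP followed by softmax.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def student_heads(h_s, proj_w, proj_b, w1, b1, w2, b2):
    """Return (h_s_proj, p_ss) computed from the first feature vector h_s."""
    h_s_proj = linear(h_s, proj_w, proj_b)      # h_s-proj = project_linear(h_s)
    hidden = relu(linear(h_s, w1, b1))          # hidden layer of the assumed MLP
    p_ss = softmax(linear(hidden, w2, b2))      # p_ss = softmax(mlp(h_s))
    return h_s_proj, p_ss


h_s = [1.0, -1.0, 0.5]
h_s_proj, p_ss = student_heads(
    h_s,
    proj_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], proj_b=[0.0, 0.0],
    w1=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0],
)
```

Here `p_ss[0]` is read as the probability of similarity and `p_ss[1]` as the probability of dissimilarity, following the order given in the definition.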

[0032] The teacher model comprises a second embedding layer and a second feature extraction layer connected in sequence, as well as a second feature projection layer and a second similarity prediction module set in parallel. Sample pairs from the training data are input into the teacher model, first passing through the second embedding layer to extract the corresponding second embedding vector. This second embedding vector is then input into the second feature extraction layer to obtain the second feature vector. The second feature vector is then input into the second feature projection layer and the second similarity prediction module to obtain the second projection vector and the second similarity vector, as shown in the following equation:

[0033] $h_{t\text{-}proj} = \mathrm{project\_linear}(h_t)$;

[0034] $p_{ts} = \mathrm{softmax}(\mathrm{mlp}(h_t))$;

[0035] Where $h_t$ represents the second feature vector; $h_{t\text{-}proj}$ represents the second projection vector; and $p_{ts}$ represents the second similarity vector, which has a dimension of 2 and whose components are the probabilities of similarity and dissimilarity, respectively.

[0036] Preferably, a distillation loss function is constructed based on the output results of the student model and the teacher model, specifically including:

[0037] A mean squared error (MSE) loss function is constructed based on the first projection vector and the second projection vector, and used as the first distillation loss function, as shown in the following equation:

[0038] $\mathrm{loss}_{hidden} = \mathrm{MSE}(h_{t\text{-}proj}, h_{s\text{-}proj})$;

[0039] Where $\mathrm{loss}_{hidden}$ represents the first distillation loss function, and MSE represents the mean squared error loss function.

[0040] A mean squared error (MSE) loss function is constructed based on the first and second similarity vectors and used as the second distillation loss function, as shown in the following equation:

[0041] $\mathrm{loss}_{s\_logits} = \mathrm{MSE}(p_{ts}, p_{ss})$;

[0042] Where $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function.
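The two distillation terms can be computed as in the sketch below. It assumes that MSE denotes the usual mean squared error, as the symbol in the equations suggests, and it uses short hand-written vectors in place of real model outputs.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_losses(h_t_proj, h_s_proj, p_ts, p_ss):
    """Return (loss_hidden, loss_s_logits) for one sample pair."""
    loss_hidden = mse(h_t_proj, h_s_proj)   # matches projected hidden features
    loss_s_logits = mse(p_ts, p_ss)         # matches similarity vectors
    return loss_hidden, loss_s_logits


loss_hidden, loss_s_logits = distillation_losses(
    h_t_proj=[1.0, 2.0], h_s_proj=[1.0, 4.0],
    p_ts=[0.9, 0.1], p_ss=[0.7, 0.3],
)
```

Both terms go to zero when the student reproduces the teacher exactly, so each one pulls a different part of the student (features and predictions) toward the teacher.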

[0043] As a preferred option, the total loss function is expressed as:

[0044] $\mathrm{loss}_{total} = \mu_1 \, \mathrm{loss}_{hidden} + \mu_2 \, \mathrm{loss}_{s\_logits} + \mu_3 \, \mathrm{loss}_{\text{focal-consistency}}$;

[0045] Where $\mathrm{loss}_{total}$ represents the total loss function; $\mathrm{loss}_{\text{focal-consistency}}$ represents the focus loss function based on topic-intent consistency constraints; $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function; $\mathrm{loss}_{hidden}$ represents the first distillation loss function; and $\mu_1$, $\mu_2$, and $\mu_3$ represent the corresponding weighting coefficients, respectively.
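For concreteness, the weighted sum can be written as a one-line function. The weights used in the example are arbitrary illustrative values, since the text does not state which values of μ1, μ2, and μ3 are used.

```python
def total_loss(loss_hidden, loss_s_logits, loss_focal_consistency,
               mu1=1.0, mu2=1.0, mu3=1.0):
    """Weighted sum of the two distillation terms and the focus loss term."""
    return mu1 * loss_hidden + mu2 * loss_s_logits + mu3 * loss_focal_consistency


# Illustrative weights only: 0.5 * 2.0 + 1.0 * 0.04 + 2.0 * 0.1
example_total = total_loss(2.0, 0.04, 0.1, mu1=0.5, mu2=1.0, mu3=2.0)
```

Raising μ3 relative to μ1 and μ2 shifts training toward the labeled, consistency-weighted objective and away from imitating the teacher.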

[0046] In a second aspect, the present invention provides a semantic matching device for a medical knowledge base with imbalanced positive and negative samples, comprising:

[0047] The training data construction module is configured to acquire search terms and customer questions from medical scenario dialogues to form a query set; combine the query set with matching questions from a medical knowledge base to construct candidate sample pairs; filter and label the candidate sample pairs and adjust the ratio of positive to negative sample pairs to obtain training data; and extract the topic and intent of each sample pair in the training data and classify the sample pairs according to their topics and intents to obtain topic-intent consistency classification results;

[0048] The knowledge distillation module is configured to input sample pairs from the training data into the student model to obtain a first projection vector and a first similarity vector, and input sample pairs from the training data into the teacher model to obtain a second projection vector and a second similarity vector; construct a focus loss function based on topic-intent consistency constraints from the similarity labels of the sample pairs in the training data, the topic-intent consistency classification results, and the first similarity vector; construct a distillation loss function from the output of the student model and the output of the teacher model; construct the total loss function from the focus loss function and the distillation loss function; and use the total loss function to complete the knowledge distillation from the teacher model to the student model to obtain the trained student model.

[0049] The semantic matching module is configured to acquire the customer question to be matched and perform a preliminary search on it using a medical knowledge base to obtain several preliminary matching questions; input the customer question to be matched and each preliminary matching question into a trained student model, output the corresponding similarity vector, determine the matching question with the highest semantic similarity to the customer question to be matched based on the similarity vectors of all the preliminary matching questions, and output the corresponding response statement.

[0050] In a third aspect, the present invention provides an electronic device including one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any implementation of the first aspect.

[0051] In a fourth aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method as described in any implementation of the first aspect.

[0052] In a fifth aspect, the present invention provides a computer program product, including a computer program that, when executed by a processor, implements the method as described in any implementation of the first aspect.

[0053] Compared with the prior art, the present invention has the following beneficial effects:

[0054] (1) The semantic matching method for medical knowledge base with imbalanced positive and negative samples proposed in this invention significantly improves the ratio of similar and dissimilar samples in the training data through a multi-level screening process that combines BGE encoder with DBSCAN deduplication, Annoy approximate nearest neighbor algorithm matching and large language model filtering, thereby alleviating the problem of imbalanced training data to a certain extent.

[0055] (2) The semantic matching method for medical knowledge bases with imbalanced positive and negative samples proposed in this invention applies dynamic weighting based on the topic-intent consistency results, which amplifies the loss contribution of low-consistency sample pairs, thereby solving the gradient dilution problem of minority class samples in traditional models.

[0056] (3) The semantic matching method for medical knowledge base with imbalanced positive and negative samples proposed in this invention forces the model to pay attention to the implicit entities and demand types unique to the medical field through the dual constraints of topic classification and intent classification. Through the double distillation of the first distillation loss function and the second distillation loss function, the performance of the student model is close to that of the teacher model, which effectively improves the accuracy of semantic matching.

Description of the Drawings

[0057] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0058] Figure 1 is a flowchart of a semantic matching method for a medical knowledge base with imbalanced positive and negative samples, according to an embodiment of this application.

[0059] Figure 2 is a schematic diagram of a semantic matching device for a medical knowledge base with imbalanced positive and negative samples, according to an embodiment of this application.

[0060] Figure 3 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0061] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0062] Figure 1 illustrates a semantic matching method for a medical knowledge base with imbalanced positive and negative samples according to an embodiment of this application, comprising the following steps:

[0063] S1. Obtain search terms and customer questions from medical scenario dialogues and form a query set. Combine this with matching questions from the medical knowledge base to construct candidate sample pairs. Filter and label these candidate sample pairs, and adjust the ratio of positive to negative sample pairs to obtain training data. Extract the topic and intent of each sample pair from the training data. Classify the sample pairs in the training data according to their topic and intent to obtain topic-intent consistency classification results.

[0064] In a specific embodiment, search terms and customer questions from medical scenario dialogues are acquired and a query set is formed. Candidate sample pairs are constructed by combining these with matching questions from a medical knowledge base. These candidate sample pairs are then filtered and labeled, and the ratio of positive to negative sample pairs is adjusted to obtain training data. Specifically, this includes:

[0065] The BGE encoder is used to vectorize each customer question in the query set to obtain the corresponding text vector. Based on the text vector, the DBSCAN clustering algorithm is used to deduplicate all customer questions in the query set to obtain the deduplicated query set.

[0066] The medical knowledge base was manually cleaned and then vectorized using a BGE encoder to construct a vector library.

[0067] The approximate nearest neighbor algorithm is used to retrieve the top N matching questions in the vector library that have the highest semantic similarity to the query samples in the deduplicated query set, and form a candidate sample pair set with the corresponding query samples;

[0068] A prompt for judging semantic fit is constructed. A large language model, guided by this prompt, judges the semantic fit of each candidate sample pair in the candidate sample pair set and outputs a label for each candidate sample pair. All candidate sample pairs labeled similar and a portion of the candidate sample pairs labeled dissimilar are retained.

[0069] Then, semantic deduplication is performed on all candidate sample pairs with similar labels and some candidate sample pairs with dissimilar labels using the BGE encoder and DBSCAN clustering algorithm, followed by manual annotation to obtain the corresponding similarity labels.

[0070] Specifically, embodiments of this application extract search terms and customer questions from real medical dialogues on online medical consultation platforms to form an original query set. Each query sample in the query set includes a search term and a customer question. The customer question is semantically vectorized using a BGE encoder, and semantically duplicated samples are removed using a DBSCAN clustering algorithm (ε = 0.03, min_samples = 2), outputting a deduplicated query set.
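The deduplication step can be sketched with a small, self-contained DBSCAN written in plain Python. The parameters ε = 0.03 and min_samples = 2 come from the text; the use of cosine distance, the policy of keeping the first member of each cluster, and the toy vectors standing in for BGE embeddings are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps=0.03, min_samples=2):
    """One cluster label per vector; -1 marks noise (no near-duplicate found)."""
    n = len(vectors)
    # Each point counts as its own neighbor, so min_samples=2 means
    # "at least one other point within eps".
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels


def deduplicate(questions, vectors, eps=0.03, min_samples=2):
    """Keep every noise point and the first question of each cluster."""
    labels = dbscan(vectors, eps, min_samples)
    seen, kept = set(), []
    for question, label in zip(questions, labels):
        if label == -1:
            kept.append(question)
        elif label not in seen:
            seen.add(label)
            kept.append(question)
    return kept


questions = ["q-a", "q-a-rephrased", "q-b"]
vectors = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
deduped = deduplicate(questions, vectors)
```

With such a small ε, only questions whose embeddings are nearly identical fall into the same cluster, so distinct questions survive as noise points.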

[0071] A medical knowledge base is acquired, which serves as a QA library. This library contains several question-answer pairs consisting of matching questions and their corresponding answers. During the preparation phase of the medical knowledge base, it is manually cleaned to remove outdated or non-standardized question-answer pairs. A vector library is then constructed from the fields (i.e., Q fields) of the matching questions in the cleaned QA library using a BGE encoder.

[0072] Furthermore, the Annoy approximate nearest neighbor algorithm is used to retrieve the top 5 semantically similar matching questions for each query sample in the deduplicated query set from the vector library, forming a candidate sample pair set. In one embodiment, the Qwen2.5-72B large model is used as a large language model to determine the semantic suitability of the candidate sample pairs in the medical domain. The large language model outputs similar or dissimilar labels for each candidate sample pair in the candidate sample pair set, retaining candidate sample pairs with similar labels and randomly retaining candidate sample pairs with dissimilar labels with a 10% probability. By using the large language model to evaluate and filter the domain suitability of the candidate sample pair set in the above manner, and then using the BGE encoder and DBSCAN clustering algorithm to perform semantic deduplication on the filtered candidate sample pairs, training data with optimized annotation effect is obtained. This method can greatly reduce the cost of manual annotation. Only the training data with optimized annotation effect needs to be manually annotated to obtain training data with a relatively balanced number of positive and negative sample pairs. The ratio of similar to dissimilar training data extracted using this negative sample screening method was increased from 1:9 to 3:7, which alleviated the problem of imbalanced training data to some extent.
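The retention rule described above (keep every pair labeled similar, keep each pair labeled dissimilar with 10% probability) can be sketched as follows. The function name, the label strings, and the seeded random generator are assumptions for illustration, and the labels are supplied directly rather than produced by a language model. The sketch does not attempt to reproduce the reported 1:9 to 3:7 ratios, which also depend on the later deduplication and manual annotation.

```python
import random


def filter_candidates(labeled_pairs, keep_negative_prob=0.10, seed=0):
    """Keep all 'similar' pairs and a random fraction of 'dissimilar' pairs.

    labeled_pairs: iterable of (pair, label) with label in {'similar', 'dissimilar'}.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    kept = []
    for pair, label in labeled_pairs:
        if label == "similar" or rng.random() < keep_negative_prob:
            kept.append((pair, label))
    return kept


# Toy input: 10 pairs labeled similar and 90 labeled dissimilar.
labeled = [((f"q{i}", f"kb{i}"), "similar") for i in range(10)]
labeled += [((f"q{i}", f"kb{i}"), "dissimilar") for i in range(10, 100)]
kept = filter_candidates(labeled)
kept_similar = sum(1 for _, label in kept if label == "similar")
kept_dissimilar = len(kept) - kept_similar
```

Because only the dissimilar side is subsampled, the share of similar pairs in the retained set rises without discarding any positive example.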

[0073] S2. Input the sample pairs from the training data into the student model to obtain the first projection vector and the first similarity vector, and input the sample pairs from the training data into the teacher model to obtain the second projection vector and the second similarity vector. Construct a focus loss function based on topic-intent consistency constraints from the similarity labels of the sample pairs in the training data, the topic-intent consistency classification results, and the first similarity vector. Construct a distillation loss function from the output of the student model and the output of the teacher model. Construct the total loss function from the focus loss function and the distillation loss function, and use it to complete the knowledge distillation from the teacher model to the student model, resulting in the trained student model.

[0074] In a specific embodiment, sample pairs from the training data are input into the student model to obtain a first projection vector and a first similarity vector, and sample pairs from the training data are input into the teacher model to obtain a second projection vector and a second similarity vector, specifically including:

[0075] The student model includes a first embedding layer and a first feature extraction layer connected in sequence, as well as a first feature projection layer and a first similarity prediction module set in parallel. Sample pairs from the training data are input into the student model, first passing through the first embedding layer to extract the corresponding first embedding vector. The first embedding vector is then input into the first feature extraction layer to obtain a first feature vector. The first feature vector is then input into the first feature projection layer and the first similarity prediction module to obtain a first projection vector and a first similarity vector, as shown in the following equation:

[0076] $h_{s\text{-}proj} = \mathrm{project\_linear}(h_s)$;

[0077] $p_{ss} = \mathrm{softmax}(\mathrm{mlp}(h_s))$;

[0078] Where $h_s$ represents the first feature vector; $h_{s\text{-}proj}$ represents the first projection vector; project_linear represents the linear layer; $p_{ss}$ represents the first similarity vector, which has a dimension of 2 and whose components are the probabilities of similarity and dissimilarity, respectively; mlp represents a multilayer perceptron; softmax represents the softmax function; and sigmoid represents the sigmoid activation function;

[0079] The teacher model comprises a second embedding layer and a second feature extraction layer connected in sequence, as well as a second feature projection layer and a second similarity prediction module set in parallel. Sample pairs from the training data are input into the teacher model, first passing through the second embedding layer to extract the corresponding second embedding vector. This second embedding vector is then input into the second feature extraction layer to obtain the second feature vector. The second feature vector is then input into the second feature projection layer and the second similarity prediction module to obtain the second projection vector and the second similarity vector, as shown in the following equation:

[0080] $h_{t\text{-}proj} = \mathrm{project\_linear}(h_t)$;

[0081] $p_{ts} = \mathrm{softmax}(\mathrm{mlp}(h_t))$;

[0082] Where $h_t$ represents the second feature vector; $h_{t\text{-}proj}$ represents the second projection vector; and $p_{ts}$ represents the second similarity vector, which has a dimension of 2 and whose components are the probabilities of similarity and dissimilarity, respectively.

[0083] Specifically, in the embodiments of this application, a lightweight BERT model (such as tiny-BERT, tiny-ERNIE model) is used as the student model. Its structure mainly includes a first embedding layer, a first feature extraction layer, a first feature projection layer, and a first similarity prediction module. The sample pairs in the training data are first input into the first embedding layer to obtain a first embedding vector. The first embedding vector includes a token sequence (input_ids), an attention mask (attention_mask), and a sentence segmentation identifier (token_type_ids). The first embedding vector is input into the first feature extraction layer, which adopts an encoder structure to extract a first feature vector. The first feature vector is then input into the first feature projection layer and the first similarity prediction module to obtain a first projection vector and a first similarity vector. The first feature projection layer is a linear layer.
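The three input fields named above follow the usual BERT sentence-pair layout, which can be sketched as below. The whitespace tokenizer, the eight-entry vocabulary, and the function name `encode_pair` are toy assumptions; a real BERT-style model would use its own subword tokenizer and vocabulary.

```python
def encode_pair(query, candidate, vocab, max_len=16):
    """Build BERT-style inputs laid out as [CLS] query [SEP] candidate [SEP]."""
    unk = vocab["[UNK]"]
    q_ids = [vocab.get(tok, unk) for tok in query.lower().split()]
    c_ids = [vocab.get(tok, unk) for tok in candidate.lower().split()]

    input_ids = [vocab["[CLS]"]] + q_ids + [vocab["[SEP]"]] + c_ids + [vocab["[SEP]"]]
    # Segment 0 covers [CLS] + query + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(q_ids) + 2) + [1] * (len(c_ids) + 1)

    # Naive truncation; this may cut off the final [SEP] for long inputs.
    input_ids = input_ids[:max_len]
    token_type_ids = token_type_ids[:max_len]
    attention_mask = [1] * len(input_ids)

    # Pad to max_len; padded positions are masked out with 0.
    pad = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "chest": 4, "pain": 5, "angina": 6, "pectoris": 7}
encoded = encode_pair("chest pain", "angina pectoris", toy_vocab, max_len=8)
```

The `token_type_ids` field is what lets the encoder tell the customer question apart from the knowledge-base question within a single input sequence.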

[0084] Furthermore, a BERT-like model with a large number of parameters (such as the ERNIE-health model) can be used as a teacher model. This teacher model can be a pre-trained model with relatively strong semantic matching capabilities. By performing knowledge distillation between this teacher model and the student model, the student model can learn the semantic matching capabilities of the teacher model. The overall structure of the teacher model is similar to that of the student model, and will not be elaborated further here.

[0085] In a specific embodiment, a distillation loss function is constructed based on the output results of the student model and the teacher model, specifically including:

[0086] A mean squared error (MSE) loss function is constructed based on the first projection vector and the second projection vector and used as the first distillation loss function, as shown in the following equation:

[0087] $\mathrm{loss}_{hidden} = \mathrm{MSE}(h_{t\text{-}proj}, h_{s\text{-}proj})$;

[0088] where $\mathrm{loss}_{hidden}$ represents the first distillation loss function, $h_{s\text{-}proj}$ represents the first projection vector output by the student model, and MSE represents the mean squared error loss function.

[0089] A mean squared error loss function is likewise constructed based on the first similarity vector and the second similarity vector and used as the second distillation loss function, as shown in the following equation:

[0090] $\mathrm{loss}_{s\_logits} = \mathrm{MSE}(p_{ts}, p_{ss})$;

[0091] where $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function and $p_{ss}$ represents the first similarity vector output by the student model.
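The two distillation terms can be sketched in a few lines, following the MSE(...) form of the two equations above. This is a minimal sketch for a single sample pair; the helper name `mse` and the example vectors are hypothetical, and it assumes the student and teacher projection layers output vectors of the same length so that an element-wise comparison is possible.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical teacher (t) and student (s) outputs for one sample pair.
h_t_proj, h_s_proj = [0.5, 1.0], [0.25, 1.5]
p_ts, p_ss = [0.9, 0.1], [0.7, 0.3]

loss_hidden = mse(h_t_proj, h_s_proj)   # first distillation loss
loss_s_logits = mse(p_ts, p_ss)         # second distillation loss
```

Both terms shrink toward zero as the student's projection and similarity outputs approach those of the teacher.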

[0092] In specific embodiments, the focal loss function based on the topic-intent consistency constraint is expressed as:

[0093] $\mathrm{loss}_{focal\text{-}consistency} = -\alpha (1-p)^{\gamma_1}\, \beta (1-C)^{\gamma_2} \log(p)$;

[0094] where $\mathrm{loss}_{focal\text{-}consistency}$ represents the focal loss function based on the topic-intent consistency constraint; $\alpha$ and $\beta$ represent the first balance factor and the second smoothing factor, respectively; $\gamma_1$ and $\gamma_2$ represent the first modulation factor and the second adjustment factor, respectively; $p$ represents the probability of the corresponding class, selected from the first similarity vector according to the similarity label of the sample pair in the training data; and $C$ represents the topic-intent consistency coefficient determined from the topic-intent consistency classification result, expressed as:

[0095]

[0096] In a specific embodiment, the total loss function is expressed as:

[0097] $\mathrm{loss}_{total} = \mu_1\,\mathrm{loss}_{hidden} + \mu_2\,\mathrm{loss}_{s\_logits} + \mu_3\,\mathrm{loss}_{focal\text{-}consistency}$;

[0098] where $\mathrm{loss}_{total}$ represents the total loss function, $\mathrm{loss}_{focal\text{-}consistency}$ represents the focal loss function based on the topic-intent consistency constraint, $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function, $\mathrm{loss}_{hidden}$ represents the first distillation loss function, and

[0099] $\mu_1$, $\mu_2$, and $\mu_3$ represent the corresponding weighting coefficients, respectively.
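The weighted sum above might be computed as in the following sketch. The function name `total_loss` and the default weights of 1.0 are assumptions; the source does not state values for μ1, μ2, and μ3.

```python
def total_loss(loss_hidden, loss_s_logits, loss_focal_consistency,
               mu1=1.0, mu2=1.0, mu3=1.0):
    """Weighted sum of the two distillation losses and the focal-consistency loss.

    The default weights are placeholders, not values from the source.
    """
    return (mu1 * loss_hidden
            + mu2 * loss_s_logits
            + mu3 * loss_focal_consistency)

# Hypothetical per-batch loss values and weights.
example = total_loss(0.2, 0.1, 0.5, mu1=0.5, mu2=1.0, mu3=2.0)
```

Raising μ3 relative to μ1 and μ2 would shift training toward the label-driven focal term and away from imitating the teacher.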

[0100] Specifically, embodiments of this application integrate the topic classification information and intent classification information of sample pairs in the training data to construct a multi-dimensional constraint mechanism that guides the student model to focus on key semantic features during training. The core idea is that sample pairs with inconsistent topics or intents have low relevance in semantic similarity tasks; therefore, a topic-intent consistency constraint is introduced to dynamically adjust the weights of sample pairs in the training data within the loss function, thereby improving the student model's generalization ability on imbalanced data such as medical dialogues. The model's convergence direction is constrained through the following steps:

[0101] (1) Classification information extraction: Extract the topic (text1_topic, text2_topic) and intent (text1_intent, text2_intent) of the input sample pair (text1, text2). The topic contains implicit entity features (such as "dermatology" and "hair repair"), while the intent represents the type of user need (such as "encryption" and "remodeling").

[0102] (2) Based on the degree of matching between topic and intent, the sample pairs are divided into three categories to obtain the topic-intent consistency classification result: completely consistent (Score=3), where both topic and intent match (text1_topic=text2_topic and text1_intent=text2_intent); partially consistent (Score=2), where only one of topic or intent matches (text1_topic=text2_topic or text1_intent=text2_intent, but not both); and completely inconsistent (Score=1), where neither topic nor intent matches.

[0103] (3) Hierarchical design of the consistency penalty: the topic-intent consistency classification results are mapped to topic-intent consistency coefficients (e.g., 3→0.97, 2→0.5, 1→0.03), which are introduced into the loss function as constraint signals. Through this weighting, the loss contribution of low-consistency sample pairs (e.g., Score=1 or 2) is amplified, alleviating the data imbalance problem.
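Steps (2) and (3) can be sketched as a scoring function plus a lookup table. This is a minimal sketch: the function names are hypothetical, the coefficient values are the examples quoted in step (3), and the topic and intent labels are assumed to have already been extracted as plain strings in step (1).

```python
# Example mapping from step (3); other values could be chosen.
CONSISTENCY_COEFFICIENT = {3: 0.97, 2: 0.5, 1: 0.03}

def consistency_score(text1_topic, text2_topic, text1_intent, text2_intent):
    """Return 3 (completely consistent), 2 (partially) or 1 (completely inconsistent)."""
    topic_match = text1_topic == text2_topic
    intent_match = text1_intent == text2_intent
    if topic_match and intent_match:
        return 3
    if topic_match or intent_match:
        return 2
    return 1

def consistency_coefficient(score):
    """Map a consistency score to its coefficient C."""
    return CONSISTENCY_COEFFICIENT[score]

# Hypothetical labels for one sample pair: same topic, different intent.
score = consistency_score("dermatology", "dermatology", "price", "recovery time")
coefficient = consistency_coefficient(score)
```

Because the fully consistent case is tested first, the `or` branch is only reached when exactly one of the two labels matches.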

[0104] Therefore, by combining the topic and intent constraints with the focal loss, the focal loss function based on the topic-intent consistency constraint is constructed. In this loss function, the probability of the corresponding dimension in the first similarity vector is selected according to the similarity label of the sample pair in the training data, and the topic-intent consistency coefficient obtained by the mapping is applied. This introduces a penalty mechanism: the larger the probability and the topic-intent consistency coefficient, the smaller the sample pair's share of the focal loss. The penalty mechanism makes the student model pay more attention, during training, to sample pairs whose topic and intent are inconsistent and for which the model's prediction confidence is insufficient, thereby mitigating the data imbalance problem.
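The penalty mechanism described above can be checked numerically with a short sketch of the loss for one sample pair. The function names, the default hyperparameter values, and the convention that index 0 of the similarity vector holds the "similar" probability are assumptions for illustration; the source does not give values for α, β, γ1, or γ2.

```python
import math

def select_p(similarity_vector, is_similar):
    """Pick the probability of the labelled class (index 0 = similar, assumed)."""
    return similarity_vector[0] if is_similar else similarity_vector[1]

def focal_consistency_loss(p, c, alpha=0.25, beta=1.0, gamma1=2.0, gamma2=1.0):
    """-alpha * (1 - p)^gamma1 * beta * (1 - c)^gamma2 * log(p) for one sample pair.

    p: probability assigned to the labelled class, 0 < p <= 1.
    c: topic-intent consistency coefficient in [0, 1].
    Default hyperparameters are placeholders, not values from the source.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -alpha * (1.0 - p) ** gamma1 * beta * (1.0 - c) ** gamma2 * math.log(p)

p = select_p([0.6, 0.4], is_similar=True)
loss_inconsistent = focal_consistency_loss(p, c=0.03)  # Score=1 pair
loss_consistent = focal_consistency_loss(p, c=0.97)    # Score=3 pair
```

With the same prediction confidence, the completely inconsistent pair contributes a much larger loss than the completely consistent one, which matches the behaviour the paragraph describes.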

[0105] Furthermore, the first distillation loss function is constructed from the first projection vector obtained from the student model and the second projection vector obtained from the teacher model, and the second distillation loss function is constructed from the first similarity vector output by the student model and the second similarity vector output by the teacher model. The total loss function is then constructed from the focal loss function based on the topic-intent consistency constraint, the first distillation loss function, and the second distillation loss function. Minimizing this total loss function realizes knowledge distillation from the teacher model to the student model, allowing the trained student model to achieve better semantic matching results.

[0106] S3: Obtain the customer question to be matched and perform a preliminary search on it using the medical knowledge base to obtain several preliminary matching questions; input the customer question to be matched and each preliminary matching question into the trained student model, output the corresponding similarity vector, determine the matching question with the highest semantic similarity to the customer question to be matched based on the similarity vectors corresponding to all preliminary matching questions, and output the corresponding response statement.

[0107] Specifically, the trained student model is deployed. After obtaining the customer question to be matched, it first performs a preliminary search using a medical knowledge base to obtain preliminary matching questions. Then, the customer question to be matched and each preliminary matching question are input into the trained student model to obtain the corresponding similarity vector. This similarity vector is then used to find the matching question with the highest similarity and output its corresponding response.
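The matching step can be sketched as a re-ranking loop over the preliminary matching questions. In this sketch, `best_match` and `toy_score_pair` are hypothetical names, and the word-overlap scorer merely stands in for the trained student model so that the example runs on its own; the questions and responses are placeholder strings.

```python
def best_match(customer_question, candidates, score_pair):
    """Return (p_similar, matching_question, response) with the highest similarity.

    candidates: list of (matching_question, response) from the preliminary search.
    score_pair: callable returning [p_similar, p_dissimilar] for a question pair.
    """
    if not candidates:
        return None
    scored = [(score_pair(customer_question, q)[0], q, r) for q, r in candidates]
    return max(scored, key=lambda item: item[0])

def toy_score_pair(a, b):
    """Stand-in for the student model: Jaccard word overlap as p_similar."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    sim = len(wa & wb) / len(union) if union else 0.0
    return [sim, 1.0 - sim]

# Hypothetical preliminary matches retrieved from the knowledge base.
candidates = [
    ("what causes thyroid nodules", "response A"),
    ("how is eczema treated", "response B"),
]
result = best_match("how is eczema usually treated", candidates, toy_score_pair)
```

Only the first component of each similarity vector is used for ranking here; a deployment could also apply a minimum-similarity threshold before returning a response, which the source does not specify.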

[0108] Furthermore, the semantic matching method for medical knowledge bases with imbalanced positive and negative samples proposed in the embodiments of this application is compared, on the test sets of the thyroid and dermatology departments, with the existing Bert-base semantic matching model trained with focal loss (results shown in Table 1) and with the existing Text-sim model (results shown in Table 2). The proposed method achieves higher accuracy and F1 in different departments, further demonstrating the effectiveness of the present invention.

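[0109] Table 1

[0110]
| Model | Accuracy | Recall | F1 |
| --- | --- | --- | --- |
| Bert-base+focal loss | 0.8440 | 0.5670 | 0.6783 |
| This invention | 0.8701 | 0.6100 | 0.7171 |

[0111] Table 2

[0112]
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Text-sim model | 0.6311 | 0.6694 | 0.7251 |
| This invention | 0.8094 | 0.6034 | 0.7981 |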

[0113] With further reference to Figure 2, as an implementation of the methods shown in the above figures, this application provides an embodiment of a semantic matching device for a medical knowledge base with imbalanced positive and negative samples. This device embodiment corresponds to the method embodiment shown in Figure 1, and the device can be applied to various electronic devices.

[0114] This application provides a semantic matching device for a medical knowledge base with imbalanced positive and negative samples, including:

[0115] Training data construction module 1 is configured to acquire search terms and customer questions from medical scenario dialogues and form a query set, combine these with matching questions from a medical knowledge base to construct candidate sample pairs, filter and label the candidate sample pairs, and adjust the ratio of positive to negative sample pairs to obtain training data. The module further extracts the topic and intent of each sample pair in the training data and classifies the sample pairs according to topic and intent to obtain the topic-intent consistency classification results;

[0116] Knowledge distillation module 2 is configured to input sample pairs from the training data into the student model to obtain a first projection vector and a first similarity vector, and to input sample pairs from the training data into the teacher model to obtain a second projection vector and a second similarity vector; to construct the focal loss function based on the topic-intent consistency constraint from the similarity labels of the sample pairs in the training data, the topic-intent consistency classification results, and the first similarity vector; to construct the distillation loss function from the output of the student model and the output of the teacher model; to construct the total loss function from the focal loss function and the distillation loss function; and to use the total loss function to complete knowledge distillation from the teacher model to the student model, obtaining the trained student model.

[0117] Semantic matching module 3 is configured to acquire the customer question to be matched and perform a preliminary search on it using a medical knowledge base to obtain several preliminary matching questions; input the customer question to be matched and each preliminary matching question into the trained student model, output the corresponding similarity vector, determine the matching question with the highest semantic similarity to the customer question to be matched based on the similarity vectors of all the preliminary matching questions, and output the corresponding response statement.

[0118] Figure 3 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present invention. As shown in Figure 3, the electronic device in this embodiment includes a processor 301 and a memory 302, wherein the memory 302 is used to store computer-executable instructions, and the processor 301 is used to execute the computer-executable instructions stored in the memory to implement the steps performed by the electronic device in the above embodiment. For details, please refer to the relevant descriptions in the foregoing method embodiments.

[0119] Alternatively, the memory 302 can be either standalone or integrated with the processor 301.

[0120] When the memory 302 is set up independently, the electronic device also includes a bus 303 for connecting the memory 302 and the processor 301.

[0121] This invention also provides a computer storage medium storing computer-executable instructions, which, when executed by processor 301, implement the above method.

[0122] This invention also provides a computer program product, including a computer program that, when executed by a processor 301, implements the above-described method.

[0123] In the embodiments provided by this invention, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.

[0124] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to implement the solution of this embodiment according to actual needs.

[0125] Furthermore, the functional modules in the various embodiments of this invention can be integrated into one processing unit, or each module can exist physically separately, or two or more modules can be integrated into one unit. The unit formed by the above modules can be implemented in hardware or in the form of hardware plus software functional units.

[0126] The integrated modules implemented as software functional modules described above can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor 301 to execute some steps of the methods of the various embodiments of this application.

[0127] It should be understood that the processor 301 described above can be a Central Processing Unit (CPU) or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be performed directly by the hardware processor 301, or by a combination of hardware and software modules within the processor 301.

[0128] The memory 302 may include high-speed RAM memory, and may also include non-volatile memory (NVM), such as at least one disk storage device, and may also be a USB flash drive, portable hard drive, read-only memory, disk or optical disc, etc.

[0129] Bus 303 can be an Industry Standard Architecture (ISA), a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Bus 303 can be divided into address bus, data bus, control bus, etc. For ease of illustration, the bus 303 in the accompanying drawings of this application is not limited to only one bus 303 or one type of bus 303.

[0130] The aforementioned storage medium can be implemented from any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The storage medium can be any available medium accessible to general-purpose or special-purpose computers.

[0131] An exemplary storage medium is coupled to a processor 301, enabling the processor 301 to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor 301. The processor 301 and the storage medium can reside in an application-specific integrated circuit (ASIC). Alternatively, the processor 301 and the storage medium can exist as discrete components in an electronic device or a host device.

[0132] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.

[0133] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A semantic matching method for medical knowledge bases with imbalanced positive and negative samples, characterized in that it includes the following steps: The search terms and customer questions in medical scenario dialogues are obtained and formed into a query set. Candidate sample pairs are constructed by combining them with matching questions in the medical knowledge base. The candidate sample pairs are screened and labeled, and the ratio of positive sample pairs to negative sample pairs is adjusted to obtain training data. The topic and intent of the sample pairs in the training data are extracted respectively, and the sample pairs in the training data are classified according to the topic and intent to obtain the topic-intent consistency classification result; The sample pairs in the training data are input into the student model to obtain a first projection vector and a first similarity vector. The sample pairs in the training data are input into the teacher model to obtain a second projection vector and a second similarity vector. Based on the similarity labels of the sample pairs in the training data, the topic-intent consistency classification result, and the first similarity vector, a focal loss function based on the topic-intent consistency constraint is constructed, expressed as follows: $\mathrm{loss}_{focal\text{-}consistency} = -\alpha (1-p)^{\gamma_1}\, \beta (1-C)^{\gamma_2} \log(p)$; where $\mathrm{loss}_{focal\text{-}consistency}$ represents the focal loss function based on the topic-intent consistency constraint, $\alpha$ and $\beta$ represent the first balance factor and the second smoothing factor, respectively, $\gamma_1$ and $\gamma_2$ represent the first modulation factor and the second adjustment factor, respectively, $p$ represents the probability of the corresponding class selected from the first similarity vector based on the similarity labels of the sample pairs in the training data, and $C$ represents the topic-intent consistency coefficient determined based on the topic-intent consistency classification result, expressed as: [formula missing from the source text]; A distillation loss function is constructed based on the output results of the student model and the output results of the teacher model. A total loss function is constructed based on the focal loss function and the distillation loss function. The total loss function is used to complete the knowledge distillation from the teacher model to the student model, resulting in a trained student model; The customer question to be matched is obtained and a preliminary search is performed on it using the medical knowledge base to obtain several preliminary matching questions. The customer question to be matched and each preliminary matching question are input into the trained student model, which outputs the corresponding similarity vector. Based on the similarity vectors of all the preliminary matching questions, the matching question with the highest semantic similarity to the customer question to be matched is determined and the corresponding response statement is output.

2. The semantic matching method for medical knowledge bases with imbalanced positive and negative samples according to claim 1, characterized in that, The process involves acquiring search terms and customer questions from medical scenario dialogues to form a query set, constructing candidate sample pairs by combining them with matching questions from a medical knowledge base, filtering and labeling these candidate sample pairs, and adjusting the ratio of positive to negative sample pairs to obtain training data. Specifically, this includes: Each customer question in the query set is vectorized using a BGE encoder to obtain a corresponding text vector. Based on the text vector, the DBSCAN clustering algorithm is used to deduplicate all customer questions in the query set to obtain a deduplicated query set. The medical knowledge base is manually cleaned and vectorized using a BGE encoder to construct a vector library; The approximate nearest neighbor algorithm is used to retrieve the top N matching questions in the vector library that have the highest semantic similarity to the query samples in the deduplicated query set, and form a candidate sample pair set with the corresponding query samples; Construct prompt words for judging semantic fit, use a large language model in combination with the prompt words for judging semantic fit to judge the semantic fit of each candidate sample pair in the candidate sample pair set, and output the label of each candidate sample pair, retaining all candidate sample pairs with similar labels and some candidate sample pairs with dissimilar labels; Then, semantic deduplication is performed on all candidate sample pairs with similar labels and some candidate sample pairs with dissimilar labels using the BGE encoder and DBSCAN clustering algorithm, followed by manual annotation to obtain the corresponding similarity labels.

3. The medical knowledge base semantic matching method for positive and negative sample imbalance according to claim 1, characterized in that inputting the sample pairs in the training data into the student model to obtain a first projection vector and a first similarity vector, and inputting the sample pairs in the training data into the teacher model to obtain a second projection vector and a second similarity vector, specifically includes: The student model includes a first embedding layer and a first feature extraction layer connected in sequence, as well as a first feature projection layer and a first similarity prediction module arranged in parallel. Sample pairs from the training data are input into the student model, first passing through the first embedding layer to extract the corresponding first embedding vector. The first embedding vector is then input into the first feature extraction layer to obtain a first feature vector. The first feature vector is then input into the first feature projection layer and the first similarity prediction module to obtain a first projection vector and a first similarity vector, as shown in the following equations: $h_{s\text{-}proj} = \mathrm{project\_linear}(h_s)$; $p_{ss} = \mathrm{softmax}(\mathrm{mlp}(h_s))$; where $h_s$ represents the first feature vector, $h_{s\text{-}proj}$ represents the first projection vector, $\mathrm{project\_linear}$ represents a linear layer, $p_{ss}$ represents the first similarity vector, which has a dimension of 2, corresponding to the probability of similarity and the probability of dissimilarity, respectively, $\mathrm{mlp}$ represents a multilayer perceptron, and $\mathrm{softmax}$ represents the softmax function; The teacher model includes a second embedding layer and a second feature extraction layer connected in sequence, as well as a second feature projection layer and a second similarity prediction module arranged in parallel. Sample pairs from the training data are input into the teacher model, first passing through the second embedding layer to extract the corresponding second embedding vector. The second embedding vector is then input into the second feature extraction layer to obtain a second feature vector. This second feature vector is then input into the second feature projection layer and the second similarity prediction module to obtain a second projection vector and a second similarity vector, as shown in the following equations: $h_{t\text{-}proj} = \mathrm{project\_linear}(h_t)$; $p_{ts} = \mathrm{softmax}(\mathrm{mlp}(h_t))$; where $h_t$ represents the second feature vector, $h_{t\text{-}proj}$ represents the second projection vector, and $p_{ts}$ represents the second similarity vector, which has a dimension of 2, corresponding to the probability of similarity and the probability of dissimilarity, respectively.

4. The medical knowledge base semantic matching method for positive and negative sample imbalance according to claim 3, characterized in that constructing a distillation loss function based on the output results of the student model and the output results of the teacher model specifically includes: Based on the first projection vector and the second projection vector, a mean squared error (MSE) loss function is constructed and used as the first distillation loss function, as shown in the following equation: $\mathrm{loss}_{hidden} = \mathrm{MSE}(h_{t\text{-}proj}, h_{s\text{-}proj})$; where $\mathrm{loss}_{hidden}$ represents the first distillation loss function and MSE represents the mean squared error loss function. A mean squared error loss function is constructed based on the first similarity vector and the second similarity vector and used as the second distillation loss function, as shown in the following equation: $\mathrm{loss}_{s\_logits} = \mathrm{MSE}(p_{ts}, p_{ss})$; where $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function.

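5. The semantic matching method for medical knowledge bases with imbalanced positive and negative samples according to claim 1, characterized in that the total loss function is expressed as: $\mathrm{loss}_{total} = \mu_1\,\mathrm{loss}_{hidden} + \mu_2\,\mathrm{loss}_{s\_logits} + \mu_3\,\mathrm{loss}_{focal\text{-}consistency}$; where $\mathrm{loss}_{total}$ represents the total loss function, $\mathrm{loss}_{focal\text{-}consistency}$ represents the focal loss function based on the topic-intent consistency constraint, $\mathrm{loss}_{s\_logits}$ represents the second distillation loss function, $\mathrm{loss}_{hidden}$ represents the first distillation loss function, and $\mu_1$, $\mu_2$, and $\mu_3$ represent the corresponding weighting coefficients.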

6. A medical knowledge base semantic matching device for positive and negative sample imbalance, comprising: A training data construction module, configured to acquire search terms and customer questions in medical scenario dialogues and form a query set, combine them with matching questions in the medical knowledge base to construct candidate sample pairs, filter and label the candidate sample pairs, and adjust the ratio of positive sample pairs to negative sample pairs to obtain training data; and to extract the topic and intent of the sample pairs in the training data respectively, and classify the sample pairs in the training data according to the topic and intent to obtain the topic-intent consistency classification result; A knowledge distillation module, configured to input sample pairs from the training data into the student model to obtain a first projection vector and a first similarity vector, and input sample pairs from the training data into the teacher model to obtain a second projection vector and a second similarity vector; based on the similarity labels of the sample pairs in the training data, the topic-intent consistency classification result, and the first similarity vector, a focal loss function based on the topic-intent consistency constraint is constructed, expressed as follows: $\mathrm{loss}_{focal\text{-}consistency} = -\alpha (1-p)^{\gamma_1}\, \beta (1-C)^{\gamma_2} \log(p)$; where $\mathrm{loss}_{focal\text{-}consistency}$ represents the focal loss function based on the topic-intent consistency constraint, $\alpha$ and $\beta$ represent the first balance factor and the second smoothing factor, respectively, $\gamma_1$ and $\gamma_2$ represent the first modulation factor and the second adjustment factor, respectively, $p$ represents the probability of the corresponding class selected from the first similarity vector based on the similarity labels of the sample pairs in the training data, and $C$ represents the topic-intent consistency coefficient determined based on the topic-intent consistency classification result, expressed as: [formula missing from the source text]; based on the output results of the student model and the teacher model, a distillation loss function is constructed; based on the focal loss function and the distillation loss function, a total loss function is constructed; and the total loss function is used to complete the knowledge distillation from the teacher model to the student model, resulting in a trained student model; A semantic matching module, configured to acquire the customer question to be matched and perform a preliminary search on it using the medical knowledge base to obtain several preliminary matching questions; input the customer question to be matched and each preliminary matching question into the trained student model, output the corresponding similarity vector, determine the matching question with the highest semantic similarity to the customer question to be matched based on the similarity vectors corresponding to all the preliminary matching questions, and output the corresponding response statement.

7. An electronic device, comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-5.

8. A computer-readable storage medium having stored thereon a computer program, characterized in that, When the program is executed by the processor, it implements the method as described in any one of claims 1-5.

9. A computer program product comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the method as described in any one of claims 1-5.