Intelligent voice interaction service system for doctor-patient based on semantic understanding
By constructing a doctor-patient intelligent voice interaction system based on semantic understanding, the problem of lack of semantic understanding in existing technologies has been solved. It has achieved accurate recognition of user intent, emotion and context, improved the efficiency and accuracy of doctor-patient communication, and enhanced the quality of medical services.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU FEIXIN TECH CO LTD
- Filing Date
- 2025-08-21
- Publication Date
- 2026-06-16
Figure CN121034302B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence and healthcare, and specifically relates to a doctor-patient intelligent voice interaction service system based on semantic understanding.
Background Technology
[0002] In traditional medical services, doctor-patient communication mainly relies on face-to-face interaction or written records. However, this approach has many problems. On the one hand, doctors are busy, and manually recording information is inefficient and prone to overlooking important details. On the other hand, patients may be unable to accurately convey their symptoms and needs due to nervousness or limited expressive abilities. With the development of artificial intelligence technology, intelligent voice interaction technology is gradually being applied to the medical field. However, most existing voice interaction systems are based on keyword matching or simple rules, lacking a deep understanding of semantics, resulting in unsatisfactory interaction effects.
[0003] For example, Chinese patent application CN118430846A discloses a method and system for intelligent voice follow-up of patients outside of hospitals. This includes obtaining patient visit information from HIS records via an interface and generating a follow-up plan. The follow-up plan includes automatically configuring a corresponding follow-up path based on the patient's department and disease type. The follow-up path can include AI voice follow-up or manual follow-up. The AI voice follow-up automatically generates voice follow-up content based on the patient's department and disease type and makes batch outbound calls to the patient. It then determines whether the current batch outbound voice follow-up calls are proceeding normally and integrates voice follow-up tasks based on the determination result. Finally, it executes the voice follow-up tasks and generates a statistical table for managing the patient's voice follow-up data. This technical solution establishes an intelligent voice follow-up system for discharged patients based on intelligent voice and artificial intelligence core technologies, developing a new human-machine coupled doctor-patient interaction service model. Ultimately, it visualizes the interaction results and records the entire process for easy traceability.
[0004] The existing technologies mentioned above all have the following problems: limited functionality and scope of application; lack of deep understanding of user intent, emotions and context; lack of dynamic knowledge retrieval and reasoning capabilities; and lack of flexible multi-turn dialogue management and personalized response generation capabilities.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention proposes a semantic understanding-based intelligent voice interaction service system for doctors and patients. The system receives voice input and performs noise reduction processing, converting it into text data using a speech recognition model. It then constructs a semantic understanding model based on semantic networks and knowledge graphs to perform semantic analysis on the text data, identifying user intent, emotion, and contextual information. Based on the semantic analysis results, it performs semantic retrieval in a medical knowledge base, combining the reasoning capabilities of the knowledge graph to deduce diagnostic results or treatment plans. A dialogue strategy is generated, and a generative model is used to generate natural language responses. The text responses are converted into natural speech using speech synthesis technology, adjusting tone, speed, and emotional expression according to the user's emotional state. User feedback is collected to optimize system performance. This invention improves the efficiency and accuracy of doctor-patient interaction.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] The doctor-patient intelligent voice interaction service system based on semantic understanding includes: a voice input module, a semantic understanding module, a knowledge base module, a dialogue management module, and a speech synthesis module;
[0008] The voice input module includes a noise reduction processing unit and a speech-to-text unit; the noise reduction processing unit is equipped with an adaptive noise cancellation algorithm, which is used to reduce the noise of the input voice; the speech-to-text unit is equipped with a speech recognition model, which is used to convert the noise-reduced voice into text data.
[0009] The semantic understanding module is used to perform semantic analysis on text data using a constructed semantic understanding model based on semantic networks and knowledge graphs, and to identify the user's intent, emotion and contextual information;
[0010] The knowledge base module includes a semantic retrieval unit and a knowledge reasoning unit. The semantic retrieval unit is used to perform semantic retrieval in the medical knowledge base using a combination of semantic similarity matching and logical reasoning to obtain preliminary matching results. The knowledge reasoning unit uses the reasoning capabilities of the knowledge graph, combined with user input and contextual information, to deduce diagnostic results or treatment plans.
[0011] The dialogue management module is used to combine semantic analysis results and knowledge base query results to generate dialogue strategies and generate natural language responses.
[0012] The speech synthesis module is used to convert the generated text response into natural speech, adjust the tone, speed and emotional expression of the synthesized speech in combination with the user's emotional state, and output it to the user.
[0013] Specifically, the steps of the adaptive noise cancellation algorithm include:
[0014] A1: Acquire speech signals and reference noise signals, and perform time synchronization and alignment; the speech signals include target speech and noise;
[0015] A2: Divide the aligned speech signal $d(n)$ and reference noise signal $s(n)$ into $B$ time frames, and apply a window to each frame, where $n$ represents the sampling point.
[0016] A3: In each frame, initialize the adaptive filter's weight vector $W(0) = [w_0(0), w_1(0), \ldots, w_{L-1}(0)]^{T}$ and set the step size factor $\mu$, where $w_{L-1}(0)$ represents the $(L-1)$th weight coefficient in the adaptive filter weight vector, $L$ represents the length of the adaptive filter, and $T$ represents the transpose.
[0017] A4: Based on the reference noise signal $s(n)$, construct the delay vector $S(n) = [s(n), s(n-1), \ldots, s(n-L+1)]^{T}$ of the reference noise signal, where $s(n-L+1)$ represents the sampled value of the reference noise signal at time point $(n-L+1)$;
[0018] A5: Use the adaptive filter to filter the delay vector $S(n)$ of the reference noise signal to obtain the estimated noise signal $\hat{s}(n) = W^{T}(n)S(n)$, where $W(n)$ represents the weight vector of the adaptive filter.
[0019] Specifically, the adaptive noise cancellation algorithm further includes the following steps:
[0020] A6: Obtain the estimated value $e_i(n)$ of the clean speech signal in the i-th frame by calculating the difference between the speech signal $d_i(n)$ of each frame and the estimated noise signal $\hat{s}_i(n)$, that is, $e_i(n) = d_i(n) - \hat{s}_i(n)$;
[0021] A7: Overlap-add the per-frame estimates of the clean speech signal to obtain the reconstructed clean speech signal $e(n)$;
[0022] A8: Set the convergence threshold $h_1$, and update the adaptive filter weight vector $W(n+1)$ using an adaptive algorithm based on the current weight vector, the step size factor, the reconstructed clean speech signal, and the delay vector of the reference noise signal.
[0023] A9: Iterate. If $\mathrm{MSE}(e(n)) < h_1$, dynamically adjust the step size factor and output the final filter weight vector, the estimated value $e_i(n)$ of the clean speech signal in the i-th frame, and the clean speech signal $e(n)$, where $\mathrm{MSE}(\cdot)$ represents the mean square error function.
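Steps A3 to A8 can be illustrated with a minimal sketch, assuming the adaptive algorithm is a least-mean-squares (LMS) update of the form W(n+1) = W(n) + μ·e(n)·S(n). For brevity the sketch runs sample by sample rather than frame by frame, and the function names, the filter length, the step size, and the synthetic test signals are illustrative assumptions rather than values taken from this application.

```python
import math
import random


def lms_noise_cancel(d, s, L=4, mu=0.02):
    """Sample-by-sample LMS noise canceller (illustrative, not per-frame).

    d: primary signal (target speech plus noise)
    s: reference noise signal
    Returns (e, W): the clean-speech estimate and the final weight vector.
    """
    W = [0.0] * L  # A3: initialise W(0) to zeros
    e = []
    for n in range(len(d)):
        # A4: delay vector S(n) = [s(n), s(n-1), ..., s(n-L+1)], zero-padded
        S = [s[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        # A5: estimated noise = W(n)^T S(n)
        noise_hat = sum(w * x for w, x in zip(W, S))
        # A6: clean-speech estimate = primary signal minus estimated noise
        err = d[n] - noise_hat
        e.append(err)
        # A8: W(n+1) = W(n) + mu * e(n) * S(n)
        W = [w + mu * err * x for w, x in zip(W, S)]
    return e, W


def mse(x, y):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Synthetic data (assumed): a slow sine as "speech", white noise as reference,
# and a short FIR path from the reference microphone to the main microphone.
random.seed(0)
N = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * n) for n in range(N)]
ref = [random.uniform(-1.0, 1.0) for _ in range(N)]
noise = [0.8 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0) for n in range(N)]
d = [sp + nz for sp, nz in zip(speech, noise)]

e, W = lms_noise_cancel(d, ref)
```

In this setup the error signal e(n) doubles as the clean-speech estimate, which is why the same quantity appears both in the output and in the weight update.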
[0024] Specifically, the steps of the speech-to-text unit include:
[0025] B1: Obtain the estimated value of the clean speech signal in the i-th frame.
[0026] B2: Convert each frame of the clean speech signal to the frequency domain using the short-time Fourier transform to obtain the spectrum $E_i(k)$ and power spectrum $|E_i(k)|^2$ of each frame;
[0027] B3: Design a set of Mel filters $H_m(k)$, multiply the power spectrum $|E_i(k)|^2$ of each clean speech frame by the Mel filter $H_m(k)$, and sum the products to obtain the Mel filter bank energy $Q(m)$, where $m$ represents the Mel filter index;
[0028] B4: Take the logarithm of Q(m) and perform a discrete cosine transform to obtain the l-th Mel frequency cepstral coefficient α(l), where l represents the index of the Mel frequency cepstral coefficient.
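Steps B2 to B4 can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions: triangular Mel filters, the common 2595·log10(1 + f/700) Mel scale, a DCT-II without normalisation, and a naive DFT in place of an optimised FFT. The function names, the sample rate, and the filter count are hypothetical choices, not values from this application.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """B2: naive DFT of one frame, returning |E_i(k)|^2 for k = 0..N/2."""
    N = len(frame)
    spectrum = []
    for k in range(N // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(M, N, sample_rate):
    """B3: M triangular filters H_m(k) over the N/2 + 1 frequency bins."""
    half = N // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    mel_points = [i * top / (M + 1) for i in range(M + 2)]
    bins = [int(math.floor((N + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    H = [[0.0] * half for _ in range(M)]
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            H[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, peak at center
            H[m - 1][k] = (right - k) / (right - center)
    return H


def mfcc(frame, M=8, n_coeffs=5, sample_rate=8000):
    """B3 + B4: filter bank energies Q(m), then log and DCT to get alpha(l)."""
    P = power_spectrum(frame)
    H = mel_filterbank(M, len(frame), sample_rate)
    Q = [sum(p * h for p, h in zip(P, H[m])) for m in range(M)]
    log_q = [math.log(max(q, 1e-10)) for q in Q]  # floor avoids log(0)
    return [
        sum(log_q[m] * math.cos(math.pi * l * (m + 0.5) / M) for m in range(M))
        for l in range(n_coeffs)
    ]


# One 64-sample frame of a 1 kHz tone sampled at 8 kHz (assumed test input).
frame = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(64)]
coeffs = mfcc(frame)
```

The small floor applied before the logarithm is a practical safeguard for empty filters and is not part of the described steps.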
[0029] Specifically, the steps of the speech-to-text unit further include:
[0030] B5: Load the pre-trained acoustic model, organize the Mel frequency cepstral coefficients $\alpha(l)$ into a time series $A = [\alpha(1), \ldots, \alpha(B)]$, and input $A$ into the pre-trained acoustic model. The forward propagation of the acoustic model outputs the phoneme probability distribution $P(p_b \mid A)$ for each frame, where $p_b$ represents the b-th phoneme, $b$ represents the phoneme index, $A$ represents the Mel frequency cepstral coefficient sequence, $\alpha(B)$ represents the Mel frequency cepstral coefficient of the B-th frame, and $B$ represents the number of time frames; the acoustic model is configured based on a recurrent neural network model.
[0031] B6: Organize the phoneme probability distribution $P(p_b \mid A)$ into a phoneme sequence and feed the sequence into a preloaded neural network language model to calculate the context-dependent probability of each phoneme;
[0032] B7: Use the context-dependent probability of each phoneme to correct the phoneme probability distribution $P(p_b \mid A)$ of the acoustic model, and obtain and output the corrected phoneme sequence probability $P(p_b \mid A, LM)$, where $LM$ represents the neural network language model;
[0033] B8: Based on the output of the acoustic model and the neural network language model, the phoneme sequence is decoded using the beam search algorithm, the decoded phoneme sequence is converted into a text sequence, and the text sequence is output.
[0034] Specifically, the steps of B8 include:
[0035] B8.1: Set the beamwidth parameter λ and initialize a candidate phoneme sequence set D; the beamwidth parameter is the number of candidate sequences retained at each time step t;
[0036] B8.2: At each time step $t$, expand new candidate phoneme sequences from the current candidate phoneme sequence set $D$, and add the logarithm of the phoneme probability distribution $P(p_b \mid A)$ to the product of the logarithm of the phoneme sequence probability $P(p_b \mid A, LM)$ and the weight coefficient $\beta$ of the neural network language model to obtain the score $\mathrm{Score}(p_{1:b})$ of each new candidate phoneme sequence $p_{1:b}$.
[0037] Specifically, the steps of B8 further include:
[0038] B8.3: For each time step t, update the candidate phoneme sequence set to obtain the updated candidate phoneme sequence set.
[0039] B8.4: If the updated candidate phoneme sequence set reaches the preset maximum time length, stop the search;
[0040] B8.5: From the updated candidate phoneme sequence set, select the phoneme sequence with the highest score as the final phoneme sequence $p^{*} = \arg\max_{p} \mathrm{Score}(p)$, map the final phoneme sequence $p^{*}$ to a text sequence $\mathrm{Text}(p^{*})$, and output the final result, where $\arg\max$ indicates finding, from the updated candidate phoneme sequence set, the phoneme sequence $p^{*}$ that maximizes $\mathrm{Score}(p)$.
[0041] Specifically, the implementation steps of the semantic understanding module include:
[0042] C1: Define entities and relations in a semantic network;
[0043] C2: A knowledge graph is constructed using structured and unstructured data and stored in a graph database. The structured data is obtained from medical literature and clinical guidelines, while the unstructured data is obtained from patient medical records. The graph database uses Neo4j.
[0044] C3: Load a pre-trained language model as the base model and fine-tune it using domain data;
[0045] C4: Perform semantic analysis on the input text sequence $\mathrm{Text}(p^{*})$ to identify the user's intent, emotions, and contextual information.
[0046] Specifically, the implementation steps of the semantic retrieval unit include:
[0047] D1: Based on the semantic analysis results in C4, generate the query statement R = f(I, U), where I represents the user intent, U represents the key entity, and f represents the query generation function;
[0048] D2: Calculate the semantic similarity between query statement R and the content in the knowledge base to obtain preliminary matching results;
[0049] D3: Utilize the reasoning capabilities of the knowledge graph, combined with user input and contextual information, to obtain the reasoning result F; the reasoning result includes a diagnosis or treatment plan.
[0050] D4: Sort the preliminary matching results and inference results, and add the inference score Inference(F) to the cosine similarity between the query statement R and the inference result F, combined with the weight coefficient η, to return the inference result with the highest score.
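Steps D2 and D4 can be sketched as follows. The text does not fully specify how the weight coefficient η enters the score, so this sketch assumes the combined score is cosine(R, F) + η × Inference(F). It also assumes a simple bag-of-words vector for the cosine similarity, where a production system would more likely use learned embeddings. The function names, the candidate texts, and the inference scores are illustrative.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """D2: cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, candidates, eta=0.5):
    """D4 (assumed form): score = cosine(R, F) + eta * Inference(F).

    candidates: list of (result_text, inference_score) pairs.
    Returns (score, result_text) pairs sorted from best to worst.
    """
    scored = [(cosine(query, text) + eta * inf, text) for text, inf in candidates]
    return sorted(scored, reverse=True)


query = "headache cause"
candidates = [
    ("migraine is a common cause of recurrent headache", 0.9),
    ("appointment registration opens at 8 am", 0.1),
    ("tension headache cause stress", 0.6),
]
ranked = rank(query, candidates)
best_score, best_text = ranked[0]
```

With these inputs the shorter, closely matching result outranks the one with the higher inference score, which shows how η trades off textual similarity against reasoning confidence.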
[0051] Specifically, the implementation steps of the speech synthesis module include:
[0052] E1: Preprocess the text responses generated by the system, including word segmentation, punctuation processing, and text normalization;
[0053] E2: Analyze the user's emotional state to determine the tone, speed, and emotional expression of the synthesized speech;
[0054] E3: Load the pre-trained deep learning-based speech synthesis model and input the pre-processed text response into the trained deep learning-based speech synthesis model to generate natural speech;
[0055] E4: Adjust the tone, speed, and emotional expression of the synthesized speech based on the user's emotional state, and output the adjusted speech to the user.
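Steps E2 and E4 can be pictured as a lookup from the detected emotional state to prosody settings. The sketch below is one possible realisation, assuming the synthesis engine accepts W3C SSML `prosody` markup, which this application does not state. The emotion labels and the rate and pitch values are hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from emotional state to prosody settings.
PROSODY = {
    "anxious": {"rate": "90%", "pitch": "-2st"},   # slower and lower, to reassure
    "calm": {"rate": "100%", "pitch": "+0st"},     # neutral delivery
    "eager": {"rate": "110%", "pitch": "+1st"},    # slightly brisker
}


def to_ssml(text: str, emotion: str) -> str:
    """Wrap a text response in SSML prosody markup chosen by emotion."""
    settings = PROSODY.get(emotion, PROSODY["calm"])  # fall back to neutral
    return (
        f'<speak><prosody rate="{settings["rate"]}" pitch="{settings["pitch"]}">'
        f"{escape(text)}</prosody></speak>"
    )


ssml = to_ssml("We suggest you get further checked.", "anxious")
```

Keeping the mapping in a table separates the emotion decision (E2) from the rendering step (E4), so either can be tuned independently.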
[0056] Compared with the prior art, the beneficial effects of the present invention are:
[0057] 1. This invention proposes a doctor-patient intelligent voice interaction service system based on semantic understanding, and optimizes and improves its architecture, operation steps and processes. The system has the advantages of a simple process, low investment cost, and low production and operating costs.
[0058] 2. This invention proposes a doctor-patient intelligent voice interaction service system based on semantic understanding. In the voice interaction process, the voice input is denoised by an adaptive noise cancellation algorithm, which can effectively improve the accuracy of voice recognition and make the voice more clearly converted into text data. The semantic understanding model based on semantic network and knowledge graph can accurately identify user intent, emotion and context information, making the interaction more intelligent and natural, and improving communication efficiency and fluency.
[0059] 3. This invention proposes a doctor-patient intelligent voice interaction service system based on semantic understanding. From the perspective of medical services, the system uses semantic similarity matching and logical reasoning to retrieve information from a medical knowledge base based on semantic analysis results. It then combines knowledge graph reasoning capabilities to deduce diagnostic results or treatment plans and ranks the results to ensure the provision of the most relevant medical knowledge. The generated dialogue strategy, combined with a generative model, generates natural language responses, which are then output through speech synthesis and adjusted according to the user's emotions, thus improving the patient's medical experience. At the same time, user feedback is collected to optimize system performance, continuously improve services, and promote the continuous improvement of the quality and efficiency of medical services.
Attached Figure Description
[0060] Figure 1 is a diagram illustrating the architecture of the doctor-patient intelligent voice interaction service system based on semantic understanding, as described in this invention.
[0061] Figure 2 is a flowchart illustrating the principle of the doctor-patient intelligent voice interaction service system based on semantic understanding, as described in this invention.
[0062] Figure 3 is a flowchart illustrating the implementation of the adaptive noise cancellation algorithm in the doctor-patient intelligent voice interaction service system based on semantic understanding according to the present invention.
Detailed Implementation
[0063] Example 1
[0064] Referring to Figure 1 and Figure 2, one embodiment of the present invention provides a doctor-patient intelligent voice interaction service system based on semantic understanding.
[0065] The system includes a voice input module, a semantic understanding module, a knowledge base module, a dialogue management module, a speech synthesis module, and a user feedback module.
[0066] The voice input module is used to receive voice signals, perform noise reduction processing, and convert voice into text data to provide input for subsequent semantic analysis;
[0067] The semantic understanding module is used to perform semantic analysis on the converted text data, identify the user's intent, sentiment and contextual information, and provide a basis for knowledge base query and dialogue strategy generation.
[0068] The knowledge base module is used to generate query statements based on semantic analysis results, and to perform semantic retrieval and reasoning in the medical knowledge base to obtain medical knowledge and information;
[0069] The dialogue management module is used to combine semantic analysis results and knowledge base query results to generate dialogue strategies and natural language responses;
[0070] The dialogue strategy is a core component of the intelligent voice interaction system, responsible for generating appropriate system behavior and natural language responses based on user intent, contextual information, and knowledge base query results. Its content includes user intent understanding, context management, knowledge base query results, system behavior decision-making, response generation, multi-turn dialogue management, error handling and recovery, and user feedback and strategy optimization. Through a reasonable dialogue strategy, the system can achieve efficient and smooth dialogue interaction. Furthermore, the dialogue strategy is implemented using a statistical method, which is existing technology in this field and not an inventive solution of this application; therefore, it will not be elaborated upon here.
[0071] The speech synthesis module is used to convert the generated text response into natural speech, adjust the tone, speed and emotional expression of the synthesized speech based on the user's emotional state, and output it to the user.
[0072] The user feedback module is used to collect user feedback on the interaction results and optimize the performance of modules such as semantic understanding model, knowledge base query and reasoning, dialogue strategy generation and speech synthesis.
[0073] The voice input module includes: a voice receiving unit, a noise reduction processing unit, and a speech-to-text unit;
[0074] The voice receiving unit is used to receive voice input from a microphone, telephone or mobile device, and supports multiple audio formats such as WAV and MP3.
[0075] The noise reduction processing unit is used to perform noise reduction processing on speech using an adaptive noise cancellation algorithm to improve speech quality;
[0076] The speech-to-text unit is used to convert noise-reduced speech into text data using end-to-end speech recognition models, such as Whisper, and supports multilingual and dialect recognition.
[0077] The semantic understanding module includes: semantic network construction unit, knowledge graph construction unit, intent recognition unit, entity extraction unit, sentiment analysis unit, and context understanding unit;
[0078] The semantic network construction unit is used to construct a semantic network for the medical field, covering entities such as diseases, symptoms, drugs, and treatment plans, as well as their relationships.
[0079] The knowledge graph construction unit is used to build medical knowledge graphs using medical literature, clinical guidelines and patient medical record data, supporting efficient semantic query and reasoning;
[0080] The intent recognition unit is used to identify user intent through a classification model, such as disease consultation, appointment registration, and medication guidance.
[0081] The entity extraction unit is used to extract key entities from text, such as disease names, symptoms, and drugs, using named entity recognition technology.
[0082] The sentiment analysis unit is used to identify the user's emotional state, such as anxiety, calmness, and eagerness, using a sentiment analysis model.
[0083] The context understanding unit is used to record the dialogue history through a context memory mechanism, dynamically adjust the semantic analysis results, and ensure the coherence of multi-turn dialogues.
[0084] The knowledge base module includes: a query statement generation unit, a semantic retrieval unit, a knowledge reasoning unit, and a result sorting unit;
[0085] The query statement generation unit is used to generate query statements related to user intent based on the semantic analysis results.
[0086] The semantic retrieval unit is used to retrieve content related to the query statement from the medical knowledge base using a semantic similarity matching method, and obtain preliminary matching results;
[0087] The knowledge reasoning unit is used to deduce possible diagnostic results or treatment plans by leveraging the reasoning capabilities of the knowledge graph and combining user input and contextual information.
[0088] The results sorting unit is used to sort the preliminary matching results and inference results, and prioritizes returning the medical knowledge and information that are most relevant to the user's intent.
[0089] The speech synthesis module includes: a speech synthesis model unit, an emotion-based speech synthesis unit, and a speech output unit;
[0090] The speech synthesis model unit is used to train a deep learning-based speech synthesis model, combined with speech data from the medical field.
[0091] The emotion-based speech synthesis unit is used to adjust the tone, speed and emotional expression of the synthesized speech according to the user's emotional state, thereby improving the user experience;
[0092] The speech output unit is used to convert the generated text response into natural speech and output it to the user.
[0093] In summary, the overall implementation process of a semantic understanding-based intelligent voice interaction service system for doctors and patients includes:
[0094] S1: Receive voice input, use an adaptive noise cancellation algorithm to denoise the input voice, and use a speech recognition model to convert the denoised voice into text data.
[0095] S2: Construct a semantic understanding model based on semantic networks and knowledge graphs, and use the semantic understanding model to perform semantic analysis on text data to identify user intent, sentiment and contextual information;
[0096] S3: Based on the semantic analysis results, generate a query statement and use a combination of semantic similarity matching and logical reasoning to perform semantic retrieval in the medical knowledge base to obtain preliminary matching results. At the same time, using the reasoning ability of the knowledge graph, combined with user input and context information, deduce the diagnosis result or treatment plan, and sort the preliminary matching results and deduction results, prioritizing the return of medical knowledge and information that are most relevant to the user's intent.
[0097] S4: Combine semantic analysis results and knowledge base query results to generate a dialogue strategy, and use a generative model to generate a natural language response;
[0098] S5: Convert the system-generated text reply into natural speech using speech synthesis technology, adjust the tone, speed, and emotional expression of the synthesized speech based on the user's emotional state, and output it to the user;
[0099] S6: Collect user feedback on interaction results and optimize system performance.
[0100] For example, a user asks a question through the voice input module, such as, "I've been having frequent headaches lately, what could be the cause?" The voice input module converts the speech into text and passes it to the semantic understanding module. The semantic understanding module analyzes the text, identifies the user's intent as a medical consultation, and determines the key entity as "headache." Based on the intent and entity information, the dialogue management module retrieves relevant content from the knowledge base to generate a response text, such as, "Headaches can be caused by various reasons, such as migraines, tension headaches, etc. We suggest you get further checked." The speech synthesis module converts the response text into speech and outputs it to the user. The user feedback module collects user satisfaction with the response to optimize system performance.
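The text-side portion of this walkthrough (steps S2 to S4) can be mimicked with a toy pipeline. The sketch below is a simplified stand-in under stated assumptions: keyword rules replace the trained intent, entity, and emotion models, a one-entry dictionary replaces the medical knowledge base, and speech recognition and synthesis are left out. All names and rules are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """Result of the semantic understanding step (S2)."""
    intent: str
    entities: list = field(default_factory=list)
    emotion: str = "calm"


# Toy knowledge base (assumed content, keyed by entity).
KNOWLEDGE = {
    "headache": "Headaches can be caused by various reasons, "
                "such as migraines and tension headaches.",
}


def understand(text: str) -> Analysis:
    """S2: keyword rules standing in for the trained models."""
    lowered = text.lower()
    entities = [name for name in KNOWLEDGE if name in lowered]
    intent = "disease_consultation" if entities else "unknown"
    emotion = "anxious" if "frequent" in lowered else "calm"
    return Analysis(intent, entities, emotion)


def respond(analysis: Analysis) -> str:
    """S3 + S4: look up the knowledge base and build a reply."""
    if analysis.intent == "disease_consultation":
        return KNOWLEDGE[analysis.entities[0]] + " We suggest you get further checked."
    return "Could you describe your symptoms in more detail?"


def handle_turn(text: str):
    """Run one dialogue turn on already-transcribed text."""
    analysis = understand(text)
    return analysis, respond(analysis)


analysis, reply = handle_turn(
    "I've been having frequent headaches lately, what could be the cause?"
)
```

The fallback reply in `respond` corresponds to the error handling and recovery element of the dialogue strategy: when no intent is recognised, the system asks for clarification instead of guessing.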
[0101] Example 2
[0102] Referring to Figure 3, the specific steps of the adaptive noise cancellation algorithm in this embodiment include:
[0103] A1: Acquire speech signals and reference noise signals, and perform time synchronization and alignment; the speech signals include target speech and noise;
[0104] Specifically, the voice signal is acquired through the main microphone, and the reference noise signal is acquired through the reference microphone.
[0105] A2: Divide the aligned speech signal $d(n)$ and reference noise signal $s(n)$ into $B$ time frames, and apply a window to each frame, where $n$ represents the sampling point.
[0106] Windowing refers to multiplying each frame of the signal by a window function. In this application, when the speech signal is processed by framing, the boundaries between frames become discontinuous, which introduces high-frequency components into the frequency domain and causes spectral leakage. Spectral leakage will make the spectral analysis results of the signal inaccurate and affect speech recognition and noise reduction. Therefore, windowing is required. By multiplying each frame of the signal by a window function, the boundaries of the signal are smoothed, the discontinuity between frames is reduced, and spectral leakage is suppressed.
[0107] A3: In each frame, initialize the adaptive filter's weight vector $W(0) = [w_0(0), w_1(0), \ldots, w_{L-1}(0)]^{T}$ and set the step size factor $\mu$, where $w_{L-1}(0)$ represents the $(L-1)$th weight coefficient in the adaptive filter weight vector, $L$ represents the length or order of the adaptive filter, and $T$ represents the transpose.
[0108] A4: Based on the reference noise signal $s(n)$, construct the delay vector $S(n) = [s(n), s(n-1), \ldots, s(n-L+1)]^{T}$ of the reference noise signal, where $s(n-L+1)$ represents the sampled value of the reference noise signal at time point $(n-L+1)$.
[0109] A5: Use the adaptive filter to filter the delay vector $S(n)$ of the reference noise signal to obtain the estimated noise signal $\hat{s}(n) = W^{T}(n)S(n)$, where $W(n)$ represents the weight vector of the adaptive filter;
[0110] A6: Obtain the estimated value of the clean speech signal in the i-th frame by calculating the difference between the speech signal of each frame and the estimated noise signal: $e_i(n) = d_i(n) - \hat{s}_i(n)$, where $d_i(n)$ represents the speech signal of the i-th frame and $\hat{s}_i(n)$ represents the noise signal estimated for the i-th frame;
[0111] A7: Overlap-add the per-frame estimates of the clean speech signal to obtain the reconstructed clean speech signal $e(n)$.
[0112] A8: Set the convergence threshold $h_1$, and update the adaptive filter weight vector as $W(n+1) = W(n) + \mu \times e(n) \times S(n)$ using the adaptive algorithm, based on the current weight vector, the step size factor, the reconstructed clean speech signal, and the delay vector of the reference noise signal.
[0113] A9: Iterate. If $\mathrm{MSE}(e(n)) < h_1$, dynamically adjust the step size factor and output the final filter weight vector, the estimated value $e_i(n)$ of the clean speech signal in the i-th frame, and the clean speech signal $e(n)$, where $\mathrm{MSE}(\cdot)$ represents the mean square error function.
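The framing and windowing of step A2 and the overlap-add reconstruction of step A7 can be sketched as follows. The sketch assumes a periodic Hann window with 50% overlap, a combination for which overlapping windows sum to one so that unmodified frames reconstruct the original signal. The window type, the frame length, and the hop size are assumptions, since the text does not fix them.

```python
import math


def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]


def split_frames(x, N, hop):
    """A2: cut x into overlapping frames and multiply each by the window."""
    w = hann(N)
    return [
        [x[start + n] * w[n] for n in range(N)]
        for start in range(0, len(x) - N + 1, hop)
    ]


def overlap_add(frames, hop):
    """A7: sum the frames back at their original offsets."""
    N = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + N)
    for i, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[i * hop + n] += value
    return out


# Assumed test signal: 256 samples of a sine, 64-sample frames, 32-sample hop.
x = [math.sin(2 * math.pi * 0.03 * n) for n in range(256)]
frames = split_frames(x, N=64, hop=32)
y = overlap_add(frames, hop=32)
```

Only the first and last half-frames differ from the input, because those samples are covered by a single tapered window. This is the smoothing of frame boundaries described in paragraph [0106].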
[0114] The specific steps of the speech-to-text unit include:
[0115] B1: Obtain the estimated value of the clean speech signal in the i-th frame.
[0116] B2: Convert each frame of clean speech signal to the frequency domain using short-time Fourier transform to obtain the spectrum E of each frame of clean speech signal. i (k) and power spectrum |E i (k)| 2 The short-time Fourier transform is a prior art in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0117] B3: Design a set of Mel filters H m (k) represents the power spectrum |E| of each clean speech signal frame. i (k)| 2 With Mel filter H m (k) Multiply and sum to obtain the energy of the Mel filter bank. Where m represents the Mel filter index, N represents the length of each frame, and the design steps of the Mel filter are existing technology in this field and are not the inventive solution of this application, and will not be described in detail here;
[0118] B4: Take the logarithm of Q(m) and perform a discrete cosine transform to obtain the l-th Mel frequency cepstral coefficient. Where l represents the index of the Mel frequency cepstral coefficients, M represents the number of Mel filter banks, cos(·) represents the cosine function, and log(·) represents the logarithmic function;
[0119] B5: Load the pre-trained acoustic model, organize the Mel frequency cepstral coefficients α(l) into a time series A = [α(1),...,α(B)], and input the time series A = [α(1),...,α(B)] into the pre-trained acoustic model. Output the phoneme probability distribution P(p) for each frame through the forward propagation of the acoustic model. b |A), where p b Let α(B) represent the b-th phoneme, b represent the phoneme index, A represent the Mel frequency cepstral coefficient sequence, α(B) represent the Mel frequency cepstral coefficient of the B-th frame, and B represent the time frame number; the acoustic model is configured based on a recurrent neural network model, wherein the recurrent neural network model is prior art in this field and is not an inventive solution of this application, and will not be described in detail here;
[0120] B6: The phoneme probability distribution P(p b |A) Organized as a phoneme sequence And The input is fed into a preloaded neural network language model to calculate the context-dependent probability of each phoneme. Where W0 represents the weight matrix of the output layer of the neural network language model, h represents the state function of the hidden layer of the neural network language model, Softmax(·) represents the activation function, ε0 represents the bias term of the output layer of the neural network language model, and the neural network language model is the prior art in this field and is not an inventive solution of this application, and will not be described in detail here.
[0121] B7: Use the context-dependent probability of each phoneme to correct the phoneme probability distribution $P(p_b \mid A)$ of the acoustic model, and obtain and output the corrected phoneme sequence probability $P(p_b \mid A, LM)$, where LM represents the neural network language model and β represents the weight coefficient of the neural network language model used in the correction;
[0122] B8: Based on the output of the acoustic model and the neural network language model, the phoneme sequence is decoded using the beam search algorithm, the decoded phoneme sequence is converted into a text sequence, and the text sequence is output.
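Steps B3 and B4 can be illustrated with a minimal sketch. The specification leaves the Mel filter design and the exact form of the discrete cosine transform to existing technology, so the following are assumptions made only for illustration: triangular filters spaced on the common 2595·log10(1 + f/700) Mel scale, a one-sided spectrum of N/2 + 1 bins, and an unnormalized DCT-II. The names `mel_filterbank`, `dct2`, and `mfcc_from_power` are hypothetical.

```python
import math

def hz_to_mel(f):
    # Common Mel-scale mapping (an assumption; the patent does not fix one).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, frame_len, sample_rate):
    """Return triangular filters H[m][k] over bins k = 0 .. frame_len // 2."""
    num_bins = frame_len // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [int(math.floor((frame_len + 1) * f / sample_rate)) for f in edges_hz]
    H = [[0.0] * num_bins for _ in range(num_filters)]
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising slope
            H[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope, peak of 1 at center
            H[m - 1][k] = (right - k) / (right - center)
    return H

def dct2(values, num_coeffs):
    """Unnormalized DCT-II of `values`, keeping the first `num_coeffs` terms."""
    M = len(values)
    return [sum(values[m] * math.cos(math.pi * l * (m + 0.5) / M)
                for m in range(M))
            for l in range(num_coeffs)]

def mfcc_from_power(power, H, num_coeffs):
    """B3: filter bank energies Q(m); B4: log followed by DCT."""
    Q = [sum(p * h for p, h in zip(power, row)) for row in H]
    log_Q = [math.log(max(q, 1e-12)) for q in Q]  # floor avoids log(0)
    return dct2(log_Q, num_coeffs)
```

Calling `mfcc_from_power(power, mel_filterbank(10, 256, 8000), 5)` on a 129-bin power spectrum returns five coefficients for that frame; the filter count, frame length, and sampling rate here are illustrative values, not values taken from the specification.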
[0123] The specific steps for B8 include:
[0124] B8.1: Set the beamwidth parameter λ and initialize a candidate phoneme sequence set D; the beamwidth parameter is the number of candidate sequences retained at each time step t;
[0125] The candidate phoneme sequence set D is initialized to be empty.
[0126] B8.2: At each time step t, expand new candidate phoneme sequences from the current candidate phoneme sequence set D. The logarithm of the phoneme probability distribution $P(p_b \mid A)$, plus the product of the logarithm of the phoneme sequence probability $P(p_b \mid A, LM)$ and the weight coefficient β of the neural network language model, yields the score of each new candidate phoneme sequence $p_{1:b}$: $\mathrm{Score}(p_{1:b}) = \log P(p_b \mid A) + \beta \times \log P(p_b \mid A, LM)$;
[0127] B8.3: For each time step t, update the candidate phoneme sequence set to obtain the updated candidate phoneme sequence set.
[0128] B8.4: If the updated candidate phoneme sequence set reaches the preset maximum time length, then stop the search;
[0129] B8.5: From the updated candidate phoneme sequence set, select the phoneme sequence with the highest score as the final phoneme sequence $p^{*}$, map the final phoneme sequence $p^{*}$ to a text sequence $\mathrm{Text}(p^{*})$, and output the final result, where $p^{*} = \arg\max_{p} \mathrm{Score}(p)$ denotes finding, within the updated candidate phoneme sequence set, the phoneme sequence $p^{*}$ that maximizes $\mathrm{Score}(p)$.
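Steps B8.1 to B8.5 can be sketched as a small beam search. This is an illustration under stated assumptions, not the applicant's implementation: the per-step score from B8.2 is accumulated along each candidate sequence, the candidate set starts from a single empty prefix so that it can be expanded, and the search stops when all frames are consumed. The callback `lm_term(prefix, v)` is a hypothetical stand-in for $\log P(p_b \mid A, LM)$, and `to_text` with its lookup table is a hypothetical phoneme-to-text mapping.

```python
import math

def beam_search(am_log_probs, lm_term, beta, beam_width):
    """Return (best phoneme sequence, score).

    am_log_probs[t][v] plays the role of log P(p_b = v | A) at time step t.
    lm_term(prefix, v) is a hypothetical stand-in for log P(p_b | A, LM).
    """
    beams = [((), 0.0)]  # B8.1: assumed start from one empty prefix
    for frame in am_log_probs:  # loop ends at the maximum time length (B8.4)
        candidates = []
        for prefix, score in beams:
            for v, am_lp in enumerate(frame):
                # B8.2: log P(p_b | A) + beta * log P(p_b | A, LM)
                step = am_lp + beta * lm_term(prefix, v)
                candidates.append((prefix + (v,), score + step))
        # B8.3: keep only the beam_width best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # B8.5: highest-scoring sequence

def to_text(phoneme_ids, table):
    """Map a phoneme id sequence to text using a hypothetical lookup table."""
    return "".join(table[p] for p in phoneme_ids)
```

With a two-phoneme vocabulary, per-frame acoustic log-probabilities, and a language model term that returns the same value for every phoneme, the search returns the sequence favored by the acoustic model alone; a larger β lets the language model term outweigh the acoustic term.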
[0130] The specific implementation steps of the semantic understanding module include:
[0131] C1: Define entities and relations in a semantic network;
[0132] C2: A knowledge graph is constructed using structured and unstructured data and stored in a graph database. The structured data is obtained from medical literature and clinical guidelines, while the unstructured data is obtained from patient medical records. The graph database uses Neo4j.
[0133] C3: Load a pre-trained language model as the base model and fine-tune it using domain data. The language model uses the GPT model.
[0134] Domain data refers to medical-related text data, including medical literature, clinical guidelines, patient medical records, and drug instructions, acquired through public datasets or internal enterprise data. Using a pre-trained language model, such as GPT, as the base model and fine-tuning it with domain data can improve the model's performance in specific domain tasks. The fine-tuning process includes:
[0135] (1) Load the pre-trained language model;
[0136] (2) Convert the domain data into the input format of the pre-trained language model. The conversion process is achieved by segmenting and encoding the text and generating labels.
[0137] (3) Define the loss function and optimizer;
[0138] (4) Use domain data to fine-tune the pre-trained language model and evaluate the performance of the fine-tuned language model. The evaluation metric is accuracy.
[0139] (5) Save the fine-tuned language model.
[0140] C4: Perform semantic analysis on the input text sequence $\mathrm{Text}(p^{*})$ to identify the user's intent, emotions, and contextual information.
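Step (2) of the fine-tuning process in C3, converting domain data into the model's input format by segmenting, encoding, and generating labels, can be sketched as follows. The sketch makes several assumptions that the specification does not state: whitespace splitting stands in for the pre-trained model's own tokenizer, labels are the next-token targets typical of GPT-style training, and padded label positions use -100, the value that PyTorch's cross-entropy loss ignores by default. The names `build_vocab` and `encode_example` are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every whitespace-separated token."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode_example(text, vocab, max_len):
    """Return (inputs, labels), each of length max_len - 1.

    labels[i] is the token that follows inputs[i] (next-token prediction).
    """
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    ids = ids[:max_len]
    inputs, labels = ids[:-1], ids[1:]
    pad = max_len - 1 - len(inputs)
    # Padded label positions are marked so that a loss function can skip them.
    return inputs + [vocab["<pad>"]] * pad, labels + [-100] * pad
```

For a real fine-tuning run, the tokenizer shipped with the chosen pre-trained language model would replace `build_vocab`, since the model's embedding table is tied to its own vocabulary.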
[0141] The specific implementation steps of the semantic retrieval unit include:
[0142] D1: Based on the semantic analysis results in C4, generate the query statement R = f(I, U), where I represents the user intent, U represents the key entity, and f represents the query generation function;
[0143] D2: Calculate the semantic similarity between the query statement R and the content in the knowledge base to obtain preliminary matching results. The similarity is based on cosine similarity, which is existing technology in this field and is not an inventive solution of this application. It will not be elaborated here.
[0144] D3: Utilize the reasoning capabilities of the knowledge graph, combined with user input and contextual information, to obtain the reasoning result F; the reasoning result includes a diagnosis or treatment plan. The knowledge graph is existing technology in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0145] D4: Sort the preliminary matching results and inference results; combine the cosine similarity between the query statement R and the inference result F with the inference score Inference(F) using the weight coefficient η, and return the inference result with the highest score, where the score is $\mathrm{Score}(F) = \eta \cdot \mathrm{Sim}(R, F) + (1 - \eta) \cdot \mathrm{Inference}(F)$.
[0146] Sim(·) represents the cosine similarity function.
[0147] The weight coefficient η takes values in the range [0, 1]. Typically, if the query task focuses more on semantic matching, such as question-answering systems or information retrieval, then set η ∈ [0.6, 0.8].
[0148] If the query task focuses more on logical reasoning, such as a diagnostic system or a recommendation system, then set η∈[0.2,0.4].
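The ranking in D2 and D4 can be illustrated with a short sketch that takes the score as η·Sim(R, F) + (1 − η)·Inference(F), a form consistent with the ranges of η given above. The vector representations of the query and of each inference result, the candidate tuples, and the names `cosine_similarity` and `rank_results` are hypothetical and serve only as an example.

```python
import math

def cosine_similarity(u, v):
    """Sim(u, v) for two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_results(query_vec, candidates, eta):
    """Rank (label, vector, inference_score) tuples, best first.

    The score is eta * Sim(R, F) + (1 - eta) * Inference(F).
    """
    scored = [
        (eta * cosine_similarity(query_vec, vec) + (1.0 - eta) * inference, label)
        for label, vec, inference in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(label, score) for score, label in scored]
```

With η = 0.7 a candidate that closely matches the query wording ranks first, while with η = 0.3 a candidate with a high inference score can overtake it, which mirrors the distinction drawn between semantic-matching tasks and reasoning tasks.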
[0149] The specific implementation steps of the speech synthesis module include:
[0150] E1: Preprocess the text responses generated by the system, including word segmentation, punctuation processing, and text normalization;
[0151] E2: Analyze the user's emotional state to determine the tone, speed, and emotional expression of the synthesized speech;
[0152] E3: Load the pre-trained deep learning-based speech synthesis model and input the pre-processed text response into the trained deep learning-based speech synthesis model to generate natural speech; the deep learning-based speech synthesis model is configured based on a deep learning model, and the deep learning model is existing technology in this field and is not an inventive solution of this application, so it will not be described in detail here.
[0153] E4: Adjust the tone, speed, and emotional expression of the synthesized speech based on the user's emotional state, and output the adjusted speech to the user.
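Steps E1, E2, and E4 can be sketched as a text normalization pass followed by a lookup from the detected emotional state to prosody settings. Everything specific in the sketch is an assumption for illustration: the emotion labels, the numeric rate and pitch values, the single unit-expansion rule, and the names `preprocess_reply` and `choose_prosody`. Word segmentation, which matters for Chinese text, is not shown, and the speech synthesis model of E3 is outside the sketch.

```python
import re

# Hypothetical mapping from emotional state to prosody settings (E2, E4).
PROSODY_BY_EMOTION = {
    "anxious": {"rate": 0.9, "pitch_shift": -1.0, "style": "reassuring"},
    "sad": {"rate": 0.95, "pitch_shift": -0.5, "style": "gentle"},
    "neutral": {"rate": 1.0, "pitch_shift": 0.0, "style": "neutral"},
}

def preprocess_reply(text):
    """E1: collapse whitespace, expand a unit, and tidy punctuation spacing."""
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(\d+)\s*mg\b", r"\1 milligrams", text)  # text normalization
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)            # punctuation handling
    return text

def choose_prosody(emotion):
    """E2/E4: pick prosody settings, falling back to neutral for unknown labels."""
    return dict(PROSODY_BY_EMOTION.get(emotion, PROSODY_BY_EMOTION["neutral"]))
```

The returned settings would then be passed to whichever speech synthesis model is loaded in E3, in the form that model accepts.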
[0154] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments under the guidance of the present invention without departing from the spirit and scope of the present invention. All of these variations are within the protection scope of the present invention.
[0155] If the technical solution disclosed herein involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution disclosed herein involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed that they have agreed to the collection of their personal information; or on the personal information processing device, with clear signs / information informing users of the personal information processing rules, authorization is obtained from the individual through pop-up information or by asking the individual to upload their personal information; wherein, the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.
Claims
1. A doctor-patient intelligent voice interaction service system based on semantic understanding, characterized in that the system comprises a voice input module, a semantic understanding module, a knowledge base module, a dialogue management module, and a speech synthesis module. The voice input module includes a noise reduction processing unit and a speech-to-text unit; the noise reduction processing unit is equipped with an adaptive noise cancellation algorithm, which is used to perform noise reduction processing on the input voice; the speech-to-text unit is equipped with a speech recognition model, which is used to convert the noise-reduced speech into text data. The semantic understanding module is used to perform semantic analysis on the text data using a constructed semantic understanding model based on semantic networks and knowledge graphs, and to identify the user's intent, emotion, and contextual information. The knowledge base module includes a semantic retrieval unit and a knowledge reasoning unit; the semantic retrieval unit is used to perform semantic retrieval in the medical knowledge base using a combination of semantic similarity matching and logical reasoning to obtain preliminary matching results; the knowledge reasoning unit uses the reasoning capabilities of the knowledge graph, combined with user input and contextual information, to deduce diagnostic results or treatment plans. The dialogue management module is used to combine the semantic analysis results and the knowledge base query results to generate dialogue strategies and natural language responses. The speech synthesis module is used to convert the generated text reply into natural speech, adjust the tone, speed, and emotional expression of the synthesized speech in combination with the user's emotional state, and output it to the user.
The specific implementation steps of the semantic understanding module include: C1: Define entities and relations in a semantic network; C2: Construct a knowledge graph using structured and unstructured data and store it in a graph database, where the structured data is obtained from medical literature and clinical guidelines, the unstructured data is obtained from patient medical records, and the graph database uses Neo4j; C3: Load a pre-trained language model as the base model and fine-tune it using domain data; C4: Perform semantic analysis on the input text sequence to identify the user's intent, emotion, and contextual information. The specific implementation steps of the semantic retrieval unit include: D1: Based on the semantic analysis results in C4, generate the query statement R = f(I, U), where I represents the user intent, U represents the key entity, and f represents the query generation function; D2: Calculate the semantic similarity between the query statement R and the content in the knowledge base to obtain preliminary matching results; D3: Utilize the reasoning capabilities of the knowledge graph, combined with user input and contextual information, to obtain the reasoning result F, the reasoning result including a diagnosis or treatment plan; D4: Sort the preliminary matching results and inference results, and add the inference score Inference(F) to the cosine similarity between the query statement R and the inference result F, combined with the weight coefficient η, to return the reasoning result with the highest score.
The specific implementation steps of the speech synthesis module include: E1: Preprocess the text responses generated by the system, including word segmentation, punctuation processing, and text normalization; E2: Analyze the user's emotional state to determine the tone, speed, and emotional expression of the synthesized speech; E3: Load the pre-trained deep learning-based speech synthesis model and input the pre-processed text response into the trained deep learning-based speech synthesis model to generate natural speech; E4: Adjust the tone, speed, and emotional expression of the synthesized speech based on the user's emotional state, and output the adjusted speech to the user.
2. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 1, characterized in that the specific steps of the adaptive noise cancellation algorithm include: A1: Acquire the speech signal and the reference noise signal, and perform time synchronization and alignment, the speech signal including target speech and noise; A2: Divide the aligned speech signal and reference noise signal into frames to obtain the speech signal and the reference noise signal in B time frames, and apply a window to each frame of signal, where n represents the number of sampling points; A3: In each frame of signal, initialize the weight vector of the adaptive filter and set the step size factor, where the weight vector contains weight coefficients up to the (L-1)-th, L represents the length of the adaptive filter, and T represents the transpose; A4: Based on the reference noise signal, construct the delay vector of the reference noise signal, whose elements are the sampled values of the reference noise signal at the corresponding time points; A5: Use the adaptive filter to filter the delay vector of the reference noise signal to obtain the estimated noise signal.
3. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 2, characterized in that the specific steps of the adaptive noise cancellation algorithm also include: A6: Calculate the difference between the speech signal of each frame and the estimated noise signal to obtain an estimate of the clean speech signal in the i-th frame; A7: Overlap and add the frame-wise estimates of the clean speech signal to obtain the reconstructed clean speech signal; A8: Set the convergence threshold and, based on the weight vector of the adaptive filter, the step size factor, the reconstructed clean speech signal, and the delay vector of the reference noise signal, update the weight vector of the adaptive filter using an adaptive algorithm; A9: Repeat the iteration; if the convergence condition is met, dynamically adjust the step size factor and output the final filter weight vector, the estimate of the clean speech signal in the i-th frame, and the clean speech signal, where the convergence condition is evaluated with the mean square error function.
4. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 3, characterized in that the specific steps of the speech-to-text unit include: B1: Obtain the estimate of the clean speech signal in the i-th frame; B2: Convert each frame of the clean speech signal to the frequency domain using the short-time Fourier transform to obtain the spectrum $E_i(k)$ and the power spectrum $|E_i(k)|^2$ of each frame of the clean speech signal; B3: Design a set of Mel filters $H_m(k)$, multiply the power spectrum $|E_i(k)|^2$ of each frame of the clean speech signal by the Mel filter $H_m(k)$, and sum to obtain the energy $Q(m)$ of the Mel filter bank, where m represents the Mel filter index; B4: Take the logarithm of $Q(m)$ and perform a discrete cosine transform to obtain the l-th Mel frequency cepstral coefficient $\alpha(l)$, where l represents the index of the Mel frequency cepstral coefficients.
5. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 4, characterized in that the specific steps of the speech-to-text unit also include: B5: Load the pre-trained acoustic model, organize the Mel frequency cepstral coefficients $\alpha(l)$ into a time series $A = [\alpha(1), \ldots, \alpha(B)]$, input the time series into the pre-trained acoustic model, and output the phoneme probability distribution $P(p_b \mid A)$ for each frame through forward propagation of the acoustic model, where $p_b$ represents the b-th phoneme, b represents the phoneme index, A represents the Mel frequency cepstral coefficient sequence, $\alpha(B)$ represents the Mel frequency cepstral coefficient of the B-th frame, and B represents the number of time frames; the acoustic model is configured based on a recurrent neural network model; B6: Organize the phoneme probability distribution $P(p_b \mid A)$ into a phoneme sequence, and input the phoneme sequence into a preloaded neural network language model to calculate the context-dependent probability of each phoneme; B7: Use the context-dependent probability of each phoneme to correct the phoneme probability distribution $P(p_b \mid A)$ of the acoustic model, and obtain and output the corrected phoneme sequence probability $P(p_b \mid A, LM)$, where LM represents the neural network language model; B8: Based on the output of the acoustic model and the neural network language model, decode the phoneme sequence using the beam search algorithm, convert the decoded phoneme sequence into a text sequence, and output the text sequence.
6. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 5, characterized in that the specific steps of B8 include: B8.1: Set the beam width parameter λ and initialize a candidate phoneme sequence set D, the beam width parameter being the number of candidate sequences retained at each time step t; B8.2: At each time step t, expand new candidate phoneme sequences from the current candidate phoneme sequence set D; the logarithm of the phoneme probability distribution $P(p_b \mid A)$, plus the product of the logarithm of the phoneme sequence probability $P(p_b \mid A, LM)$ and the weight coefficient β of the neural network language model, yields the score of each new candidate phoneme sequence $p_{1:b}$: $\mathrm{Score}(p_{1:b}) = \log P(p_b \mid A) + \beta \times \log P(p_b \mid A, LM)$.
7. The doctor-patient intelligent voice interaction service system based on semantic understanding as described in claim 6, characterized in that the specific steps of B8 also include: B8.3: For each time step t, update the candidate phoneme sequence set to obtain the updated candidate phoneme sequence set; B8.4: If the updated candidate phoneme sequence set reaches the preset maximum time length, then stop the search; B8.5: From the updated candidate phoneme sequence set, select the phoneme sequence with the highest score as the final phoneme sequence $p^{*}$, map the final phoneme sequence $p^{*}$ to a text sequence $\mathrm{Text}(p^{*})$, and output the final result, where $p^{*} = \arg\max_{p} \mathrm{Score}(p)$ denotes finding, within the updated candidate phoneme sequence set, the phoneme sequence $p^{*}$ that maximizes $\mathrm{Score}(p)$.