A man-machine conversation method and system for traditional Chinese medicine diagnosis

By employing multimodal data fusion technology and a cross-attention mechanism, standardized mapping and dynamic questioning of symptom locations in the TCM diagnostic system have been achieved, resolving the inconsistency in symptom information collection in existing technologies and improving the accuracy and comprehensiveness of diagnosis.

CN122201720A · Pending · Publication Date: 2026-06-12 · HAIKOU ZHONGXIA TRADITIONAL CHINESE MEDICINE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
HAIKOU ZHONGXIA TRADITIONAL CHINESE MEDICINE TECHNOLOGY CO LTD
Filing Date
2026-03-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing TCM diagnostic systems lack standardized terminology, cannot process patients' natural hand gestures to identify the location of symptoms, and cannot perform dynamic reasoning and follow-up questions based on preliminary identification results, leading to inconsistent and incomplete symptom information collection and insufficient accuracy in diagnosis.

Method used

Using multimodal data fusion technology, real-time video streams and voice data from patients are simultaneously received through binocular cameras and microphone arrays. Combined with a multimodal Transformer model with a cross-attention mechanism, the identification of lesion sites and semantic parsing are achieved, and dynamic follow-up questioning strategies are generated based on a traditional Chinese medicine knowledge graph.

Benefits of technology

It enables the simultaneous acquisition and deep alignment of patients' natural gestures and voice symptoms, ensuring the standardization and consistency of symptom descriptions and improving the comprehensiveness and accuracy of TCM diagnosis.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to the technical field of human-computer dialogue, and in particular to a human-computer dialogue method and system for traditional Chinese medicine diagnosis. The method comprises: synchronously receiving multimodal data input by a patient; performing gesture detection and body part localization on the real-time video stream to identify the symptom location pointed to by the patient, performing speech recognition and semantic parsing on the speech data to extract symptom description features, and obtaining a multimodal fused symptom representation; inputting the multimodal fused symptom representation into a traditional Chinese medicine knowledge graph reasoning engine, determining symptom location information based on a body part-TCM symptom location mapping relationship, generating a dynamic follow-up questioning strategy in combination with the symptom description features, and outputting follow-up questions about the nature, persistence, or accompanying symptoms of pain to the patient; and receiving the patient's feedback on the follow-up questions and integrating the multi-round dialogue information to generate a traditional Chinese medicine syndrome differentiation result and output a diagnostic suggestion. The present application achieves more standardized and comprehensive symptom collection than traditional manual inquiry and significantly improves the accuracy and efficiency of diagnosis.

Description

Technical Field

[0001] This invention relates to the field of human-computer dialogue technology, and in particular to a human-computer dialogue method and system for traditional Chinese medicine diagnosis.

Background Technology

[0002] With the development of artificial intelligence and multimodal interaction technologies, human-computer dialogue systems have been gradually applied in the field of medical auxiliary diagnosis, providing a new technological path to alleviate the strain on medical resources and improve consultation efficiency. In traditional Chinese medicine (TCM) diagnosis, traditional consultations rely on face-to-face communication between doctors and patients, collecting symptom information through the four diagnostic methods of "inspection, auscultation, inquiry, and palpation." Doctors subjectively judge and record the location, nature, and accompanying symptoms of pain described by patients based on their experience. In existing technologies, some systems use voice recognition technology to collect patient symptom descriptions or trigger fixed consultation processes through preset simple gesture commands, achieving basic human-computer interaction functions.

[0003] However, the aforementioned existing technologies have significant shortcomings. First, in traditional manual consultations, different doctors describe symptom location and pain nature differently and lack standardized terminology. This easily leads to inconsistencies and omissions in symptom information collection, affecting the accuracy of diagnosis. Second, existing single-modal interaction systems cannot handle complex scenarios in which patients indicate symptom locations through natural gestures (such as pointing to any body part), because they lack an intelligent understanding of the "pointing-semantics" mapping. Furthermore, existing technologies rely solely on preset, fixed interaction rules and cannot perform dynamic reasoning and follow-up questioning over a TCM knowledge graph based on preliminary recognition results, making it difficult to reproduce the diagnostic thinking process of renowned TCM doctors. In addition, existing visual interaction technologies are mostly limited to simple directional control or digital command recognition. They cannot map the body parts pointed to by the patient to standardized TCM disease locations (such as the Taiyang meridian, the area below the heart, chest pain, etc.), and are even less able to combine multimodal information to generate targeted follow-up questioning strategies, so symptom collection is neither comprehensive nor standardized enough.

Summary of the Invention

[0004] The purpose of this invention is to provide a human-computer dialogue method and system for TCM diagnosis, thereby solving the aforementioned problems in the prior art.

[0005] To achieve the above objectives, the present invention provides a human-computer dialogue method for traditional Chinese medicine diagnosis, comprising the following steps:
S1. Synchronously receive multimodal data input by the patient;
S2. Perform gesture detection and body part localization on the real-time video stream to identify the symptom area pointed to by the patient, perform speech recognition and semantic parsing on the speech data to extract symptom description features, and combine the multimodal Transformer model with the Cross-Attention mechanism to obtain a multimodal fused symptom representation;
S3. Input the multimodal fused symptom representation into the TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM symptom locations, generate a dynamic follow-up questioning strategy in combination with the symptom description features, and output follow-up questions to the patient regarding the nature, persistence, or accompanying symptoms of pain;
S4. Receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.

[0006] In some embodiments of this application, in S1, the multimodal data includes at least a real-time video stream containing the patient's indicating gestures and voice data describing the characteristics of the disease.

[0007] In some embodiments of this application, in S1, synchronously receiving the multimodal data input by the patient includes: acquiring a real-time video stream containing the patient's indicating gestures using a binocular or depth camera; extracting image frames from the real-time video stream to obtain a continuous sequence of static images, each marked with timestamp information; acquiring voice data describing the patient's symptoms using a microphone array; segmenting the voice data into frames and marking them with timestamps corresponding to the video frames; and, based on the timestamp information, aligning the video data and the audio data in time to establish a mapping between gestures and audio content at the same point in time, resulting in a synchronized multimodal data stream.

[0008] In some embodiments of this application, in S2, performing gesture detection and body part localization on the real-time video stream to identify the affected area pointed to by the patient, and performing speech recognition and semantic parsing on the speech data to extract symptom description features, includes: performing human pose estimation on each static image frame of the real-time video stream to extract hand key points and body contour key points; preliminarily determining the disease location indicated by the patient by calculating the intersection of the finger pointing vector and the body part region, and mapping that location to the standard disease location name in the TCM disease location knowledge base; performing automatic speech recognition on the speech data to obtain text information; and using a medical named entity recognition model to extract the disease description features, including the nature of pain, the degree of pain, the frequency of attacks, and accompanying symptoms, thereby establishing a structured representation of the disease features.

[0009] In some embodiments of this application, in S2, obtaining the multimodal fused symptom representation by combining the multimodal Transformer model with the Cross-Attention mechanism includes: the visual feature vector is obtained by vectorizing the lesion site determined by the intersection of the finger pointing vector and the body part region; the text feature vector is obtained by semantically encoding the disease description features extracted from the speech text by the medical named entity recognition model; the visual feature vector and the text feature vector are mapped to the same dimensional space through a linear projection layer to generate a query matrix, a key matrix, and a value matrix; and the Cross-Attention mechanism is applied with the visual features of the symptom location as the query and the textual features describing the symptom as the key and value, so that the multimodal fused symptom representation is obtained through attention-weighted fusion, expressed as:

$$F = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $F$ is the multimodal fused symptom representation; $Q$ is the query matrix of the visual features of the disease location determined by the intersection of the finger pointing vector and the body part region; $K$ is the key matrix; $V$ is the value matrix; $T$ denotes the matrix transpose; $d_k$ is the dimension of the key vectors; and $\mathrm{softmax}(\cdot)$ is the normalized exponential function.

[0010] In some embodiments of this application, in S3, the multimodal fusion of symptom representations is input into the Traditional Chinese Medicine knowledge graph inference engine, and the symptom location information is determined based on the body part-TCM symptom location mapping relationship, including: The spatial location encoding features in the multimodal fusion of disease representations are analyzed and mapped to the disease location determined by the intersection of the finger pointing vector and the body part region. The body part-disease location mapping relationship associated with the disease location is retrieved in the TCM knowledge graph. The body part-disease location mapping relationship includes the entire human body, namely the head, face, body, and limbs. This mapping relationship is then incorporated into the traditional Chinese medicine meridian atlas.

[0011] In some embodiments of this application, in S3, the dynamic follow-up questioning strategy includes: Different follow-up questions are determined based on the different areas of pain. Patients are asked about the specific location of the pain, and based on their feedback, they are asked further about the nature of the pain, its duration, and any accompanying symptoms. The nature of the pain includes at least throbbing pain, dull pain, and stabbing pain; the persistence includes continuous pain and intermittent pain; the accompanying symptoms include at least nausea, vomiting, sweating, chills, fever, and diarrhea.

[0012] In some embodiments of this application, a human-computer dialogue system for traditional Chinese medicine diagnosis is also disclosed, comprising:
a receiving module, used to synchronously receive multimodal data input by the patient;
a parsing module, used to perform gesture detection and body part localization on the real-time video stream to identify the disease part pointed to by the patient, perform speech recognition and semantic parsing on the speech data to extract disease description features, and combine the multimodal Transformer model with the Cross-Attention mechanism to obtain a multimodal fused disease representation;
a follow-up question module, used to input the multimodal fused symptom representation into the TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM symptom locations, generate a dynamic follow-up questioning strategy in combination with the symptom description features, and output follow-up questions to the patient on the nature, persistence, or accompanying symptoms of pain;
a diagnostic module, used to receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.

[0013] The advantages and beneficial effects of this invention compared to the prior art are: 1. This invention integrates visual gesture recognition and speech recognition technologies to construct a multimodal human-computer dialogue system, which realizes the synchronous acquisition and deep alignment of patients' natural gesture recognition (such as pointing to any body part) and speech symptom description. It breaks through the limitations of existing single-modal interaction or simple gesture command recognition, and significantly improves the comprehensiveness of TCM consultation information acquisition and natural interactive experience.

[0014] 2. Based on the mapping relationship between body parts and TCM disease locations, this invention transforms patients' non-standardized natural identifications into standardized TCM disease location terms, and combines the Cross-Attention mechanism to achieve multimodal feature fusion, effectively avoiding information bias caused by differences in subjective expression among different doctors, ensuring the standardization and consistency of symptom description, and greatly improving diagnostic results.

[0015] 3. Based on the initially identified lesion location and symptom characteristics, this invention dynamically generates targeted questioning strategies, progressively inquiring about the nature of pain, duration, and accompanying symptoms. Through multiple rounds of dialogue, the symptom information is gradually improved, significantly enhancing the completeness of symptom collection and the accuracy of diagnosis.

[0016] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

Attached Figure Description

[0017] Figure 1 is a flowchart of a human-computer dialogue method for traditional Chinese medicine diagnosis according to an embodiment of the present invention; Figure 2 is a structural diagram of a human-computer dialogue system for traditional Chinese medicine diagnosis according to an embodiment of the present invention.

Detailed Implementation

[0018] In the description of this invention, it should be noted that the terms "upper," "lower," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, or the orientation or positional relationship commonly used when the product is in use. They are used only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. In the description of this invention, it should also be noted that, unless otherwise explicitly specified and limited, the terms "set," "install," and "connect" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal communication between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0019] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0020] As shown in Figure 1, the present invention provides a human-computer dialogue method for TCM diagnosis, comprising the following steps:
S1. Synchronously receive multimodal data input by the patient;
S2. Perform gesture detection and body part localization on the real-time video stream to identify the symptom area pointed to by the patient, perform speech recognition and semantic parsing on the speech data to extract symptom description features, and combine the multimodal Transformer model with the Cross-Attention mechanism to obtain a multimodal fused symptom representation;
S3. Input the multimodal fused symptom representation into the TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM symptom locations, generate a dynamic follow-up questioning strategy in combination with the symptom description features, and output follow-up questions to the patient regarding the nature, persistence, or accompanying symptoms of pain;
S4. Receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.

[0021] This invention integrates visual gesture recognition and speech recognition technologies to construct a multimodal human-computer dialogue system. It achieves synchronous acquisition and deep alignment of patients' natural gesture recognition (such as pointing to any body part) and speech symptom description, breaking through the limitations of existing single-modal interaction or simple gesture command recognition, and significantly improving the comprehensiveness of TCM consultation information acquisition and natural interactive experience.

[0022] In some embodiments of this application, in S1, the multimodal data includes at least a real-time video stream containing the patient's indicating gestures and voice data describing the characteristics of the disease.

[0023] In some embodiments of this application, in S1, synchronously receiving the multimodal data input by the patient includes: acquiring a real-time video stream containing the patient's indicating gestures using a binocular or depth camera; extracting image frames from the real-time video stream to obtain a continuous sequence of static images, each marked with timestamp information; acquiring voice data describing the patient's symptoms using a microphone array; segmenting the voice data into frames and marking them with timestamps corresponding to the video frames; and, based on the timestamp information, aligning the video data and the audio data in time to establish a mapping between gestures and audio content at the same point in time, resulting in a synchronized multimodal data stream.
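The timestamp-based alignment described above can be sketched as a nearest-neighbor pairing of audio frames to video frames. This is only an illustrative sketch: the function name `align_audio_to_video`, the use of integer millisecond timestamps, and the 20 ms tolerance are assumptions made for the example and are not specified in the application.

```python
from bisect import bisect_left


def align_audio_to_video(video_ms, audio_ms, tolerance_ms=20):
    """Pair each audio frame with the nearest video frame by timestamp.

    video_ms and audio_ms are sorted lists of integer timestamps in
    milliseconds (an assumption of this sketch). Returns a list of
    (audio_index, video_index) pairs whose timestamps differ by at most
    tolerance_ms; audio frames with no close video frame are skipped.
    """
    pairs = []
    for a_idx, t in enumerate(audio_ms):
        pos = bisect_left(video_ms, t)
        # Candidates are the video frames just before and just after t.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(video_ms)]
        if not candidates:
            continue
        v_idx = min(candidates, key=lambda i: abs(video_ms[i] - t))
        if abs(video_ms[v_idx] - t) <= tolerance_ms:
            pairs.append((a_idx, v_idx))
    return pairs


# Hypothetical data: 25 fps video frames and 20 ms audio frames.
video_ms = [0, 40, 80, 120]
audio_ms = [0, 20, 40, 60, 80, 300]  # the last audio frame has no nearby video frame
pairs = align_audio_to_video(video_ms, audio_ms)
```

Each resulting pair links a segment of speech to the gesture frame captured at roughly the same moment, which is the gesture-to-audio mapping the paragraph refers to.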

[0024] In some embodiments of this application, in S2, performing gesture detection and body part localization on the real-time video stream to identify the affected area pointed to by the patient, and performing speech recognition and semantic parsing on the speech data to extract symptom description features, includes: performing human pose estimation on each static image frame of the real-time video stream to extract hand key points and body contour key points; preliminarily determining the disease location indicated by the patient by calculating the intersection of the finger pointing vector and the body part region, and mapping that location to the standard disease location name in the TCM disease location knowledge base; performing automatic speech recognition on the speech data to obtain text information; and using a medical named entity recognition model to extract the disease description features, including the nature of pain, the degree of pain, the frequency of attacks, and accompanying symptoms, thereby establishing a structured representation of the disease features.
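One possible way to compute the intersection of the finger pointing vector with a body part region is a ray-box test. The sketch below assumes, purely for illustration, that each body part region is approximated by an axis-aligned 3D box in the camera coordinate frame and that the pointing vector runs from a finger base key point through the fingertip; the region names, coordinates, and function names are hypothetical and not taken from the application.

```python
def ray_box_distance(origin, direction, box_min, box_max):
    """Slab test: distance parameter along the ray to the box, or None if missed."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-9:
            # Ray is parallel to this pair of faces; it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near, t_far = max(t_near, t1), min(t_far, t2)
        if t_near > t_far:
            return None
    return t_near


def pointed_region(finger_base, finger_tip, regions):
    """Return the name of the nearest region hit by the pointing ray, or None."""
    direction = tuple(t - b for b, t in zip(finger_base, finger_tip))
    hits = []
    for name, (box_min, box_max) in regions.items():
        dist = ray_box_distance(finger_tip, direction, box_min, box_max)
        if dist is not None:
            hits.append((dist, name))
    return min(hits)[1] if hits else None


# Hypothetical body part boxes (metres): (min corner, max corner).
REGIONS = {
    "back of head": ((-0.1, 1.5, -0.1), (0.1, 1.7, 0.0)),
    "upper abdomen": ((-0.15, 1.0, 0.0), (0.15, 1.2, 0.1)),
}

# A finger behind the head, pointing forward along +z toward it.
hit = pointed_region((0.0, 1.6, -0.35), (0.0, 1.6, -0.30), REGIONS)
# The same finger pointing away from the body.
miss = pointed_region((0.0, 1.6, -0.30), (0.0, 1.6, -0.35), REGIONS)
```

In a real system the regions would come from the pose estimation output rather than fixed boxes, and the returned body part name would then be looked up in the TCM disease location knowledge base.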

[0025] In some embodiments of this application, in S2, obtaining the multimodal fused symptom representation by combining the multimodal Transformer model with the Cross-Attention mechanism includes: the visual feature vector is obtained by vectorizing the lesion site determined by the intersection of the finger pointing vector and the body part region; the text feature vector is obtained by semantically encoding the disease description features extracted from the speech text by the medical named entity recognition model; the visual feature vector and the text feature vector are mapped to the same dimensional space through a linear projection layer to generate a query matrix, a key matrix, and a value matrix; and the Cross-Attention mechanism is applied with the visual features of the symptom location as the query and the textual features describing the symptom as the key and value, so that the multimodal fused symptom representation is obtained through attention-weighted fusion, expressed as:

$$F = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $F$ is the multimodal fused symptom representation; $Q$ is the query matrix of the visual features of the disease location determined by the intersection of the finger pointing vector and the body part region; $K$ is the key matrix; $V$ is the value matrix; $T$ denotes the matrix transpose; $d_k$ is the dimension of the key vectors; and $\mathrm{softmax}(\cdot)$ is the normalized exponential function.
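For illustration, scaled dot-product cross-attention with a visual query and textual keys and values can be sketched in plain Python as follows. The sketch assumes the query, key, and value matrices have already been produced by the linear projection layer; the function names and the toy two-dimensional vectors are hypothetical, and a practical system would use a tensor library instead of nested lists.

```python
import math


def softmax(xs):
    """Normalized exponential function over a list of scores."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices.

    Q holds the visual (symptom location) queries; K and V hold the
    textual (symptom description) keys and values.
    """
    d_k = len(K[0])
    fused = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return fused


# Toy inputs: one visual query attending over two text tokens.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(Q, K, V)
```

Because the query is more similar to the first key, the first text token receives the larger attention weight, so the fused vector leans toward the first value row.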

[0026] This invention is based on the mapping relationship between body parts and TCM disease locations. It converts patients' non-standardized natural identification into standardized TCM disease location terms and combines the Cross-Attention mechanism to achieve multimodal feature fusion. This effectively avoids information bias caused by differences in subjective expression among different doctors, ensures the standardization and consistency of symptom description, and greatly improves the diagnostic effect.

[0027] In some embodiments of this application, in S3, inputting the multimodal fused symptom representation into the TCM knowledge graph reasoning engine and determining the symptom location information based on the body part-TCM symptom location mapping relationship includes: analyzing the spatial location encoding features in the multimodal fused symptom representation and mapping them to the disease location determined by the intersection of the finger pointing vector and the body part region; and retrieving, in the TCM knowledge graph, the body part-disease location mapping relationship associated with that disease location. The body part-disease location mapping relationship covers the entire human body, namely the head, face, body, and limbs, and is incorporated into a traditional Chinese medicine meridian chart. For example, the back of the head and the back of the neck are mapped to the Taiyang meridian; pain and stiffness in this area are described as severe headache and neck pain, and the corresponding systemic diagnosis is most likely an exogenous disease. The sides of the head are mapped to the Shaoyang meridian; pain in this area is described as migraine, and the corresponding systemic diagnosis is Shaoyang headache, a condition closely related to the gallbladder and liver.
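As a minimal sketch of the lookup step, the two example mappings given above can be held in a simple table. A dictionary stands in here for the TCM knowledge graph, which is an assumption of this sketch; the names `BODY_PART_TO_TCM_LOCATION` and `map_to_tcm_location` are hypothetical, and only the entries stated in the paragraph above are included.

```python
# Body part -> standardized TCM disease location, using only the examples
# stated in the text; a full knowledge graph would cover the whole body.
BODY_PART_TO_TCM_LOCATION = {
    "back of head": "Taiyang meridian",
    "back of neck": "Taiyang meridian",
    "side of head": "Shaoyang meridian",
}


def map_to_tcm_location(body_part):
    """Return the standardized TCM location, or None if the part is unmapped."""
    key = body_part.strip().lower()
    return BODY_PART_TO_TCM_LOCATION.get(key)


location = map_to_tcm_location("Back of head")
```

Returning `None` for an unmapped body part lets the caller fall back to asking the patient to indicate the location again instead of guessing.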

[0028] In some embodiments of this application, in S3, the dynamic follow-up questioning strategy includes: Different follow-up questions are determined based on the different areas of pain. Patients are asked about the specific location of the pain, and based on their feedback, they are asked further about the nature of the pain, its duration, and any accompanying symptoms. The nature of the pain includes at least throbbing pain, dull pain, and stabbing pain; the persistence includes continuous pain and intermittent pain; the accompanying symptoms include at least nausea, vomiting, sweating, chills, fever, and diarrhea.
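The questioning strategy described above can be read as a slot-filling loop: ask only about the attributes that have not yet been extracted. The sketch below is one hedged interpretation; the slot names, question wording, and the function `next_questions` are hypothetical, while the answer options are the ones listed in the paragraph above.

```python
# Attributes to collect, with answer options taken from the text above.
FOLLOW_UP_SLOTS = {
    "nature": ("is the pain", ["throbbing", "dull", "stabbing"]),
    "persistence": ("is the pain", ["continuous", "intermittent"]),
    "accompanying_symptoms": (
        "do you also have",
        ["nausea", "vomiting", "sweating", "chills", "fever", "diarrhea"],
    ),
}


def next_questions(location, known):
    """Build follow-up questions for every slot not yet filled in `known`."""
    questions = []
    for slot, (prompt, options) in FOLLOW_UP_SLOTS.items():
        if known.get(slot):
            continue  # already extracted from speech or an earlier answer
        questions.append(f"Regarding the {location}: {prompt} {' / '.join(options)}?")
    return questions


# The nature of the pain is already known, so only two questions remain.
questions = next_questions("Taiyang meridian", {"nature": "stabbing"})
```

Calling `next_questions` again after each patient answer, with the updated `known` dictionary, yields the progressive multi-round questioning; an empty list signals that the collected information is complete.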

[0029] This invention dynamically generates targeted questioning strategies based on the initially identified lesion location and symptom characteristics. It progressively asks questions about the nature of pain, duration, and accompanying symptoms, and gradually improves the symptom information through multiple rounds of dialogue, significantly enhancing the completeness of symptom collection and the accuracy of diagnosis.

[0030] In some embodiments of this application, as shown in Figure 2, a human-computer dialogue system for TCM diagnosis is also disclosed, comprising:
a receiving module, used to synchronously receive multimodal data input by the patient;
a parsing module, used to perform gesture detection and body part localization on the real-time video stream to identify the disease part pointed to by the patient, perform speech recognition and semantic parsing on the speech data to extract disease description features, and combine the multimodal Transformer model with the Cross-Attention mechanism to obtain a multimodal fused disease representation;
a follow-up question module, used to input the multimodal fused symptom representation into the TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM symptom locations, generate a dynamic follow-up questioning strategy in combination with the symptom description features, and output follow-up questions to the patient on the nature, persistence, or accompanying symptoms of pain;
a diagnostic module, used to receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.

[0031] The following examples demonstrate two experimental verifications.

[0032] Example 1: Patient Zhang sought medical attention for headache. The system acquires real-time video streams of the patient through a binocular camera and audio data through a microphone array, establishing a time-aligned multimodal data stream.

[0033] In step S2, the system performs human pose estimation on the video stream, extracts the key points of the patient's hands, calculates the intersection of the finger pointing vector and the head contour region, and determines that the patient is pointing to the back of the head. Simultaneously, it performs automatic speech recognition on the speech data, obtaining the text "My head hurts here, and my neck is also uncomfortable." Using a medical named entity recognition model, it extracts the location of the pain (the head and the back of the neck, the latter called "xiang" in Traditional Chinese Medicine), the nature of the pain (undefined), and radiation features (extending to the back of the neck). The system vectorizes the back-of-head region into a visual feature vector and semantically encodes the textual symptoms into a text feature vector. Through the Cross-Attention mechanism, it obtains a multimodal fused symptom representation and identifies the location of the lesion as the Taiyang meridian (the Foot Taiyang Bladder Meridian runs along the back of the head and neck).

[0034] In step S3, the system inputs the multimodal fusion symptom representation into the TCM knowledge graph reasoning engine, identifying the lesion location as Taiyang meridian headache. Since the nature of the pain was not clearly extracted, the system generates a dynamic follow-up question strategy, outputting to the patient: "Based on the location you indicated, this is the Taiyang meridian pathway. Is your headache stabbing, throbbing, or pulsating? Is the pain continuous or intermittent? Is it accompanied by chills or fever?" The patient's feedback: "It's like being pricked with needles, coming in waves, and it gets worse when exposed to wind." The system receives the feedback data, integrates the multi-turn dialogue information, and generates a TCM diagnosis result in step S4: Taiyang meridian headache, syndrome of wind-cold invasion, blood stasis obstructing the meridians. The system outputs diagnostic suggestions: dispel wind and cold, invigorate blood and unblock the meridians, recommending the use of Chuanxiong Tea Decoction with modifications, and instructing the patient to keep their neck warm.

[0035] Example 2: Patient Li presented with upper abdominal discomfort. The system simultaneously received multimodal data. The patient pointed to his epigastric region (below the xiphoid process and upper abdomen) and verbally stated, "I feel a blockage here, it's worse after eating, and sometimes I experience acid reflux."

[0036] In step S2, the system calculates the intersection of the finger pointing vector and the body part region to initially determine the symptom location as the epigastric region (upper abdomen), mapping it to the "area below the heart" and "epigastric" locations in the TCM disease location knowledge base. Speech recognition extracts the symptom description features: the nature of the discomfort is "a feeling of blockage" (bloating), the onset pattern is "worse after eating," and the accompanying symptom is "acid reflux." The visual features of the epigastric region are fused with semantic features such as bloating, worsening after meals, and acid reflux using the multimodal Transformer model to form a multimodal fused symptom representation of epigastric qi stagnation.

[0037] In step S3, the system determines the location of the ailment as the epigastric region (upper abdomen) based on the mapping relationship between body parts and TCM disease locations. Combining this with characteristics such as worsening after meals and acid reflux, it generates a dynamic follow-up questioning strategy to clarify the nature of the ailment. The system outputs the following questions to the patient: "You are referring to the epigastric region. Specifically, is the pain in the upper or lower abdomen? Is it a distending pain, a stabbing pain, or a dull pain? Is the pain continuous or intermittent? Is it accompanied by nausea, vomiting, or chills?" The patient adds: "It's just below my heart, a feeling of fullness, not constant pain, intermittent, sometimes nausea but no vomiting." The system integrates the information from multiple rounds, confirming the location of the ailment as the epigastric region (upper part of the stomach) and the nature of the ailment as liver-stomach disharmony and qi stagnation.

[0038] In this application, unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. In case of any inconsistency, the meaning set forth in this specification or derived from the content described herein shall prevail. Furthermore, the terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit the scope of this application.

[0039] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the technical solutions of the present invention, and these modifications or equivalent substitutions cannot cause the modified technical solutions to deviate from the spirit and scope of the technical solutions of the present invention.

Claims

1. A human-computer dialogue method for TCM diagnosis, characterized in that it includes the following steps:
S1. Synchronously receive multimodal data input by the patient;
S2. Perform gesture detection and body part localization on the real-time video stream to identify the symptom location pointed to by the patient, perform speech recognition and semantic parsing on the voice data to extract symptom description features, and use a multimodal Transformer model combined with a Cross-Attention mechanism to obtain a multimodal fused symptom representation;
S3. Input the multimodal fused symptom representation into a TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM disease locations, and generate a dynamic follow-up questioning strategy in combination with the symptom description features, so as to output follow-up questions to the patient regarding the nature, persistence, or accompanying symptoms of the pain;
S4. Receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.

2. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in S1, the multimodal data includes at least a real-time video stream containing the patient's pointing gestures and voice data describing the characteristics of the symptoms.

3. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in step S1, synchronously receiving the multimodal data input by the patient includes: acquiring a real-time video stream containing the patient's pointing gestures using a binocular camera or a depth camera; extracting image frames from the real-time video stream to obtain a continuous sequence of static images, and marking each image with timestamp information; acquiring voice data describing the patient's symptoms using a microphone array; segmenting the voice data into frames and marking them with timestamps corresponding to the video frames; and, based on the timestamp information, aligning the video data and the audio data in time to establish a mapping relationship between the gesture and the audio content at the same point in time, resulting in a synchronized multimodal data stream.
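The timestamp-based alignment described in claim 3 can be illustrated with a minimal sketch. It assumes video frames carry sorted timestamps in seconds and that speech has already been cut into segments with start and end times; neither representation is specified in the claim. The function name `align_segments_to_frames` and the nearest-frame fallback are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start time, end time, recognized text)


def align_segments_to_frames(frame_times: Sequence[float],
                             segments: Sequence[Segment]) -> List[Dict]:
    """Map each speech segment to the video frames whose timestamps fall inside it.

    frame_times must be sorted in ascending order.
    """
    if not frame_times:
        raise ValueError("at least one video frame timestamp is required")
    aligned = []
    for start, end, text in segments:
        first = bisect_left(frame_times, start)
        last = bisect_right(frame_times, end)
        if first < last:
            frames = list(range(first, last))
        else:
            # No frame inside the segment: fall back to the frame nearest its midpoint.
            mid = (start + end) / 2
            candidates = [i for i in (first - 1, first) if 0 <= i < len(frame_times)]
            frames = [min(candidates, key=lambda i: abs(frame_times[i] - mid))]
        aligned.append({"text": text, "frames": frames})
    return aligned


# Illustrative timestamps only (assumed values).
frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
segments = [(0.4, 1.2, "I feel a blockage here"), (1.7, 1.9, "acid reflux")]
for item in align_segments_to_frames(frame_times, segments):
    print(item)
```

Each aligned item pairs what was said with the frames in which the matching gesture would be visible, which is the gesture-to-audio mapping the claim refers to.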

4. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in step S2, performing gesture detection and body part localization on the real-time video stream to identify the symptom location pointed to by the patient, and performing speech recognition and semantic parsing on the voice data to extract the symptom description features, includes: performing human pose estimation on each static image frame of the real-time video stream to extract the key points of the hand and the key points of the body contour; preliminarily determining the symptom location pointed to by the patient by calculating the intersection of the finger pointing vector and the body part region, and mapping that location to the standard disease location name in the TCM disease location knowledge base; performing automatic speech recognition on the voice data to obtain text information; and using a medical named entity recognition model to extract the symptom description features, including the nature of the pain, the degree of the pain, the frequency of attacks, and accompanying symptoms, so as to establish a structured representation of the symptom features.

5. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in S2, using the multimodal Transformer model combined with the Cross-Attention mechanism to obtain the multimodal fused symptom representation includes: obtaining a visual feature vector by vectorizing the symptom location determined by the intersection of the finger pointing vector and the body part region; obtaining a text feature vector by semantically encoding the symptom description features extracted from the speech text by the medical named entity recognition model; mapping the visual feature vector and the text feature vector into the same dimensional space through linear projection layers to generate a query matrix, a key matrix, and a value matrix; and applying the Cross-Attention mechanism, with the visual features of the symptom location as the query and the text features describing the symptom as the key and value, to obtain the multimodal fused symptom representation through attention-weighted fusion, expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $\mathrm{Attention}(Q, K, V)$ is the multimodal fused symptom representation; $Q$ is the query matrix of the visual features of the symptom location determined by the intersection of the finger pointing vector and the body part region; $K$ is the key matrix; $V$ is the value matrix; $T$ denotes the matrix transpose; $d_k$ is the dimension of the key vectors; and $\mathrm{softmax}(\cdot)$ is the normalized exponential function.
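The attention-weighted fusion in claim 5 follows the standard scaled dot-product form, which can be sketched in plain Python. This is a single-head illustration under stated assumptions: the projection matrices `w_q`, `w_k`, `w_v` and all feature values are hypothetical placeholders, and a practical model would use a tensor library and learned weights.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax(row: List[float]) -> List[float]:
    """Normalized exponential function, shifted by the max for numerical stability."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual: Matrix, text: Matrix,
                    w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Fuse modalities: visual features form the query, text features the key and value."""
    q = matmul(visual, w_q)  # query matrix from symptom-location features
    k = matmul(text, w_k)    # key matrix from symptom-description features
    v = matmul(text, w_v)    # value matrix from symptom-description features
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs (assumed values): one visual token, two text tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention(
    visual=[[1.0, 0.0]],
    text=[[1.0, 0.0], [0.0, 1.0]],
    w_q=identity, w_k=identity, w_v=identity,
)
print(weights)  # attention over the two text tokens; each row sums to 1
print(fused)    # fused representation: attention-weighted mix of the value rows
```

With these toy inputs the visual query is most similar to the first text token, so that token receives the larger attention weight.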

6. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in step S3, inputting the multimodal fused symptom representation into the TCM knowledge graph reasoning engine and determining the symptom location information based on the mapping relationship between body parts and TCM disease locations includes: parsing the spatial location encoding features in the multimodal fused symptom representation and mapping them to the symptom location determined by the intersection of the finger pointing vector and the body part region; and retrieving, in the TCM knowledge graph, the body part-disease location mapping relationship associated with that symptom location, wherein the body part-disease location mapping relationship covers the entire human body, namely the head, face, trunk, and limbs, and maps the entire human body onto the TCM meridian chart.

7. The human-computer dialogue method for TCM diagnosis according to claim 1, characterized in that, in S3, the dynamic follow-up questioning strategy includes: determining different follow-up questions according to the area of pain; asking the patient about the specific location of the pain and, based on the patient's feedback, further asking about the nature of the pain, its persistence, and any accompanying symptoms; wherein the nature of the pain includes at least throbbing pain, dull pain, and stabbing pain; the persistence includes continuous pain and intermittent pain; and the accompanying symptoms include at least nausea, vomiting, sweating, chills, fever, and diarrhea.

8. A human-computer dialogue system for TCM diagnosis, characterized in that it includes:
a receiving module, used to synchronously receive multimodal data input by the patient;
a parsing module, used to perform gesture detection and body part localization on the real-time video stream to identify the symptom location pointed to by the patient, perform speech recognition and semantic parsing on the voice data to extract symptom description features, and use a multimodal Transformer model combined with a Cross-Attention mechanism to obtain a multimodal fused symptom representation;
a follow-up question module, used to input the multimodal fused symptom representation into a TCM knowledge graph reasoning engine, determine the symptom location information based on the mapping relationship between body parts and TCM disease locations, and generate a dynamic follow-up questioning strategy in combination with the symptom description features, outputting follow-up questions to the patient regarding the nature, persistence, or accompanying symptoms of the pain;
a diagnostic module, used to receive the patient's feedback data on the follow-up questions, integrate the information from multiple rounds of dialogue to generate a TCM syndrome differentiation result, and output diagnostic suggestions.