Intelligent question-answering generation method, apparatus, device, and medium based on a multimodal model
Multimodal data is collected and preprocessed to generate visual, semantic, and voice intonation feature vectors, which are aligned and fused. The fused samples are then input into a large multimodal model for inference. This addresses the limited data collection methods and the lack of doctor review in dermatology consultation systems, improving diagnostic accuracy and consultation efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing dermatology consultation systems suffer from limited data collection, single model modality, and lack of doctor verification, resulting in low diagnostic accuracy and failure to meet clinical medical safety requirements.
The system collects patient image data, text data, and voice data, preprocesses them to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors, aligns the data according to timestamps and patient identifiers, and fuses them into multimodal data samples. These samples are then input into a large multimodal model for inference, and preliminary diagnostic results and treatment suggestions are output.
The method improves the accuracy of dermatological diagnosis and the efficiency of consultation, enables comprehensive collection and processing of multi-dimensional disease information, helps ensure the reliability of diagnostic results and the appropriateness of treatment suggestions, and reduces the probability of misdiagnosis and missed diagnosis.
[Figure CN122240787A_ABST: abstract drawing]
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to an intelligent question-answering generation method, apparatus, device, and medium based on a multimodal model.
Background Technology
[0002] Dermatological diseases are characterized by their wide variety and complex symptoms. Diagnosis relies heavily on visual features such as the morphology, color, and distribution of skin lesions, while also requiring comprehensive assessment based on textual information including patient symptom descriptions, medical history, and lifestyle habits. Traditional dermatological treatment primarily involves in-person visits, requiring patients to queue and register, which is time-consuming and laborious. Furthermore, high-quality medical resources are concentrated in large cities, often leaving patients in remote areas unable to receive timely treatment, potentially leading to delays in diagnosis and treatment.
[0003] The development of internet healthcare has spurred the emergence of online dermatology consultation systems, but existing systems still have significant shortcomings. First, data collection methods are limited, often supporting only text descriptions or single image uploads; multimodal data cannot be fused effectively, so diagnostic evidence is insufficient and accuracy is low. Second, diagnostic models are mostly single-modal: they either rely solely on image recognition to determine the lesion type or use only text semantic analysis to infer etiology, and because they do not integrate visual and semantic features they are prone to misdiagnosis and missed diagnosis. Third, doctor review is commonly absent, so the reliability of purely AI-generated diagnostic results is difficult to guarantee and clinical medical safety requirements are not met.
[0004] Multimodal large models have made significant progress in fields such as image recognition and natural language processing, possessing the ability to fuse multiple types of data and perform comprehensive reasoning. However, there is still a lack of mature solutions for applying them to dermatological consultations.
Summary of the Invention
[0005] The main objective of this invention is to provide an intelligent question-answering generation method, apparatus, device, and storage medium based on a multimodal model, aiming to solve the technical problems of existing dermatology consultation systems, namely limited data collection methods, single-modality models, and lack of doctor review.
[0006] To achieve the above objectives, this invention provides an intelligent question-answering generation method based on a multimodal model, comprising:
collecting patient image data, text data, and voice data;
preprocessing the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors;
aligning the visual feature vectors, semantic feature vectors, and voice intonation feature vectors according to timestamps and patient identifiers, and fusing them into a multimodal data sample; and
inputting the multimodal data sample into a large multimodal model for inference, and outputting preliminary diagnostic results and preliminary treatment suggestions.
[0007] Furthermore, to achieve the above objectives, the present invention provides an intelligent question-answering generation device based on a multimodal model, comprising:
a data acquisition module, used to collect patient image data, text data, and voice data;
a data processing module, used to preprocess the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors;
a data fusion module, used to align the visual feature vectors, semantic feature vectors, and voice intonation feature vectors according to timestamps and patient identifiers, and fuse them into a multimodal data sample; and
a data inference module, used to input the multimodal data sample into a large multimodal model for inference, and output preliminary diagnostic results and preliminary treatment suggestions.
[0008] Furthermore, to achieve the above objectives, the present invention also provides a computer device, the computer device including a memory, a processor, and a multimodal model-based intelligent question-answering generation program stored in the memory and executable on the processor, wherein when the multimodal model-based intelligent question-answering generation program is executed by the processor, it implements the steps of the multimodal model-based intelligent question-answering generation method as described above.
[0009] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a multimodal model-based intelligent question-answering generation program, wherein the multimodal model-based intelligent question-answering generation program, when executed by a processor, implements the steps of the multimodal model-based intelligent question-answering generation method as described above.
[0010] Beneficial Effects: This invention relates to the field of artificial intelligence technology and discloses an intelligent question-answering generation method, apparatus, device, and medium based on a multimodal model. The method includes: collecting patient image data, text data, and voice data; preprocessing the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors; aligning the visual feature vectors, semantic feature vectors, and voice intonation feature vectors according to timestamps and patient identifiers, and fusing them into a multimodal data sample; inputting the multimodal data sample into a large multimodal model for inference, and outputting preliminary diagnostic results and preliminary treatment suggestions. This invention can be applied to business scenarios such as healthcare, by collecting dermatology multimodal data and preprocessing it into feature vectors, then aligning and fusing them before inputting them into a large multimodal model to infer preliminary diagnostic results, thereby improving diagnostic accuracy and consultation efficiency.
Attached Figure Description
[0011] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:
Figure 1 is a schematic diagram of an application environment for an intelligent question-answering generation method based on a multimodal model, according to an embodiment of the present invention;
Figure 2 is a flowchart illustrating an embodiment of the intelligent question-answering generation method based on a multimodal model according to the present invention;
Figure 3 is a schematic diagram of the functional modules of a preferred embodiment of the intelligent question-answering generation device based on a multimodal model of the present invention;
Figure 4 is a schematic diagram of the structure of a computer device according to an embodiment of the present invention;
Figure 5 is another structural schematic diagram of a computer device according to one embodiment of the present invention.
Detailed Implementation
[0012] It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the invention.
[0013] The intelligent question-answering generation method based on a multimodal model provided in this invention can be applied in an application environment such as the one shown in Figure 1, in which the client communicates with the server via a network. The server can collect patient image data, text data, and voice data through the client; preprocess the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors; align the visual feature vectors, semantic feature vectors, and voice intonation feature vectors according to timestamps and patient identifiers, and fuse them into a multimodal data sample; input the multimodal data sample into a large multimodal model for inference, and output preliminary diagnostic results and preliminary treatment suggestions. This invention can be applied to business scenarios such as healthcare, by collecting dermatology multimodal data, preprocessing it into feature vectors, aligning and fusing them, and then inputting them into a large multimodal model to infer preliminary diagnostic results, thereby improving diagnostic accuracy and consultation efficiency. The client can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented using a standalone server or a server cluster consisting of multiple servers. The invention will be described in detail below through specific embodiments.
[0014] Refer to Figure 2, which is a flowchart illustrating an embodiment of the intelligent question-answering generation method based on a multimodal model provided by the present invention. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than that shown here.
[0015] As shown in Figure 2, the intelligent question-answering generation method based on a multimodal model proposed in this invention includes the following steps:
S100. Collect patient image data, text data, and voice data;
S200. Preprocess the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and voice intonation feature vectors;
S300. Align the visual feature vectors, semantic feature vectors, and voice intonation feature vectors according to the timestamp and patient identifier, and fuse them into a multimodal data sample;
S400. Input the multimodal data sample into the large multimodal model for inference, and output preliminary diagnostic results and preliminary treatment suggestions.
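The flow of steps S100 to S400 can be sketched as a small Python skeleton. This is only an illustration under stated assumptions: the names `RawSubmission`, `MultimodalSample`, and `run_pipeline` are hypothetical, and the per-modality encoders and the model are passed in as plain callables because the source does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class RawSubmission:
    """S100: one patient submission (hypothetical container)."""
    patient_id: str
    timestamp: float
    image: bytes
    text: str
    voice: bytes


@dataclass
class MultimodalSample:
    """S300: feature vectors keyed by modality label."""
    patient_id: str
    timestamp: float
    features: Dict[str, Vector]


def run_pipeline(
    raw: RawSubmission,
    encoders: Dict[str, Callable],
    model: Callable[[MultimodalSample], dict],
) -> dict:
    # S200: preprocess and encode each modality separately.
    features = {
        "visual": encoders["visual"](raw.image),
        "semantic": encoders["semantic"](raw.text),
        "intonation": encoders["intonation"](raw.voice),
    }
    # S300: bind the vectors to the same patient and timestamp.
    sample = MultimodalSample(raw.patient_id, raw.timestamp, features)
    # S400: hand the fused sample to the (assumed) multimodal model.
    return model(sample)
```

Passing the encoders and the model as arguments keeps the skeleton independent of any particular feature extractor or model implementation.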
[0016] In this embodiment, during the intelligent dermatology consultation, multi-dimensional patient data collection will be carried out first. For image data, patients will be specifically guided to take clear pictures of the lesion area to accurately obtain core visual information such as the shape, color, and distribution of the lesion. For text data, both structured and unstructured input methods will be provided. Structured input will set preset fields around key dimensions such as symptom duration and past medical history, while unstructured input will allow patients to freely describe symptom details, comprehensively collecting textual information about the condition. For voice data, patients can describe their condition by voice, and automatic speech recognition technology will be integrated to convert the speech into text, while also retaining the voice intonation feature data to provide auxiliary basis for subsequent diagnosis.
[0017] After completing multimodal data acquisition, specialized preprocessing is performed on each type of data. For image data, denoising, color correction, resizing, and compression are performed sequentially to eliminate interference information and adapt to model requirements. The data is then encoded into visual feature vectors that can be recognized by the model. For text data, structured text undergoes field validation and completion, while unstructured text and speech-to-text are processed using natural language processing methods such as word segmentation, stop word removal, and part-of-speech tagging. These are then converted into semantic feature vectors using Word2Vec or BERT models. For speech intonation feature data, corresponding feature extraction and vectorization are performed. Finally, through data alignment, the processed visual feature vectors, semantic feature vectors, and speech intonation feature vectors are precisely matched and integrated according to timestamps and unique patient identifiers to form a unified, information-related multimodal data sample, ensuring the correlation and consistency of data across different dimensions.
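The image preprocessing described above can be illustrated with a minimal pure-Python sketch. The source does not name specific algorithms, so the 3x3 median filter (for denoising) and nearest-neighbor resampling (for resizing) used here are assumptions, as are the function names. Color correction and encoding into a feature vector are omitted.

```python
from statistics import median
from typing import List

Image = List[List[int]]  # grayscale image, pixel values 0-255


def median_denoise(img: Image) -> Image:
    """Apply a 3x3 median filter; border pixels use the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = int(median(window))
    return out


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Resize by picking the nearest source pixel for each target pixel."""
    h, w = len(img), len(img[0])
    return [
        [img[y * h // new_h][x * w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]
```

A production system would more likely use an image library for these operations; the sketch only shows the order of operations and what each one does to the pixel grid.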
[0018] After multimodal data samples are generated, they are input into a professionally trained multimodal model for diagnostic inference. This model is built on the Transformer architecture, with dedicated branches for image, text, and speech in the encoder section. These branches encode the feature vectors of their respective modalities and achieve deep fusion of multimodal features through a cross-attention mechanism. Simultaneously, it incorporates a built-in dermatology disease knowledge base, covering symptoms, diagnostic criteria, and treatment plans for common skin diseases. The model autonomously determines whether the collected information meets diagnostic requirements. If sufficient information is available, the model outputs a preliminary diagnosis including the disease name and diagnostic confidence level. It also analyzes possible causes based on the patient's condition and provides targeted preliminary treatment suggestions, including medication recommendations, nursing guidance, and necessary further examination suggestions. If information is insufficient, the model proactively asks supplementary questions to continue collecting relevant patient information, ensuring a comprehensive diagnosis.
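The branch between outputting a diagnosis and asking supplementary questions can be sketched as follows. In the source the model itself judges whether the information is sufficient; the rule-based check below is a simplified stand-in, and the field names, question texts, and the function `next_step` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical minimum information set, each with a follow-up question.
REQUIRED = {
    "lesion_image": "Please upload a clear photo of the affected skin area.",
    "symptom_duration": "How long have you had these symptoms?",
    "medical_history": "Do you have any past skin conditions or allergies?",
}


def next_step(collected: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Decide whether to ask follow-up questions or proceed to diagnosis."""
    missing = [name for name in REQUIRED if not collected.get(name)]
    if missing:
        # Information is insufficient: ask only for what is still missing.
        return {"action": "ask", "questions": [REQUIRED[name] for name in missing]}
    # Information is sufficient: proceed to inference.
    return {"action": "diagnose"}
```

In a consultation loop, `next_step` would be called after each patient reply until it returns the `diagnose` action.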
[0019] This embodiment significantly improves diagnostic accuracy. By fusing multimodal data from images, text, and speech, it fully leverages the complementary value of visual and semantic features. Combined with the deep reasoning capabilities of a multimodal large-scale model, it effectively integrates various key information required for dermatological diagnosis, significantly reducing the probability of misdiagnosis and missed diagnosis. Secondly, it achieves a comprehensive improvement in consultation efficiency. For patients, there is no need to queue offline for registration; they can complete multimodal data collection and consultation anytime through online terminals, saving a significant amount of time and effort. For doctors, the model can autonomously and efficiently collect disease information and generate a structured overview, while simultaneously outputting preliminary diagnoses and treatment suggestions. Doctors only need to review and optimize, greatly shortening the overall consultation cycle and effectively alleviating the time cost pressure of medical services. Furthermore, this technology makes the collection and processing of multidimensional disease information more comprehensive. Compared to single-modal data collection and analysis, it retains the core basis of dermatological diagnosis—visual features of skin lesions—while also collecting information such as the patient's subjective symptoms and medical history through text and speech, and using speech tone features as an auxiliary judgment basis, making the information foundation for diagnosis more solid. Furthermore, standardized preprocessing and data alignment processes ensure the effectiveness and relevance of multimodal data, making the reasoning process of large models more logical, and the output diagnostic results and treatment suggestions are more in line with the actual condition of patients, providing a reliable reference for subsequent medical decisions.
[0020] For example, in the healthcare field, this technology can be widely applied to online dermatology consultations, serving as an important supplement to offline treatments. It can also be extended to remote consultations in other departments that rely on visual information and symptom descriptions. On internet healthcare platforms, patients can upload images of skin lesions, fill in text descriptions of their symptoms, and send voice descriptions at any time. The system automatically completes multimodal data collection and preprocessing, quickly generating feature vectors and fusing them into data samples. After multimodal large-scale model inference, it provides preliminary diagnosis and treatment suggestions, significantly improving the efficiency and professionalism of online consultations. Primary healthcare institutions and medical points in remote areas can leverage this technology to compensate for the lack of specialized dermatology doctors. The system can quickly complete preliminary disease assessments, providing patients with basic treatment suggestions and accurate references for subsequent referrals. Furthermore, this technology can be applied to home healthcare terminals, allowing users to perform self-service preliminary screening for skin problems, promptly identify conditions, and take appropriate measures. It can also provide auxiliary inference support for clinical dermatology diagnosis and treatment in hospitals, helping doctors quickly integrate multi-dimensional patient information, improving the efficiency of offline treatments, and promoting the digitalization, intelligentization, and universalization of healthcare services.
[0021] In one embodiment, S100 includes:
S101. The image acquisition unit acquires the image of the skin lesion area uploaded by the user;
S102. The text input unit obtains the structured text data and unstructured text data filled in by the user;
S103. The voice acquisition unit acquires the voice data uploaded by the user.
[0022] In this embodiment, during the multimodal data acquisition stage of intelligent dermatology consultation, the image acquisition unit specifically conducts guided acquisition of images of the lesion area. It provides users with shooting instructions to help them clearly capture the core area of the lesion, accurately capturing key visual features such as the morphology, color, distribution, and size of the lesion. This ensures that the uploaded images accurately and completely reflect the visual manifestations of the skin condition, providing reliable visual evidence for subsequent diagnosis. The text input unit uses a combination of structured and unstructured methods to acquire text data. For structured text data, it sets pre-defined standardized fields such as symptom duration, past skin disease history, allergy history, medication use, and living environment, guiding users to fill them in one by one to ensure the completeness and standardization of basic medical information. For unstructured text data, it provides a free input interface, allowing users to describe in detail their subjective symptoms, development and changes, triggering or alleviating factors, and other personalized details, fully collecting subjective information about the user's condition. The voice acquisition unit provides users with a convenient channel for uploading voice data. Users can intuitively express their feelings about their illness and details of their symptoms through voice. The unit will simultaneously receive and store the voice data, and integrate automatic speech recognition technology to accurately convert the voice content into text data. It will also retain the tone, speed and other characteristic data of the voice to supplement relevant information for subsequent diagnosis.
[0023] This multi-dimensional data acquisition technology boasts numerous advantages. Firstly, it achieves comprehensive collection of patient information. The image acquisition unit captures crucial visual features of skin lesions for dermatological diagnosis, while the text input unit gathers standardized basic medical history and personalized symptom descriptions. Simultaneously, the voice acquisition unit balances user convenience with the auxiliary value of voice features. This multi-dimensional data complements each other, making the collection of patient information more complete and laying a solid information foundation for subsequent diagnosis. Secondly, it enhances the convenience and flexibility of user operation. It provides three different information submission methods—image, text, and voice—to cater to different user habits and expression needs. Structured text filling reduces the difficulty of information organization for users, while unstructured text and voice input meet the user's need for free expression. All acquisition methods are simple to operate and adaptable to users of different ages and operational abilities, effectively improving the user experience of consultations. Furthermore, the accuracy of medical information is ensured. The guided design of image acquisition reduces invalid captures and ensures the effectiveness of visual information; the pre-set fields of structured text prevent the omission of basic medical information; and the combination of speech recognition and feature preservation not only realizes the text conversion of speech information but also retains additional auxiliary judgment information, allowing all types of collected data to accurately reflect the user's actual condition. 
In addition, the text acquisition method that combines standardization and personalization facilitates subsequent data preprocessing and model recognition while fully collecting personalized medical details of users, balancing data processing efficiency and the richness of medical information.
[0024] For example, in the dermatology field of healthcare, this is a core foundational element of online intelligent consultation, providing an efficient information collection solution for telemedicine and primary care. In internet healthcare platforms and online dermatology consultation terminals, the image acquisition unit guides users to take standardized images of skin lesions, accurately capturing the core visual features of skin diseases. The text input unit collects standardized information such as medical history and symptom duration through structured preset fields, while also allowing users to freely describe symptom details using unstructured input. The voice acquisition unit provides a convenient channel for users with operational difficulties or who prefer voice expression, and also retains intonation characteristics to aid in disease diagnosis. This technology can also be applied to primary healthcare institutions, helping medical staff quickly collect multi-dimensional patient information, providing a complete basis for subsequent diagnosis; it is also compatible with home healthcare terminals, allowing users to independently collect skin disease-related information without leaving home, enabling preliminary disease reporting. Furthermore, in telemedicine consultations, this data collection method can simultaneously transmit complete and multi-dimensional patient condition data to experts in different locations, making remote diagnosis more targeted, effectively compensating for the time and space limitations of offline diagnosis and treatment, promoting the digitalization, convenience and accessibility of dermatology medical services, and facilitating the efficient flow of high-quality medical resources.
[0025] In one embodiment, S200 includes:
S201. Perform noise reduction, color correction, and size compression processing on the image of the skin lesion area;
S202. Encode and convert the processed skin lesion area image to generate a visual feature vector;
S203. Perform field validation and field completion processing on the structured text data;
S204. Analyze and process the voice data, extract voice intonation features, and convert the voice data into text;
S205. Perform word segmentation, stop word removal, and part-of-speech tagging on the unstructured text data and the text converted from the voice data;
S206. Convert the processed structured text data, unstructured text data, and converted text into semantic feature vectors.
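Step S203 (field validation and field completion) can be sketched as below. The schema, the field names, and the policy of filling optional fields with a default while reporting missing required fields back to the user are assumptions made for illustration; the source only states that fields are validated and completed.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: field name -> (required, default used when optional and empty)
SCHEMA = {
    "symptom_duration_days": (True, None),
    "past_skin_disease": (True, None),
    "allergy_history": (False, "none reported"),
    "current_medication": (False, "none reported"),
}


def validate_and_complete(record: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Return the completed record and a list of problems to prompt the user about."""
    completed: Dict[str, object] = {}
    problems: List[str] = []
    for name, (required, default) in SCHEMA.items():
        value = record.get(name)
        if value is None or str(value).strip() == "":
            if required:
                # Required fields cannot be guessed; prompt the user instead.
                problems.append(f"missing required field: {name}")
            else:
                completed[name] = default
            continue
        completed[name] = value
    # Simple plausibility check on one field.
    duration = completed.get("symptom_duration_days")
    if duration is not None and (not isinstance(duration, int) or duration < 0):
        problems.append("symptom_duration_days must be a non-negative integer")
    return completed, problems
```

For example, a record with a duration but an empty disease history yields a completed record with defaults for the optional fields and one problem to prompt the user about.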
[0026] In this embodiment, during the data preprocessing stage of the dermatology intelligent consultation, the images of the skin lesion area undergo denoising, color correction, and size compression processing in sequence. Denoising eliminates interference factors such as noise and blurring in the image; color correction restores the true color characteristics of the skin lesions, avoiding color deviations caused by shooting light and equipment; and size compression adjusts the image specifications according to the model's operational needs, improving subsequent processing efficiency while ensuring feature integrity. After basic processing, the image is professionally encoded, transforming visual information into visual feature vectors that the model can recognize and calculate, accurately extracting core visual features such as the morphology, distribution, and color of the skin lesions. For structured text data, strict field validation and completion are performed. The validation stage checks the rationality and standardization of the filled content, identifying missing or incorrect information. The completion stage provides prompts or supplements for required but unfilled fields, ensuring the integrity and standardization of structured data such as basic medical history and symptom information. For speech data, intonation and speech rate features are accurately extracted from the speech signal. Simultaneously, automatic speech recognition technology is used to convert the speech content completely and accurately into text, preserving auxiliary judgment information in the speech while achieving the conversion of speech data into text format. For unstructured text data and the text converted from speech, natural language processing is performed, sequentially completing word segmentation, stop word removal, and part-of-speech tagging. 
Word segmentation breaks continuous text into independent words, stop word removal eliminates non-semantic modifiers and conjunctions, and part-of-speech tagging clarifies the grammatical attributes of words. After the above processing, a Word2Vec or BERT model is used to uniformly convert the processed structured text data, unstructured text data, and speech-to-text into semantic feature vectors, realizing the digital and vectorized expression of text information.
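Word segmentation and stop word removal can be illustrated with a short sketch. It assumes English input and a regular-expression tokenizer as a stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool); the stop word list is a hypothetical sample. Part-of-speech tagging and the Word2Vec or BERT embedding step are omitted.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "on", "my", "i", "have", "for", "is", "it"}


def segment(text: str) -> List[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that carry little semantic content."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = remove_stop_words(segment("I have an itchy red rash on my arm for 3 days"))
# tokens == ["itchy", "red", "rash", "arm", "3", "days"]
```

The remaining tokens are what would then be tagged and passed to the embedding model to obtain the semantic feature vector.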
[0027] This comprehensive data preprocessing technology design boasts numerous significant advantages. Firstly, it ensures the validity and accuracy of the data. Operations such as image denoising and color correction restore the true visual characteristics of skin lesions, allowing visual feature vectors to accurately reflect the actual condition of skin diseases. Verification and completion of structured text avoids missing or erroneous basic information, providing a standardized foundation for semantic feature extraction. Simultaneous extraction of intonation features and text conversion in speech processing not only uncovers auxiliary diagnostic information in speech but also achieves unified processing of multiple types of text data, comprehensively improving the quality of various data types. Secondly, it achieves data standardization and normalization. Through encoding and vectorization, raw data of different types and formats, such as images, text, and speech, are uniformly converted into feature vectors that the model can directly process. This solves the problem of inconsistent multimodal data formats, laying the foundation for subsequent deep fusion of multimodal features and significantly improving the efficiency of model inference. Furthermore, it enhances the targeting and professionalism of data processing. Specifically tailored to the characteristics of dermatological consultations, it performs specialized visual processing on skin lesion images and employs differentiated preprocessing strategies for different types of text data. This aligns with the information needs of dermatological disease diagnosis, ensuring that the extracted visual and semantic feature vectors accurately capture core information related to the disease. In addition, the standardized preprocessing workflow gives the data good versatility and adaptability. 
It can efficiently support inference calculations for multimodal large models and provide clear and standardized processed data for subsequent doctor review, balancing the efficiency of AI processing with the professionalism of medical diagnosis. Simultaneously, it allows the system to stably process various types of data uploaded by different users, improving the system's compatibility and robustness.
[0028] For example, in intelligent dermatology healthcare systems, this technology is widely applied in various healthcare scenarios, including internet healthcare platforms, remote dermatology consultations, primary healthcare institutions, and home healthcare terminals. In various online consultation systems, this technology can process patient-uploaded skin lesion images through denoising and color correction, restoring the true visual features of the lesions and generating visual feature vectors. Simultaneously, it verifies and completes structured text, extracts speech intonation features, and performs speech-to-text conversion. Furthermore, it processes the text data through word segmentation to generate semantic feature vectors, transforming non-standardized multimodal raw data into standardized feature data recognizable by the model. Primary hospitals, leveraging this technology, can quickly and systematically process patient skin lesion images and disease descriptions, providing high-quality data support for AI-assisted diagnosis or remote consultations. This technology is also compatible with home self-service consultation devices, standardizing user-collected multimodal data related to skin problems and providing a reliable data foundation for subsequent AI preliminary screening. It has enabled the standardization and digital transformation of multimodal medical data in dermatology, laying a data foundation for subsequent multimodal feature fusion and intelligent diagnosis, improving the accuracy and efficiency of intelligent medical care in dermatology, and promoting the digitalization and standardization of dermatological diagnosis and treatment.
[0029] In one embodiment, S300 includes: S301. Generate a globally unique patient identifier for the patient; S302. Mark the visual feature vector, semantic feature vector, and speech intonation feature vector with time stamps according to the data collection timestamp; S303. Add modality type labels to the visual feature vector, semantic feature vector, and speech intonation feature vector; S304. Align the visual feature vector, semantic feature vector, and speech intonation feature vector using the globally unique patient identifier, time stamps, and modality type labels to generate a multimodal feature vector association matrix; S305. Fuse the multimodal feature vector association matrix to generate multimodal data samples.
[0030] In this embodiment, during the multimodal data integration and processing stage, a globally unique patient identifier is first generated for each patient. This identifier serves as the core basis for associating various types of patient diagnostic and treatment data, ensuring that disease data from different sources and modalities can be accurately attributed to the corresponding patient. Subsequently, according to the data collection timestamp, the preprocessed visual feature vector, semantic feature vector, and speech intonation feature vector are uniformly time-stamped, clearly recording the collection time of various feature data and restoring the temporal dimension characteristics of the patient's disease information. At the same time, exclusive modality type labels are added to the three types of feature vectors to clearly distinguish the feature data of different modalities such as visual, semantic, and speech intonation, making the attribute characteristics of various data readily apparent. Based on this, using the globally unique patient identifier as the core association clue, combined with the timestamp, feature data matching is achieved within the same time dimension and for the same patient. Then, relying on the modality type label, the classification and association of different modality feature vectors are completed. Through the synergistic effect of the three identifiers, multi-dimensional data alignment is achieved, ultimately constructing a multimodal feature vector association matrix. The matrix clearly presents the various modality feature vectors of the same patient within the same time dimension and their interrelationships. 
Finally, the association matrix is subjected to deep fusion processing to break down the information barriers between different modal data, integrate the scattered visual, semantic, and speech tone feature information into a whole, and generate multimodal data samples with unified format, complete information, and close correlation, providing an integrated data analysis foundation for subsequent diagnostic reasoning.
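The alignment and fusion steps described above can be illustrated with a minimal sketch. It is not the implementation of this embodiment: the names `TaggedFeature`, `build_association_matrix`, and `fuse`, the 300-second alignment window, and the use of plain concatenation as the fusion operation are all assumptions made for illustration, since the embodiment does not specify how "the same time dimension" is measured or how the deep fusion is computed.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Modality type labels, in the fixed order used when fusing (assumed names).
MODALITIES = ("visual", "semantic", "speech_intonation")


@dataclass(frozen=True)
class TaggedFeature:
    """A feature vector carrying the three identifiers used for alignment."""
    patient_id: str             # globally unique patient identifier
    time_stamp: int             # collection time in seconds (assumed unit)
    modality: str               # one of MODALITIES
    vector: Tuple[float, ...]


def new_patient_id() -> str:
    """Generate a globally unique patient identifier (here a random UUID)."""
    return uuid.uuid4().hex


def build_association_matrix(
    features: List[TaggedFeature], window_s: int = 300
) -> Dict[Tuple[str, int], Dict[str, Tuple[float, ...]]]:
    """Align features: one row per (patient, time window), one column per modality."""
    matrix: Dict[Tuple[str, int], Dict[str, Tuple[float, ...]]] = defaultdict(dict)
    for f in features:
        row_key = (f.patient_id, f.time_stamp // window_s)
        matrix[row_key][f.modality] = f.vector
    return dict(matrix)


def fuse(
    matrix: Dict[Tuple[str, int], Dict[str, Tuple[float, ...]]]
) -> Dict[Tuple[str, int], List[float]]:
    """Fuse each complete row into one sample by concatenation (a stand-in for deep fusion)."""
    samples: Dict[Tuple[str, int], List[float]] = {}
    for row_key, row in matrix.items():
        if all(m in row for m in MODALITIES):  # skip rows missing a modality
            samples[row_key] = [x for m in MODALITIES for x in row[m]]
    return samples
```

In this sketch, three feature vectors collected for one patient within the same window end up in a single row and yield one fused sample, while a row that lacks any modality is left unfused.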
[0031] This embodiment achieves precise association and orderly integration of multimodal data. Through triple positioning using globally unique patient identifiers, time stamps, and modality type labels, it completely solves the problems of ambiguous attribution, disordered temporal sequence, and mixed types in multi-source, multimodal data. This allows feature vectors of different modalities to be accurately matched according to patient and time dimensions, ensuring data correlation and orderliness, and laying a standardized foundation for subsequent fusion analysis. Secondly, the construction of the multimodal feature vector association matrix visualizes and structures the relationships between various feature data, facilitating quick retrieval and viewing of specific modal data for a particular patient or time period. It also makes subsequent fusion processing more targeted, accurately mining complementary information between different modalities and improving the efficiency and effectiveness of feature fusion. Furthermore, the standardized identification, labeling, and alignment process enables the system to efficiently process large volumes of patient multimodal data, significantly improving the scalability of data processing and adapting to the massive user needs of online consultation systems. Meanwhile, the resulting unified multimodal data samples can be directly recognized and inferred by the multimodal large model without requiring additional data parsing and matching, effectively improving the efficiency of model diagnostic reasoning and enabling the model to more quickly integrate various feature information for disease diagnosis. Furthermore, this data processing method facilitates subsequent doctor review and data traceability. Doctors can quickly locate various original feature data and fused sample data of patients through patient identifiers and time stamps, clearly viewing the source and correlation of the data. 
This also facilitates subsequent review, analysis, and research of diagnostic and treatment data, meeting the dual needs of clinical application and medical data management.
[0032] For example, this data alignment and fusion technology is a key component of intelligent dermatology healthcare, widely applied in internet healthcare platforms, remote dermatology consultations, and intelligent consultation systems in primary healthcare institutions. In various online dermatology consultation platforms, the system generates a globally unique identifier for each patient. Combining timestamps and modality type tags, it precisely aligns preprocessed visual, semantic, and speech tone feature vectors, constructing an association matrix and fusing it into a unified multimodal data sample. This allows dispersed multi-source data to form a structured and integrated foundation for diagnostic and treatment data. Primary hospitals can leverage this technology to standardize the processing of multimodal patient data, providing standardized samples for AI-assisted diagnosis and integrating complete and related patient data for remote consultations. Simultaneously, this technology is compatible with regional medical data platforms, enabling standardized management of dermatology diagnostic and treatment data, and facilitating cross-platform and cross-scenario data tracing through unique identifiers. Furthermore, in the clinical application of dermatology AI models, standardized multimodal data samples provide accurate and complete input for model inference, significantly improving the accuracy of intelligent diagnosis, promoting the digital integration and intelligent application of dermatology medical data, and contributing to the dual improvement of medical service efficiency and professionalism.
[0033] In one embodiment, prior to S400, the method further includes: pre-building a multimodal large model based on the Transformer architecture; encoding the visual feature vector with the image encoder of the multimodal large model to generate a visual vector encoding; encoding the semantic feature vector with the text encoder of the multimodal large model to generate a text vector encoding; encoding the speech intonation feature vector with the speech encoder of the multimodal large model to generate a speech vector encoding; deeply fusing the visual vector encoding, text vector encoding, and speech vector encoding with a cross-attention mechanism to generate a multimodal vector encoding; and inputting the multimodal vector encoding into the decoder of the multimodal large model and training the model in conjunction with the disease knowledge base until the diagnostic accuracy of the multimodal large model reaches a preset threshold.
[0034] In this embodiment, during the model building and training phase of the dermatology intelligent consultation system, a multimodal large-scale model based on the Transformer architecture is pre-built. This model is specifically designed for the needs of dermatology consultation, with independent image, text, and speech branches set up on the encoder side, and the decoder side adapting to scenario requirements such as diagnostic result output and supplementary consultation questions, laying the architectural foundation for the encoding, fusion, and inference of multimodal features. Based on this architecture, the model performs deep encoding on the preprocessed visual feature vectors through a dedicated image encoder, mining the deep correlation information of core visual features such as morphology, color, and distribution in the skin lesion image, and generating structured visual vector codes; it performs encoding processing on the semantic feature vectors through a text encoder, extracting the semantic logic and key features of information such as symptom description, medical history, and lifestyle habits in the text, and generating text vector codes; and it performs professional encoding on the speech intonation feature vectors through a speech encoder, capturing the potential information contained in features such as intonation and speech rate in the speech, and generating speech vector codes. After each modality is encoded individually, the model employs a cross-attention mechanism to deeply fuse the visual, text, and speech vector codes. This mechanism allows the encoded information from different modalities to interact and correlate with each other, uncovering complementary relationships and intrinsic connections between features of different modalities, breaking down information barriers of a single modality, and generating multimodal vector codes that integrate multidimensional feature information. 
These multimodal vector codes are then input into the model's decoder, and the model is trained using a built-in dermatology disease knowledge base. This knowledge base covers professional medical content such as symptoms, diagnostic criteria, and treatment plans for common skin diseases. During training, the model continuously matches, infers, and learns from the fused multimodal features and the knowledge base content, constantly optimizing model parameters until the model's diagnostic accuracy for dermatology diseases reaches a preset threshold, completing the model training and optimization.
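The cross-attention fusion described above can be illustrated with a minimal sketch in plain Python. It assumes scaled dot-product attention with identity projections (no learned weight matrices), a residual connection, and that all three encodings share the same vector dimension. The function names `cross_attention` and `fuse_modalities` are hypothetical, and the embodiment does not specify these details.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Scaled dot-product attention: each query is replaced by a weighted mix of values."""
    dim = len(keys[0])
    outputs: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return outputs


def fuse_modalities(
    visual: List[Vector], text: List[Vector], speech: List[Vector]
) -> List[Vector]:
    """Let each modality attend to the other two, then stack the results as one sequence."""
    fused: List[Vector] = []
    pairs = (
        (visual, text + speech),
        (text, visual + speech),
        (speech, visual + text),
    )
    for own, others in pairs:
        attended = cross_attention(own, others, others)
        # Residual connection (an assumption): keep the modality's own encoding
        # and add the information gathered from the other modalities.
        fused.extend([[a + b for a, b in zip(o, c)] for o, c in zip(own, attended)])
    return fused
```

The point of the sketch is that the fused encoding of, say, a lesion image token depends on the text and speech tokens through the attention weights, which is what distinguishes this kind of fusion from simply splicing the three encodings together.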
[0035] This implementation leverages the Transformer architecture and a dedicated multimodal branch design, enabling the model to selectively process different types of feature data, including visual, text, and speech data. Each encoder can accurately extract the core features of its corresponding modality, ensuring the professionalism and accuracy of single-modal feature extraction. Secondly, the application of a cross-attention mechanism achieves deep fusion of multimodal features, rather than simple information splicing. This effectively uncovers the intrinsic connections and complementary values between different modal features. For example, combining the visual features of skin lesions with the patient's subjective symptoms described in text and the state features reflected in speech allows the model to form a more comprehensive diagnostic basis, fundamentally improving the model's diagnostic reasoning ability. Furthermore, training with a professional dermatology disease knowledge base ensures that the model's reasoning always aligns with professional clinical standards, avoiding the problem of unstandardized model training. This guarantees the professionalism and scientific validity of the diagnostic results and treatment suggestions output by the model after training, making the model's diagnosis more in line with clinical practice guidelines. Simultaneously, using a preset threshold for diagnostic accuracy as the training termination condition allows for precise control of the model's training effect, ensuring that the model possesses sufficient diagnostic capabilities before being deployed in practical applications, effectively reducing the probability of misdiagnosis and missed diagnosis. Furthermore, the model training method has good optimizability. It can be continuously fine-tuned and updated by combining clinical diagnosis and treatment data to continuously improve the model's diagnostic accuracy and applicability. 
This allows the model to adapt to the diagnostic needs of more dermatological diseases, while also adapting to the updates and iterations of medical knowledge and treatment plans, ensuring the long-term practicality and professionalism of the system.
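The training termination condition mentioned above, stopping once diagnostic accuracy reaches a preset threshold, can be sketched as a simple loop. The callables `train_one_epoch` and `evaluate_accuracy` are hypothetical placeholders for whatever training and validation routines are used, and the `max_epochs` safeguard is an added assumption that the embodiment does not mention.

```python
from typing import Callable, Tuple


def train_until_threshold(
    train_one_epoch: Callable[[], None],
    evaluate_accuracy: Callable[[], float],
    threshold: float,
    max_epochs: int = 100,
) -> Tuple[int, float]:
    """Train until accuracy reaches the threshold or max_epochs is exhausted.

    Returns the number of epochs run and the last measured accuracy.
    """
    accuracy = evaluate_accuracy()
    epochs = 0
    while accuracy < threshold and epochs < max_epochs:
        train_one_epoch()
        epochs += 1
        accuracy = evaluate_accuracy()
    return epochs, accuracy
```

A caller would compare the returned accuracy with the threshold before deployment, since the loop can also exit because the epoch limit was reached.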
[0036] For example, in intelligent dermatology healthcare, this technology is widely applied in various healthcare scenarios, including internet healthcare platforms, hospital intelligent diagnosis and treatment systems, and AI-assisted diagnostic terminals in primary healthcare. In the intelligent dermatology consultation system, a multimodal large-scale model based on the Transformer architecture is first built. Visual, semantic, and speech tone feature vectors are encoded separately using a dedicated encoder. Then, a cross-attention mechanism is used to achieve deep fusion of the multimodal encodings. This is combined with a dermatology disease knowledge base for targeted training until the diagnostic accuracy reaches the target. This technology can be applied to dermatology auxiliary diagnosis in hospitals at all levels, making the model an intelligent diagnostic assistant for doctors, quickly integrating multimodal disease data to provide professional references. It is also compatible with online consultation systems on internet healthcare platforms, achieving efficient AI preliminary diagnosis through a maturely trained model. Simultaneously, it can be implemented in primary healthcare institutions in remote areas, compensating for the shortage of professional dermatologists and providing standardized, highly accurate intelligent diagnostic support for primary care. Furthermore, this technology can also be used in the research and development of dermatology medical AI products. Through continuous training and optimization of the model, its diagnostic capabilities are constantly improved, promoting the intelligent and standardized development of dermatology diagnosis and treatment, and facilitating the widespread accessibility of high-quality medical resources.
[0037] In one embodiment, S400 includes: S401. Input the multimodal data sample into the diagnostic reasoning unit of the multimodal large model; S402. The diagnostic reasoning unit integrates visual feature vectors and semantic feature vectors, and combines them with the disease knowledge base to perform reasoning to determine whether the user's skin lesion-related information meets the diagnostic requirements. S403. If the skin lesion information meets the diagnostic requirements, output the preliminary diagnosis results and preliminary treatment suggestions; S404. If the skin lesion information does not meet the diagnostic requirements, output supplementary consultation questions.
[0038] In this embodiment, during the diagnostic reasoning stage of intelligent dermatology consultation, the unified multimodal data sample generated in the preprocessing stage is completely input into the diagnostic reasoning unit of the multimodal large model. This unit is the core module for the model to make disease judgments and can efficiently receive and parse integrated data samples that fuse visual, semantic, and voice tone features. After entering the reasoning stage, the diagnostic reasoning unit will deeply mine the core information of visual feature vectors and semantic feature vectors in the multimodal data sample to achieve the organic fusion of the two types of key features. It will fully combine the objective visual manifestations of skin lesions, such as morphology, color, and distribution, with the subjective semantic information such as the patient's symptom description, medical history, lifestyle habits, and subjective feelings. At the same time, it will call up the built-in dermatology disease knowledge base as the basis for reasoning. This knowledge base covers professional medical content such as the symptom characteristics, clinical diagnostic criteria, typical causes, and corresponding treatment plans of various common skin diseases. The reasoning unit will accurately match and logically reason with the fused feature information and the professional content in the knowledge base to comprehensively judge whether the currently collected user skin lesion-related information is sufficient to support an accurate disease diagnosis. If the reasoning determines that the skin lesion-related information is complete and sufficient to meet the professional requirements for diagnosis, the diagnostic reasoning unit will directly output a preliminary diagnosis result, including the specific disease name and corresponding diagnostic confidence level. 
Simultaneously, it will combine the disease-matching knowledge base to provide targeted preliminary treatment suggestions, covering applicable medication recommendations, daily care guidance, and necessary further examination suggestions. If the reasoning determines that the current skin lesion-related information is incomplete or lacks clear key features, and therefore does not meet the diagnostic requirements, the diagnostic reasoning unit will intelligently output targeted supplementary consultation questions based on the information gaps, continuing to collect missing key disease information from the user to lay the foundation for an accurate diagnosis.
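The branching behaviour of the diagnostic reasoning unit can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the field names in `REQUIRED_FIELDS`, the placeholder knowledge base `TREATMENT_SUGGESTIONS`, the `classify` callable standing in for the multimodal large model, and the `min_confidence` cut-off used to decide that key features are unclear.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical information items the unit needs before it attempts a diagnosis.
REQUIRED_FIELDS = (
    "lesion_morphology",
    "lesion_color",
    "lesion_distribution",
    "symptom_description",
    "medical_history",
)

# Placeholder knowledge base: disease name -> preliminary treatment suggestions.
TREATMENT_SUGGESTIONS: Dict[str, List[str]] = {
    "condition_a": ["placeholder care guidance for condition_a"],
}


@dataclass
class ReasoningOutput:
    sufficient: bool
    disease: Optional[str] = None
    confidence: Optional[float] = None
    suggestions: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)


def diagnose(
    info: Dict[str, str],
    classify: Callable[[Dict[str, str]], Tuple[str, float]],
    min_confidence: float = 0.8,
) -> ReasoningOutput:
    """Output a preliminary diagnosis, or supplementary questions if information is lacking."""
    missing = [name for name in REQUIRED_FIELDS if not info.get(name)]
    if missing:
        # Information is incomplete: ask only about the gaps.
        return ReasoningOutput(
            sufficient=False,
            questions=[f"Please provide your {name.replace('_', ' ')}." for name in missing],
        )
    disease, confidence = classify(info)
    if confidence < min_confidence:
        # Information is present but key features are unclear.
        return ReasoningOutput(
            sufficient=False,
            questions=["Please upload a clearer image of the skin lesion area."],
        )
    return ReasoningOutput(
        sufficient=True,
        disease=disease,
        confidence=confidence,
        suggestions=TREATMENT_SUGGESTIONS.get(disease, []),
    )
```

When questions are returned, the user's answers would be merged into `info` and `diagnose` called again, which gives the dynamic consultation loop described in this embodiment.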
[0039] This embodiment achieves comprehensiveness and professionalism in diagnostic criteria. By integrating visual and semantic features, it deeply combines crucial objective visual information of skin lesions and the patient's subjective description of their condition, while relying on a professional disease knowledge base for reasoning. This ensures that the diagnostic judgment is supported by multi-dimensional actual condition information and conforms to professional clinical standards, significantly improving the accuracy of preliminary diagnostic results and effectively reducing the probability of misdiagnosis and missed diagnosis. Secondly, it possesses intelligent information judgment and supplementation capabilities. The diagnostic reasoning unit does not simply output diagnostic results but first professionally assesses the completeness of the information. For cases with insufficient information, it proactively generates supplementary consultation questions, achieving a dynamic and intelligent consultation process. This avoids diagnostic biases caused by missing information, making subsequent diagnostic results more reliable. Furthermore, it significantly improves the efficiency and flexibility of consultation. For cases with sufficient information, diagnostic results and treatment suggestions can be quickly output, reducing unnecessary consultation steps and saving time for both patients and doctors. For cases with insufficient information, precise supplementary consultations can directly address the information gaps, avoiding ineffective questions and making the consultation process more targeted, thus improving overall consultation efficiency. Meanwhile, this reasoning model fully leverages the deep reasoning capabilities of multimodal large models, breaking through the limitations of single-modal data reasoning. 
It allows for complementary information from different dimensions of the patient's condition, maximizing the diagnostic value within the data and providing detailed and professional preliminary diagnostic references for subsequent doctor review, thus reducing the workload for doctors. Furthermore, the standardized reasoning process, combined with a professional knowledge base, ensures that diagnostic results and treatment recommendations have consistent clinical basis. The intelligent design of the consultation process allows the system to adapt to different users' descriptions of their symptoms, balancing diagnostic professionalism with system adaptability. This makes it suitable for various online consultation scenarios, effectively enhancing the system's practical value.
[0040] For example, in dermatology consultations, intelligent online diagnosis is widely used in various healthcare scenarios, including internet medical platforms, hospital dermatology auxiliary diagnosis and treatment systems, and AI diagnostic terminals in primary healthcare. In various online dermatology consultation platforms, the system inputs standardized multimodal data samples into a diagnostic reasoning unit. This unit integrates visual and semantic features with a dermatology disease knowledge base to complete professional reasoning. It first determines whether the skin lesion information meets diagnostic requirements. If sufficient information is available, it directly outputs a preliminary diagnosis result containing the disease name and confidence level, as well as treatment suggestions such as medication recommendations and nursing guidance. If insufficient information is available, it intelligently generates supplementary consultation questions and continuously collects key disease information. This technology is also applicable to offline hospital treatments, enabling doctors to quickly integrate patients' multimodal disease data and provide preliminary diagnostic references, improving treatment efficiency. Simultaneously, it can be implemented in primary healthcare institutions in remote areas, filling the gap in professional dermatologist resources and providing standardized diagnostic reasoning support for primary healthcare personnel. Furthermore, it can be adapted to home healthcare terminals to achieve self-service intelligent preliminary screening of skin problems, helping patients to detect conditions promptly, promoting the intelligent and convenient development of dermatology diagnosis and treatment, and extending high-quality diagnostic reasoning capabilities to more medical scenarios.
[0041] In one embodiment, after S400, the method further includes: matching the user with a corresponding attending physician based on the user's skin lesion information; sending the image data, text data, voice data, preliminary diagnostic results, and preliminary treatment suggestions to the doctor's user terminal; obtaining, through the review and editing unit on the doctor's user terminal, the doctor's modification suggestions on the preliminary diagnostic results and preliminary treatment suggestions, and generating the final diagnostic results and final treatment suggestions; and sending the final diagnostic results and final treatment suggestions to the patient's user terminal.
[0042] In this embodiment, during the doctor review phase of the intelligent dermatology consultation, the system first intelligently matches the user with appropriate dermatologists based on the user's skin lesion information, combined with the doctor's area of expertise, clinical experience, and current workload, ensuring the professionalism and relevance of the review. After matching, the system simultaneously sends the patient's original image data, text data, voice data, and the preliminary diagnosis and treatment suggestions output by the multimodal large model to the corresponding doctor's user terminal. This allows the doctor to fully grasp all the patient's condition information and the AI's preliminary diagnosis, providing detailed and complete reference for the review. The doctor, through the review editing unit on the user terminal, combines their own clinical experience to professionally review and evaluate the preliminary diagnosis and treatment suggestions provided by the AI. They can modify and supplement the diagnosis based on the actual condition, adjust and optimize the treatment suggestions, and add personalized medical notes. The system then integrates the doctor's modifications and suggestions to generate a final diagnosis and treatment suggestion with clinical professionalism. Finally, the system will promptly push the final diagnosis, final treatment recommendations, and medical notes, which have been professionally reviewed by the doctor, to the patient's user terminal, allowing the patient to obtain authoritative and accurate diagnosis and treatment conclusions.
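The matching and review flow described above can be sketched as follows. The `Doctor` fields, the rule of preferring the lowest workload and then the most experience, and the dictionary layout of the diagnosis record are assumptions chosen for illustration. The embodiment names expertise, clinical experience, and workload as matching factors but does not say how they are weighed.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Doctor:
    name: str
    specialties: FrozenSet[str]   # lesion types the doctor specializes in
    years_experience: int
    open_cases: int               # current workload


def match_doctor(lesion_type: str, doctors: List[Doctor]) -> Optional[Doctor]:
    """Pick a doctor whose expertise covers the lesion type.

    Among qualified doctors, prefer the lowest workload, then the most experience.
    """
    candidates = [d for d in doctors if lesion_type in d.specialties]
    if not candidates:
        return None
    return min(candidates, key=lambda d: (d.open_cases, -d.years_experience))


def apply_review(preliminary: Dict[str, str], doctor_edits: Dict[str, str]) -> Dict[str, object]:
    """Merge the doctor's modifications over the preliminary output to form the final result."""
    final: Dict[str, object] = {**preliminary, **doctor_edits}
    final["reviewed_by_doctor"] = True
    return final
```

Fields the doctor leaves untouched keep the preliminary values, and the final record is explicitly marked as reviewed before it is pushed to the patient's user terminal.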
[0043] In this embodiment, firstly, precise matching of diagnostic and treatment resources is achieved by combining the patient's skin lesion condition with the doctor's professional expertise, ensuring that the review work is completed by a professionally suited doctor. This significantly improves the professionalism and accuracy of the review results, guaranteeing the quality of diagnosis and treatment. Secondly, doctors can view the patient's full multimodal raw data and preliminary AI diagnostic information on the user end. This eliminates the need to recollect patient information and allows for a comprehensive understanding of the details of the patient's condition and the preliminary conclusions of the model's reasoning, greatly reducing the doctor's workload and improving the efficiency of the review process. Furthermore, the review editing unit provides doctors with ample professional adjustment space, allowing their clinical experience to effectively compensate for the limitations of AI diagnosis. This achieves an organic combination of AI technology and clinical expertise, leveraging the efficiency of AI while ensuring the safety and authority of medical diagnosis, fundamentally reducing the possibility of misdiagnosis and missed diagnosis. Simultaneously, the efficient synchronization and feedback of diagnostic and treatment information between the patient and doctor user ends allows patients to quickly obtain the final diagnosis and treatment results reviewed by a professional doctor, eliminating the need for repeated offline trips, improving the patient's consultation experience, and further shortening the overall treatment cycle. Furthermore, this process forms a dual diagnosis and treatment guarantee system of "AI preliminary diagnosis + doctor professional review", which takes into account the efficiency of online consultation and the professionalism of medical services. 
It enables online dermatology consultation to efficiently connect doctors and patients, strictly follow clinical diagnosis and treatment standards, and is applicable to various online medical service scenarios, effectively promoting the online popularization of high-quality dermatology medical resources.
[0044] For example, the doctor review and treatment process in the healthcare field is mainly implemented in scenarios such as online dermatology consultations, telemedicine consultations, and collaborative diagnosis and treatment in primary healthcare settings. It is a key link connecting AI preliminary diagnosis with professional clinical treatment. In internet healthcare platforms and hospital dermatology intelligent diagnosis and treatment systems, the system accurately matches the attending physician with the corresponding specialty based on information such as the type of skin lesions and the complexity of the condition. Then, it synchronizes the patient's full set of multimodal raw data with the AI preliminary diagnosis and treatment suggestions to the doctor's user end, allowing the doctor to have a comprehensive understanding of the patient's condition. The doctor uses the on-device review and editing unit to professionally review, modify, and supplement the AI conclusions based on clinical experience, generating a final treatment plan. The system then pushes the results to the patient's end and also supports online doctor-patient interaction and Q&A. This process is also applicable to primary healthcare institutions in remote areas. By connecting with dermatologists in higher-level hospitals for remote review, it bridges the gap in professional resources at the grassroots level, allowing patients in these areas to receive high-quality medical services. Meanwhile, this process has been implemented in the online diagnosis and treatment sections of hospitals at all levels, significantly improving the efficiency of dermatology consultations and forming a dual guarantee of "AI assistance + professional doctor review," which balances diagnosis and treatment efficiency and medical safety, and promotes the digitalization and inclusive development of dermatology medical services.
[0045] In one embodiment, a multimodal model-based intelligent question-answering generation device is provided, which corresponds one-to-one with the multimodal model-based intelligent question-answering generation method described in the above embodiments. Refer to Figure 3, which is a schematic diagram of the functional modules of a preferred embodiment of the intelligent question-answering generation device based on a multimodal model of the present invention. The device includes a data acquisition module 10, a data processing module 20, a data fusion module 30, and a data reasoning module 40. Detailed descriptions of each functional module are as follows: the data acquisition module 10 is used to acquire patient image data, text data, and voice data; the data processing module 20 is used to preprocess the image data, text data, and voice data respectively to generate visual feature vectors, semantic feature vectors, and speech intonation feature vectors; the data fusion module 30 is used to align the visual feature vectors, semantic feature vectors, and speech intonation feature vectors according to the timestamp and patient identifier, and fuse them into a multimodal data sample; and the data reasoning module 40 is used to input the multimodal data sample into the multimodal large model for reasoning, and output preliminary diagnostic results and preliminary treatment suggestions.
[0046] In one embodiment, the data acquisition module 10 includes: an image acquisition unit, used to acquire images of the skin lesion area uploaded by the user; a text input unit, used to obtain structured and unstructured text data entered by the user; and a voice acquisition unit, used to acquire voice data uploaded by the user.
[0047] In one embodiment, the data processing module 20 is configured to: perform noise reduction, color correction, and size compression on the skin lesion area image; encode and convert the processed skin lesion area image to generate a visual feature vector; perform field validation and field completion on the structured text data; analyze the voice data to extract speech intonation features, and convert the voice data into converted text; perform word segmentation, stop word removal, and part-of-speech tagging on the unstructured text data and the converted text; and convert the processed structured text data, unstructured text data, and converted text into semantic feature vectors.
[0048] In one embodiment, the data fusion module 30 is configured to: generate a globally unique patient identifier for the patient; mark the visual feature vector, semantic feature vector, and speech intonation feature vector with time stamps according to the data collection timestamp; add modality type labels to the visual feature vector, semantic feature vector, and speech intonation feature vector; align the visual feature vector, semantic feature vector, and speech intonation feature vector using the globally unique patient identifier, time stamps, and modality type labels to generate a multimodal feature vector association matrix; and fuse the multimodal feature vector association matrix to generate multimodal data samples.
[0049] In one embodiment, the model training module is configured to: pre-build a multimodal large model based on the Transformer architecture; encode the visual feature vector with the image encoder of the multimodal large model to generate a visual vector encoding; encode the semantic feature vector with the text encoder of the multimodal large model to generate a text vector encoding; encode the speech intonation feature vector with the speech encoder of the multimodal large model to generate a speech vector encoding; deeply fuse the visual vector encoding, text vector encoding, and speech vector encoding with a cross-attention mechanism to generate a multimodal vector encoding; and input the multimodal vector encoding into the decoder of the multimodal large model and train the model in conjunction with the disease knowledge base until the diagnostic accuracy of the multimodal large model reaches a preset threshold.
[0050] In one embodiment, the data inference module 40 is configured to: input the multimodal data samples into the diagnostic inference unit of the large multimodal model; integrate, by the diagnostic inference unit, the visual feature vectors and semantic feature vectors, and perform inference in combination with the disease knowledge base to determine whether the user's skin lesion information meets the diagnostic requirements; if the skin lesion information meets the diagnostic requirements, output a preliminary diagnosis and preliminary treatment recommendations; and if the skin lesion information does not meet the diagnostic requirements, output supplementary consultation questions.
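The two-way branch in this embodiment (output a preliminary result, or ask supplementary questions) can be sketched as below. The required fields, the question wording, and the `diagnose` callback standing in for the large multimodal model are all hypothetical; the application does not specify how the diagnostic requirements are checked.

```python
# Hypothetical diagnostic requirements: field -> question asked when it is missing.
REQUIRED = {
    "lesion_image": "Please upload a clear photo of the affected skin area.",
    "duration": "How long have you had this skin problem?",
    "symptoms": "Do you have itching, pain, or other symptoms?",
}


def infer(sample, diagnose):
    """Return a preliminary result if the sample meets the requirements,
    otherwise the supplementary questions to send back to the user."""
    missing = [k for k in REQUIRED if not sample.get(k)]
    if missing:
        return {
            "status": "need_more_info",
            "questions": [REQUIRED[k] for k in missing],
        }
    # `diagnose` is a stand-in for the model call and returns
    # a (diagnosis, treatment) pair.
    diagnosis, treatment = diagnose(sample)
    return {
        "status": "preliminary",
        "diagnosis": diagnosis,
        "treatment": treatment,
    }
```

Returning the questions instead of a low-confidence diagnosis keeps the consultation loop going until enough information has been collected, which matches the behavior described above.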
[0051] In one embodiment, the review and diagnostic module is configured to: match the user with a suitable attending physician based on the user's skin lesion information; send the image data, text data, voice data, preliminary diagnostic results, and preliminary treatment suggestions to the doctor's client; obtain, through the review and editing unit on the doctor's client, the doctor's modifications to the preliminary diagnosis and preliminary treatment recommendations, and generate the final diagnosis and final treatment recommendations; and send the final diagnosis and final treatment recommendations to the patient's client.
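The doctor review flow of this embodiment can be sketched as a small state change on a report object. The `Report` fields, the specialty-based matching rule, and the function names are hypothetical illustrations; the application does not state how physicians are matched or how edits are stored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Report:
    diagnosis: str
    treatment: str
    status: str = "preliminary"
    reviewer: str = ""


def match_doctor(lesion_category, doctors):
    """Pick the first doctor whose specialties cover the lesion category."""
    for name, specialties in doctors.items():
        if lesion_category in specialties:
            return name
    return None


def finalize(report, reviewer, diagnosis=None, treatment=None):
    """Apply the doctor's edits (if any) and mark the report as final."""
    return replace(
        report,
        diagnosis=diagnosis if diagnosis is not None else report.diagnosis,
        treatment=treatment if treatment is not None else report.treatment,
        status="final",
        reviewer=reviewer,
    )
```

Because `Report` is frozen, the preliminary report is left unchanged and the final report is a new object, so both versions remain available for audit.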
[0052] Specific limitations regarding the intelligent question-answering generation device based on multimodal models can be found in the aforementioned limitations on the intelligent question-answering generation method based on multimodal models, and will not be repeated here. Each module in the aforementioned intelligent question-answering generation device based on multimodal models can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.
[0053] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external clients via a network connection. When the computer program is executed by the processor, it implements the server-side functions or steps of the intelligent question-answering generation method based on a multimodal model.
[0054] In one embodiment, a computer device is provided, which may be a client, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, network interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with an external server via a network connection. When the computer program is executed by the processor, it implements the client-side functions or steps of the intelligent question-answering generation method based on a multimodal model.
[0055] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps: collecting image data, text data, and voice data of a patient; preprocessing the image data, text data, and voice data respectively to generate a visual feature vector, a semantic feature vector, and a speech intonation feature vector; aligning the visual feature vector, semantic feature vector, and speech intonation feature vector according to timestamps and patient identifiers, and fusing them into multimodal data samples; and inputting the multimodal data samples into a large multimodal model for inference, and outputting preliminary diagnostic results and preliminary treatment suggestions.
[0056] In one embodiment, a computer-readable storage medium is provided, which may be non-volatile or volatile, and on which a computer program is stored. When executed by a processor, the computer program performs the following steps: collecting image data, text data, and voice data of a patient; preprocessing the image data, text data, and voice data respectively to generate a visual feature vector, a semantic feature vector, and a speech intonation feature vector; aligning the visual feature vector, semantic feature vector, and speech intonation feature vector according to timestamps and patient identifiers, and fusing them into multimodal data samples; and inputting the multimodal data samples into a large multimodal model for inference, and outputting preliminary diagnostic results and preliminary treatment suggestions.
[0057] It should be noted that, for the functions or steps that can be implemented by the computer-readable storage medium or computer device described above, reference may be made to the corresponding server-side and client-side descriptions in the foregoing method embodiments. To avoid repetition, they are not described one by one here.
[0058] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any reference to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0059] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0060] It should be noted that any AI models, software tools, or components not belonging to this company appearing in the embodiments of this application are merely illustrative examples and do not represent actual use. All user personal information involved in the embodiments of this application has been authorized (with the knowledge and consent) by the relevant parties or has been fully authorized by all parties, and the executing entity may obtain it through various legal and compliant means. The collection, storage, use, processing, transmission, provision, and disclosure of the information, data, and signals involved all comply with relevant laws and regulations and do not violate public order and good morals.
[0061] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. An intelligent question-answering generation method based on a multimodal model, characterized in that it comprises the following steps: collecting image data, text data, and voice data of a patient; preprocessing the image data, text data, and voice data respectively to generate a visual feature vector, a semantic feature vector, and a speech intonation feature vector; aligning the visual feature vector, semantic feature vector, and speech intonation feature vector according to timestamps and patient identifiers, and fusing them into multimodal data samples; and inputting the multimodal data samples into a large multimodal model for inference, and outputting preliminary diagnostic results and preliminary treatment suggestions.
2. The intelligent question-answering generation method based on a multimodal model as described in claim 1, characterized in that collecting the image data, text data, and voice data comprises: acquiring, through an image acquisition unit, an image of the skin lesion area uploaded by the user; acquiring, through a text input unit, structured text data and unstructured text data entered by the user; and acquiring, through a voice acquisition unit, voice data uploaded by the user.
3. The intelligent question-answering generation method based on a multimodal model as described in claim 2, characterized in that preprocessing the image data, text data, and voice data respectively to generate the visual feature vector, semantic feature vector, and speech intonation feature vector comprises: performing noise reduction, color correction, and size compression on the image of the skin lesion area; encoding the processed lesion area image to generate the visual feature vector; performing field validation and field completion on the structured text data; analyzing the voice data to extract speech intonation features, and then converting the voice data into transcribed text; performing word segmentation, stop word removal, and part-of-speech tagging on the unstructured text data and the transcribed text; and converting the processed structured text data, unstructured text data, and transcribed text into semantic feature vectors.
4. The intelligent question-answering generation method based on a multimodal model as described in claim 1, characterized in that aligning the visual feature vector, semantic feature vector, and speech intonation feature vector according to timestamps and patient identifiers and fusing them into multimodal data samples comprises: generating a globally unique patient identifier for the patient; attaching timestamps to the visual feature vector, semantic feature vector, and speech intonation feature vector; adding modality type labels to the visual feature vector, semantic feature vector, and speech intonation feature vector; aligning the visual feature vector, semantic feature vector, and speech intonation feature vector using the globally unique patient identifier, the timestamps, and the modality type labels to generate a multimodal feature vector correlation matrix; and fusing the multimodal feature vector correlation matrix to generate the multimodal data samples.
5. The intelligent question-answering generation method based on a multimodal model as described in claim 1, characterized in that, before inputting the multimodal data samples into the large multimodal model for inference and outputting the preliminary diagnostic results and preliminary treatment suggestions, the method further comprises: pre-building a large multimodal model based on the Transformer architecture; encoding the visual feature vector with the image encoder of the large multimodal model to generate a visual vector encoding; encoding the semantic feature vector with the text encoder of the large multimodal model to generate a text vector encoding; encoding the speech intonation feature vector with the speech encoder of the large multimodal model to generate a speech vector encoding; deeply fusing the visual vector encoding, text vector encoding, and speech vector encoding using a cross-attention mechanism to generate a multimodal vector encoding; and inputting the multimodal vector encoding into the decoder of the large multimodal model and training the model in conjunction with the disease knowledge base until the diagnostic accuracy of the large multimodal model reaches a preset threshold.
6. The intelligent question-answering generation method based on a multimodal model as described in claim 1, characterized in that inputting the multimodal data samples into the large multimodal model for inference and outputting the preliminary diagnostic results and preliminary treatment suggestions comprises: inputting the multimodal data samples into the diagnostic inference unit of the large multimodal model; integrating, by the diagnostic inference unit, the visual feature vectors and semantic feature vectors, and performing inference in combination with the disease knowledge base to determine whether the user's skin lesion information meets the diagnostic requirements; if the skin lesion information meets the diagnostic requirements, outputting a preliminary diagnosis and preliminary treatment recommendations; and if the skin lesion information does not meet the diagnostic requirements, outputting supplementary consultation questions.
7. The intelligent question-answering generation method based on a multimodal model as described in claim 1, characterized in that, after inputting the multimodal data samples into the large multimodal model for inference and outputting the preliminary diagnostic results and preliminary treatment suggestions, the method further comprises: matching the user with a suitable attending physician based on the user's skin lesion information; sending the image data, text data, voice data, preliminary diagnostic results, and preliminary treatment suggestions to the doctor's client; obtaining, through the review and editing unit on the doctor's client, the doctor's modifications to the preliminary diagnosis and preliminary treatment recommendations, and generating the final diagnosis and final treatment recommendations; and sending the final diagnosis and final treatment recommendations to the patient's client.
8. An intelligent question-answering generation device based on a multimodal model, characterized in that the device comprises: a data acquisition module, used to collect image data, text data, and voice data of a patient; a data processing module, used to preprocess the image data, text data, and voice data respectively to generate a visual feature vector, a semantic feature vector, and a speech intonation feature vector; a data fusion module, used to align the visual feature vector, semantic feature vector, and speech intonation feature vector according to timestamps and patient identifiers, and fuse them into multimodal data samples; and a data inference module, used to input the multimodal data samples into a large multimodal model for inference, and output preliminary diagnostic results and preliminary treatment suggestions.
9. A computer device, characterized in that the computer device includes a memory, a processor, and a multimodal model-based intelligent question-answering generation program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the intelligent question-answering generation method based on a multimodal model as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores an intelligent question-answering generation program based on a multimodal model, wherein the program, when executed by a processor, implements the steps of the intelligent question-answering generation method based on a multimodal model as described in any one of claims 1-7.