Smart medical assistant
The intelligent medical assistant system addresses the lack of personalized recommendations in digital health assistants by integrating a client device with speech modules and a server with machine learning and LLM, providing interactive and accurate health information.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- PUBLICHNOE AKTSIONERNOE OBSHCHESTVO SBERBANK ROSSII (PAO SBERBANK)
- Filing Date
- 2024-12-26
- Publication Date
- 2026-06-18
AI Technical Summary
Existing digital health assistants lack the capability to generate personalized health recommendations through direct dialogue with users due to the absence of large language model integration.
An intelligent medical assistant system that includes a client device with face detection, speech-to-text and text-to-speech modules, and a server with an ensemble of machine learning models and a large language model (LLM) to provide interactive health consultations, utilizing user data and EHR for enriched responses.
Enables interactive and accurate health information provision, including real-time recommendations based on user-specific data and historical health records, enhancing user interaction and response accuracy.
Figure RU2024000397_18062026_PF_FP_ABST
Abstract
Description
INTELLIGENT MEDICAL ASSISTANT AND METHOD OF PROVIDING THE USER WITH INFORMATION ABOUT THEIR HEALTH STATUS USING IT
AREA OF TECHNOLOGY
[0001] The claimed solution relates to the field of information technology, in particular to intelligent assistants used to obtain information and advice regarding the user’s health.
LEVEL OF TECHNOLOGY
[0002] Digital assistants are increasingly being used in various areas of human life, such as sports, health, education, and more. In addition to using artificial intelligence (AI)-based solutions to obtain specific information, the relevance and accuracy of the information generated by AI is critical, as is the ability to utilize up-to-date data on user status, taking into account the feedback from machine learning models.
[0003] In healthcare, AI-based solutions are often used as a source of second opinions, for example when analyzing medical images or selecting medications.
[0004] Intelligent digital assistants, such as https://www.binah.ai/, are known to collect data about a user's health by analyzing video images captured by a smartphone camera with a mobile app installed. This data can be used to determine a range of physiological parameters, including pulse, heart rate, blood pressure, stress level, blood oxygen saturation, body mass index, and more.
[0005] A disadvantage of the known solution is its limited functionality: it does not generate recommendations using large language models (LLMs), which would allow direct dialogue with the user about their health status and the issuing of various types of recommendations.
ESSENCE OF THE INVENTION
[0006] The claimed invention provides a solution to the technical problem of creating a more effective digital interactive medical assistant for providing user consultations.
[0007] The technical result is the expansion of functional capabilities in the formation of information about the user's health by providing an interactive dialogue with a large language model that enriches responses based on information about the user's condition received in real time.
[0008] The claimed technical result is achieved by an intelligent medical assistant comprising: a client device including: at least one processor connected to at least one memory storing machine-readable instructions executable by the processor; information input/output devices comprising at least a display, speakers, a microphone and a video camera; a module for detecting faces based on data recorded by the video camera; a module for interacting with a user that converts speech into text and text into speech; a module for exchanging data with a server, connected to the input/output devices, the face detection module and the module for interacting with a user, wherein the module for exchanging data with the server is configured to transmit user requests and collected data to the server; and a server connected via a data transmission channel to the client device, wherein the server comprises: a user data processing module connected to an ensemble of machine learning models, wherein the ensemble comprises at least: a model for determining gender, age, ethnicity and emotions; a model for determining body mass index (BMI); a photoplethysmography model that determines at least heart rate and pulse; and a large language model (LLM) module connected to the user data processing module and providing for the generation of data on the user's health by sending requests to the LLM, wherein the requests are enriched with data recorded by the user data processing module; wherein the user interaction module provides voice interaction with the user by converting the user's speech to text for transmission to the LLM and by converting the text received from the server into speech played through the speakers of the client device; the information about the user's health generated by the LLM module is transmitted to the module for exchanging data with the server; and the user is provided with information about their health through the display and speakers of the client device.
[0009] In one particular implementation example, the client device is a computer, all-in-one PC, tablet, smartphone, or smart portable device.
[0010] In another particular example of implementation, the client device is configured to receive data from wearable user devices.
[0011] In another specific example of implementation, the server contains a module for connecting to the user's EHR.
[0012] In another specific implementation example, data from the user's EHR is transferred to LLM when generating recommendations.
[0013] In another particular implementation example, the photoplethysmography model additionally determines the intervals between successive heartbeats, the stress index, and the heart rate variability (PNN50).
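The derived parameters named in this example can be illustrated with a short sketch. It assumes the RR intervals are already available in milliseconds and uses the common textbook definitions of mean heart rate, SDNN and pNN50; the function name `hrv_metrics` is hypothetical, and the stress index is omitted because the text does not define how it is calculated.

```python
import math


def hrv_metrics(rr_ms):
    """Return mean heart rate (bpm), SDNN (ms) and pNN50 (%) for RR intervals in ms."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    mean_rr = sum(rr_ms) / len(rr_ms)
    # Mean heart rate: 60 000 ms per minute divided by the mean interval.
    heart_rate = 60000.0 / mean_rr
    # SDNN: sample standard deviation of the intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (len(rr_ms) - 1))
    # pNN50: share of successive differences larger than 50 ms.
    diffs = [abs(b - a) for a, b in zip(rr_ms, rr_ms[1:])]
    pnn50 = 100.0 * sum(d > 50 for d in diffs) / len(diffs)
    return {"hr": heart_rate, "sdnn": sdnn, "pnn50": pnn50}
```

A perfectly regular rhythm gives a pNN50 of zero, while intervals that alternate by 100 ms give 100%.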
[0014] The claimed technical result is also achieved through a method for obtaining information about the user's health, performed with the help of the above-mentioned assistant, wherein the method comprises the following stages: receiving a user's request for information about their health; transmitting the received request, together with information about the user's condition received from the camera of the client device, to the server; processing the received request using the LLM, taking into account the data about the current condition of the user; transmitting the response from the LLM to the user.
BRIEF DESCRIPTION OF DRAWINGS
[0015] Fig. 1 illustrates the general appearance of the claimed intelligent assistant.
[0016] Fig. 2 illustrates the general appearance of the computing device.
IMPLEMENTATION OF THE INVENTION
[0017] As shown in Fig. 1, the claimed intelligent assistant is a distributed system consisting of a client part, namely a client device (200), connected via a data transmission channel to the server (300). The user (110) interacts with the client device (200) via input/output devices (201), which include at least the following devices: one or more microphones, one or more video cameras, speakers and a display (touch, resistive, LCD, TFT, etc.). The specified set of input/output devices (201) is basic and can be expanded depending on the type and form factor of the client device (200). Additionally, the devices (201) may include a keyboard, joystick, mouse-type manipulator, touchpad, trackball, touch panel, etc.
[0018] The client device (200) is controlled by one or more processors (201), which provide the main control function and logical processing of data and signals arising during the operation of the claimed solution. The processor (201) (or several processors, a multi-core processor) can be selected from a range of devices widely used today, for example, from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. The processor should also include a graphics processor, for example, an NVIDIA or ATI GPU.
[0019] The device (200) also contains at least one memory (202) storing machine-readable instructions executed by the processor (201). The memory (202) may be random access memory (RAM), ROM in the form of a permanent data storage, for example, a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), or combinations of several types of storage devices.
[0020] The client device (200) also contains a face detection module (204) implemented as a software application, for example, a Python application using the MediaPipe library. Module (204) provides continuous detection of the face of the user (110) on each processed frame captured by the video camera.
[0021] The user interaction module (205) performs speech-to-text and text-to-speech conversion, thereby enabling voice interaction with the server (300) and receiving feedback in the form of spoken responses. Module (205) can be implemented as a Python application using the Flask library. Module (205) can also support the operation and display of the user interface and visual information shown on the display of the device (200).
[0022] The server data exchange module (206) facilitates the exchange of messages and data between the client device (200) and the server (300). The module (206) can use the requests software library to send data to individual model modules and the user data processing module (301).
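The message exchange described here might look roughly like the sketch below. The text names the requests library; this sketch uses the standard-library `urllib.request` instead so that it runs without extra dependencies, and it only builds the HTTP request without sending it. The endpoint URL, the payload field names and the function `build_request` are all hypothetical and are not taken from the application.

```python
import json
import urllib.request

# Hypothetical endpoint of the user data processing module (301).
SERVER_URL = "https://assistant.example.invalid/api/v1/user-data"


def build_request(session_id, user_text, frame_b64):
    """Package a user request and one camera frame as a JSON POST request."""
    payload = {
        "session_id": session_id,   # identifies the dialogue session
        "request_text": user_text,  # text produced by speech-to-text
        "frame": frame_b64,         # base64-encoded camera frame
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual transfer; with the requests library the equivalent call would be `requests.post(SERVER_URL, json=payload)`.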
[0023] The client device (200) may be a device such as a smartphone, all-in-one PC, tablet, computer, laptop, television, etc., with the necessary software installed for interaction with the server (300). In one particular implementation example, the client device (200) may be implemented as a portable smart device, such as a wearable smartwatch, virtual reality glasses or headset, smart display (e.g., SberPortal), etc. Moreover, it should be obvious to a person skilled in the art that the client device (200) may be implemented as two or more of the above-mentioned devices, connected via standard protocols and communication principles.
[0024] The server (300) is a computing environment implemented as one or more individual computing devices (a server cluster), or as a cloud server. The server (300) contains the main components that process requests from the user (110) and data received from the client device (200).
[0025] The information recorded by the input/output devices (201) while the user (110) interacts with the device (200) is transmitted via the module (206) for data exchange with the server to the user data processing module (301). The module (301) processes incoming data in two ways: it generates requests (prompts) to the LLM (304) via the LLM module (303), and it processes the data recorded by the camera of the client device (200) using an ensemble of machine learning models (302), which determine the indicators and information about the health of the user (110) and pass them on to enrich the generated requests to the LLM (304).
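The enrichment of LLM prompts with ensemble outputs can be sketched as plain string assembly. The prompt wording, the rounding of values and the function name `build_prompt` are assumptions made for illustration; the application does not disclose the actual prompt template.

```python
def build_prompt(user_question, biomarkers):
    """Combine the user's question with current measurements into one prompt."""
    lines = [
        "You are a medical assistant. Take the measurements below into account.",
        "Current measurements:",
    ]
    for name, value in sorted(biomarkers.items()):
        # Round floating-point readings so the prompt stays compact.
        if isinstance(value, float):
            value = round(value, 1)
        lines.append(f"- {name}: {value}")
    lines.append(f"User question: {user_question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Is my pulse normal?", {"hr": 55.982, "emotion": "Surprise"}))
```

Keeping the measurements in a fixed, sorted order makes prompts for consecutive frames easy to compare.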
[0026] The machine learning model ensemble (302) includes the following models.
[0027] User Identification Model. A face recognition model is used for user identification, authorization, and session management. The face image is resized to 224x224 pixels and loaded into the fast OFAMobileNetV3 model to extract face descriptors. The model is pre-trained on a portion of the VGGFace2 dataset. The training dataset contains feature vectors of known faces, obtained using the same procedure given their facial photographs. The extracted face embeddings are normalized and compared to the training dataset. A test dataset with 426 photographs of 104 people was used in the pilot study. The resulting recognition accuracy is 98.19%.
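The normalize-and-compare step can be illustrated as follows. The text only says that embeddings are normalized and compared; using cosine similarity with a fixed acceptance threshold is an assumption, as are the names `l2_normalize` and `identify` and the threshold value of 0.6.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def identify(embedding, gallery, threshold=0.6):
    """Return (user_id, score) for the closest known face, or (None, score)."""
    query = l2_normalize(embedding)
    best_id, best_score = None, -1.0
    for user_id, reference in gallery.items():
        # Dot product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(query, l2_normalize(reference)))
        if score > best_score:
            best_id, best_score = user_id, score
    if best_score < threshold:
        return None, best_score  # treat as an unknown face
    return best_id, best_score
```

Here `gallery` stands for the stored feature vectors of known faces, keyed by user identifier.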
[0028] Socio-demographic model. The high-performance MobileNet-V1 model is used to predict age, gender, and ethnicity. The model was pre-trained to recognize faces from the VGGFace2 and Adience datasets. The ethnicity classifier was trained on a subset of the UTKFace dataset with different class weights to achieve better performance for imbalanced classes. The resulting gender classification accuracy is 93.79%. The mean absolute error (MAE) of age prediction is 5.74 years. The ethnicity prediction accuracy is 87.6%, which is at the state-of-the-art level.
Facial expression recognition. The EfficientNet-B0 model is used to predict the emotions of the user (110). Pre-training was performed on the VGGFace2 dataset. Three classification heads were then added to predict eight basic expressions (Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise), valence, and arousal using the training portion of the AffectNet dataset. Unlike typical face recognition training, tightly cropped faces were used in both training stages to remove hair, background, and other elements, so that the model focuses solely on facial expressions. On the validation portion of the AffectNet dataset, the accuracy for the eight emotion classes was 61.93%.
[0029] Body Mass Index Model. Body mass index (BMI) estimation is approached as an image regression problem. The dataset for this task includes samples (face-BMI value pairs) from Reddit, a public social web platform, enriched with the FIW-BMI dataset. The detected faces are resized to 256 x 256 pixels and fed into a ResNet34 model as a baseline, with MSE (Mean Squared Error) applied as both the metric and the loss function.
[0030] Remote Photoplethysmography (rPPG) Model. The rPPG model extracts a photoplethysmography signal from video of the user's face. The model utilizes the Plane Orthogonal-to-Skin (POS) algorithm, which applies signal processing techniques to subtle changes in RGB values over time. The algorithm averages pixel value changes across regions of interest on the face (cheeks, nose, and forehead), which are most valuable for pulse wave detection, and offers high accuracy, reliability, and speed. rPPG calculation also yields several derived cardiovascular health parameters, such as heart rate, RR intervals (intervals between consecutive heartbeats), stress index, PNN50, and many others. All these health indicators enable a more accurate and comprehensive user experience (UX), improving user interaction.
[0031] By processing the health indicators of the user (110), the module (301) performs subsequent processing of the responses and, if necessary, generates additional questions for the user (110). Depending on the context of the dialogue with the user (110), the medical biomarkers of the user (110), obtained using an ensemble of machine learning models (302), are included in the prompts for the LLM (304). Additionally, the server (300) may contain a module for connecting to the EHR of the user (110), providing historical data on the user's health (e.g., test results, prescribed medications, doctor's recommendations, research results, etc.). This data may be additionally transmitted to the LLM (304) when requests are generated using the module (303), for a more accurate response to the request from the user (110).
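A minimal sketch of the POS idea follows, based on its commonly published formulation (temporal normalization, projection onto two axes orthogonal to the skin-tone direction, and overlap-adding of windows). It assumes the per-frame input is already the mean RGB value over the facial regions of interest. The window length, the synthetic test signal and the zero-crossing pulse estimate are illustrative assumptions and are not taken from the application.

```python
import math
import statistics


def synthetic_rgb(fps, seconds, bpm):
    """Generate mean-RGB samples with a small sinusoidal pulse (test data only)."""
    freq = bpm / 60.0
    frames = []
    for i in range(int(fps * seconds)):
        p = math.sin(2.0 * math.pi * freq * i / fps)
        frames.append((100.0 * (1 + 0.0033 * p),
                       100.0 * (1 + 0.0077 * p),
                       100.0 * (1 + 0.0053 * p)))
    return frames


def pos_pulse_signal(rgb, window):
    """Extract a pulse signal from per-frame mean (R, G, B) tuples."""
    n = len(rgb)
    pulse = [0.0] * n
    for start in range(0, n - window + 1):
        chunk = rgb[start:start + window]
        # Temporal normalization: divide each channel by its window mean.
        means = [sum(px[c] for px in chunk) / window for c in range(3)]
        norm = [[px[c] / means[c] for c in range(3)] for px in chunk]
        # Projection onto the plane orthogonal to the skin tone.
        s1 = [g - b for r, g, b in norm]
        s2 = [-2.0 * r + g + b for r, g, b in norm]
        sd2 = statistics.pstdev(s2)
        alpha = statistics.pstdev(s1) / sd2 if sd2 > 0 else 0.0
        h = [a + alpha * b for a, b in zip(s1, s2)]
        h_mean = sum(h) / window
        # Overlap-add the zero-mean window into the output signal.
        for i, value in enumerate(h):
            pulse[start + i] += value - h_mean
    return pulse


def estimate_bpm(pulse, fps):
    """Estimate beats per minute from upward zero crossings."""
    crossings = [i for i in range(1, len(pulse)) if pulse[i - 1] < 0.0 <= pulse[i]]
    if len(crossings) < 2:
        return None
    duration_s = (crossings[-1] - crossings[0]) / fps
    return 60.0 * (len(crossings) - 1) / duration_s
```

For example, `estimate_bpm(pos_pulse_signal(synthetic_rgb(30, 10, 72), 48), 30)` should recover a value close to the 72 beats per minute used to generate the signal. Real video would additionally need band-pass filtering and more robust peak detection.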
[0032] The user data processing module (301) may contain a database for enriching prompts using RAG (Retrieval-Augmented Generation) technology and may utilize Chain-of-Thought (CoT) prompting. This module (301) also processes images accompanying the dialogue, so that information extracted from them can supplement the text response. For example, the user may show a photograph of a skin condition that bothers them and specify the name of the pathology, so that the assistant can display and/or play back as audio a sequence of possible actions depending on the lesion.
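The retrieval step of RAG can be illustrated with a deliberately simple sketch. Production systems typically rank documents by embedding similarity; here plain keyword overlap stands in for that ranking, which is an assumption made to keep the example dependency-free. The function names and the sample snippets are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, k=1):
    """Return up to k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & tokenize(doc)),
                    reverse=True)
    # Drop documents with no overlap at all.
    return [doc for doc in ranked[:k] if query_words & tokenize(doc)]


def enrich_prompt(query, documents, k=1):
    """Prepend retrieved reference snippets to the user's question."""
    context = retrieve(query, documents, k)
    parts = ["Reference material:"] + [f"- {doc}" for doc in context]
    parts.append(f"Question: {query}")
    return "\n".join(parts)
```

The enriched prompt is then what would be sent to the LLM in place of the bare question.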
[0033] All biomarkers and data obtained from these two processing modes are captured in the assistant's context and are available to the LLM (304) when generating responses. At each frame, the assistant's logic module receives a set of measured parameters. This frequency of updating the state of the user (110) allows the assistant to maintain context while interacting with the user and to respond quickly to events when generating responses via the LLM (304).
[0034] Below is an example of a set of biomarkers collected by processing data from the video camera of a client device (200).
{
  "session_id": "871",
  "user_id": "3769",
  "event_type": "status report",
  "data": {
    "face_present": true,
    "age": 35.996394872665405,
    "is_male": 1,
    "ethnicity": "White",
    "emotion": "Surprise",
    "bmi": 25.82625252859933,
    "diabetes": 0.023319726411637424,
    "hr": 55.982491342644934,
    "stress": 66.84491978609626,
    "sdnn": 101.8650685695249,
    "pnn50": 0,
    "upper_ap": 120.82511434450733,
    "lower_ap": 71.71776270318506,
    "glycated_hemoglobin": 5.566059725696541,
    "cholesterol": 4.102576252565574,
    "respiratory": 17.982170867779345,
    "rigidity": 7.510560915144876,
    "nn_pulse": 77.19444063690175,
    "nn_age": 21.943865800617132,
    "nn_gender": 0.9979836797191464,
    "nn_bmi": 23.7164672198402
  },
  "timestamp": "2024-10-25T16:25:15.543903"
}
[0035] In the example above, when forming a response to a question about the pulse rate (hr=55.98), the user's current emotional state (Surprise) will also be taken into account, so that the response includes more neutral and reassuring expressions along with the factual information (a value slightly below the norm of 60-90 beats).
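How a status report might steer the wording of a response can be sketched as below. The 60-90 range comes from the example in the text; the set of emotions that trigger a reassuring tone and the names `describe_heart_rate` and `response_hints` are assumptions for illustration only.

```python
# Reference range quoted in the example above (beats per minute).
HR_NORM = (60, 90)

# Assumption: emotions for which a calmer wording is preferred.
REASSURE_FOR = {"Surprise", "Fear", "Sadness", "Anger"}


def describe_heart_rate(hr, norm=HR_NORM):
    """Classify a heart rate as below, within or above the reference range."""
    low, high = norm
    if hr < low:
        return "below"
    if hr > high:
        return "above"
    return "within"


def response_hints(report):
    """Derive prompt hints from the 'data' section of a status report."""
    data = report["data"]
    tone = "reassuring" if data.get("emotion") in REASSURE_FOR else "neutral"
    return {"hr_status": describe_heart_rate(data["hr"]), "tone": tone}
```

For the report shown above, `response_hints` returns a heart rate status of "below" and a "reassuring" tone, which could then be added to the prompt as an instruction to the LLM.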
[0036] Additionally, the claimed solution can determine glycated hemoglobin and cholesterol levels, which are biochemical parameters of human blood. Significant deviations in these values from the norm affect cardiac function and, consequently, pulse wave characteristics. The claimed assistant determines these values by approximating (reconstructing) them from pulse wave characteristics. This is accomplished using a trained neural network model, which can be part of the ensemble of models (302). This model is trained using a dataset consisting of video recordings of people's faces along with target parameter values. This model performs facial segmentation, identifies regions of interest (superpixels: chin, forehead, cheeks), “measures” color intensity (caused by blood flow to the skin), and encodes these values into the internal representation of the neural network to obtain reconstructed target values (glycated hemoglobin, cholesterol, etc.) in the final layer.
[0037] Consider the interaction between a user (110) and the medical assistant. The user (110) initiates the process using a client device (200). This can be accomplished either by interacting directly with the graphical user interface (GUI) shown on the device's display or by using voice commands. When the dialogue is activated, the user's face is captured by the detection module (204) and subsequently recognized when fed to the ensemble of machine learning models (302).
[0038] During the dialogue with the assistant, the user (110) asks questions of interest to him about his health (or answers questions initiated by the assistant), and, if desired, accompanies the questions with additional materials (images, EHR data, etc.).
[0039] For example, a user asks, "Is my heart rate normal for exercise right now?" The assistant determines the current heart rate and pulse rate of the user (110) by processing the data captured by the video camera through an ensemble of models (302). The server (300) records the response from the ensemble of models (302) containing the current parameters of the user (110) and transmits them along with the received request to the LLM (304) via module (303). The LLM (304) analyzes the received request, enriched with the health data of the user (110), and generates a response, for example: "Currently, your heart rate is elevated and is 115 beats per minute, which is dangerous for your health during vigorous exercise. For a comfortable state during physical activity, it is recommended to perform them at a heart rate of 60-100 beats per minute."
[0040] The LLM (304) generates the response as text and transmits it to the module (301) for delivery to the user (110): the module (205) converts the text of the LLM (304) response into speech audio, which is played through the speakers, and the data is additionally shown to the user (110) on the display of the client device (200).
[0041] A request from the user (110) can be translated into a request to the LLM (304) in several modalities at once (text and images) and enriched with the medical parameters (biomarkers) of the user (110). Various large language models can be used as the LLM (304), such as GigaChat, LLaMA, and others.
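The flow of one dialogue turn, from spoken question to spoken answer, can be summarized in a short sketch. Every component is passed in as a callable, so the sketch makes no claim about how speech recognition, the model ensemble, the LLM or speech synthesis are actually implemented; the function name `handle_turn` and the prompt layout are hypothetical.

```python
import json


def handle_turn(audio, frame, speech_to_text, run_ensemble, ask_llm, text_to_speech):
    """Process one dialogue turn and return the answer as text and audio."""
    question = speech_to_text(audio)   # client side: speech to text
    biomarkers = run_ensemble(frame)   # server side: model ensemble
    prompt = f"{question}\nCurrent measurements: {json.dumps(biomarkers)}"
    answer = ask_llm(prompt)           # server side: LLM response as text
    return {
        "text": answer,                   # shown on the display
        "audio": text_to_speech(answer),  # played through the speakers
    }


if __name__ == "__main__":
    result = handle_turn(
        b"...", b"...",
        speech_to_text=lambda a: "Is my heart rate normal?",
        run_ensemble=lambda f: {"hr": 115},
        ask_llm=lambda p: "Your heart rate is 115 beats per minute.",
        text_to_speech=lambda t: t.encode("utf-8"),
    )
    print(result["text"])
```

Injecting the components as callables keeps the sequence of steps testable with simple stand-ins.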
[0042] After providing the user with an answer, the assistant can ask clarifying questions, or the user can continue the dialogue to obtain more detailed and accurate information regarding their health status.
[0043] The client device may additionally contain an interface for receiving data from wearable user devices, such as smart bracelets, smart watches, smart rings, heart rate monitors, Holter monitors, etc. Information from these devices may be used to further assess the health status of the user (110) when receiving responses from the LLM (304), by further enriching the prompts sent to it. This further improves the accuracy of the recommendations issued and of other information regarding the health of the user (110).
[0044] Fig. 2 shows a general view of a computing device (400), with which either the client device (200) or the server (300) can be fully or partially implemented. In the general case, the computing device (400) contains one or more processors (401), memory means such as RAM (402) and ROM (403), input/output interfaces (404), input/output devices (405), and a device for network interaction (406), united by a common information exchange bus.
[0045] The processor (401) (or multiple processors, a multi-core processor) can be selected from a range of devices widely used today, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, and others. The processor also includes a graphics processor, such as an NVIDIA or ATI GPU. The memory capacity of the graphics card or graphics processor can also be used.
[0046] RAM (402) is random access memory (RAM) and is designed to store machine-readable instructions executed by the processor (401) to perform the necessary logical data processing operations. RAM (402) typically contains executable instructions from the operating system and corresponding software components (applications, software modules, etc.).
[0047] ROM (403) is one or more permanent data storage devices, such as a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
[0048] To organize the operation of the components of the device (400) and of externally connected devices, various types of I/O interfaces (404) are used. The choice of the appropriate interfaces depends on the specific design of the computing device; they may include, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.
[0049] To ensure user interaction with the computing device (400), various information I/O means (405) are used, for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality means, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification means (a retinal scanner, a fingerprint scanner, a voice recognition module), etc.
[0050] The network interaction means (406) ensures the transmission of data by the device (400) via an internal or external computer network, for example, an intranet, the Internet, a LAN, etc. One or more means (406) may be, but are not limited to: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.
[0051] Additionally, satellite navigation tools included in the device (400) can also be used, for example, GPS, GLONASS, BeiDou, Galileo.
[0052] The submitted application materials disclose preferred examples of the implementation of the technical solution and should not be interpreted as limiting other, particular examples of its implementation that do not go beyond the scope of the requested legal protection, which are obvious to specialists in the relevant field of technology.
Claims
1. An intelligent medical assistant comprising: a client device that includes: at least one processor connected to at least one memory storing machine-readable instructions executable by the processor; information input/output devices comprising at least a display, speakers, a microphone, and a video camera; a face detection module based on data recorded by the video camera; a user interaction module that converts speech to text and text to speech; a module for data exchange with a server, connected to the input/output devices, the face detection module, and the user interaction module, wherein the module for data exchange with the server is configured to transmit user requests and collected data to the server; and a server connected via a data transmission channel to the client device, wherein the server comprises: a user data processing module connected to an ensemble of machine learning models, wherein the ensemble comprises at least: a model for determining gender, age, ethnicity and emotions; a model for determining body mass index (BMI); a photoplethysmography model that determines at least heart rate and pulse; a user identification model; and a large language model (LLM) module connected to the user data processing module and providing for the formation of data on the user's health by sending queries to the LLM, wherein the queries are enriched with data recorded by the user data processing module; wherein the user interaction module provides voice interaction with the user by converting the user's speech into text for transmission to the LLM, as well as converting the text received from the server into speech played through the speakers of the client device; the user's health information generated by the LLM module is transmitted to the module for data exchange with the server; and the user is provided with health information through the display and speakers of the client device.
2. The assistant according to claim 1, characterized in that the client device is a computer, all-in-one computer, tablet, smartphone, or intelligent portable device.
3. The assistant according to claim 2, characterized in that the client device is configured to receive data from wearable user devices.
4. The assistant according to claim 1, characterized in that the server contains a module for connecting to the user’s EHR.
5. The assistant according to claim 4, characterized in that the data from the user’s EHR is transferred to the LLM when generating recommendations.
6. The assistant according to claim 1, characterized in that the photoplethysmography model additionally determines intervals between successive heartbeats, a stress index, and heart rate variability (PNN50).
7. A method for obtaining information about a user's health, performed with the help of an assistant according to any of claims 1-6, comprising the following steps: receiving a user's request to provide information about their health; transmitting the received request with information about the user's condition obtained from the camera of the client device to the server; processing the received request taking into account data about the current condition of the user using the LLM; transmitting a response from the LLM to the user.