Model training method, lip-reading recognition method, device, electronic equipment and medium

By constructing a multimodal corpus and training a lip-reading recognition model using deep learning technology, the problem of communication difficulties caused by postoperative loss of voice in laryngeal cancer patients has been solved. This has achieved efficient and accurate lip-reading recognition, improving the convenience of postoperative life and mental health for patients.

CN119993156B · Active · Publication Date: 2026-06-30 · THE FIRST AFFILIATED HOSPITAL OF SUN YAT SEN UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: THE FIRST AFFILIATED HOSPITAL OF SUN YAT SEN UNIV
Filing Date: 2025-01-24
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Postoperative loss of voice in patients with mid-to-late stage laryngeal cancer and hypopharyngeal cancer leads to communication difficulties, affecting the efficiency of communication between patients and their caregivers, medical staff, and family members, as well as their mental health. Existing lip-reading technology lacks adaptation to dialect environments and multimodal fusion, resulting in poor recognition performance.

Method used

A multimodal corpus was constructed, combining regional dialect characteristics with otolaryngology medical scenarios. Audio-visual data before and after surgery were collected, and text, emotion, and medical entity annotations were performed. A lip-reading recognition model was trained, and deep learning technology was used to improve recognition accuracy.

Benefits of technology

The method improves the efficiency and accuracy of lip reading, addresses the communication difficulties of patients in a voiceless state, and improves the convenience of postoperative life and patients' mental health.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a model training method, a lip-reading recognition method, a device, an electronic device, and a medium. It acquires audio-visual data corresponding to a patient undergoing medical surgery, preprocesses the audio-visual data, and uses a trained lip-reading recognition model to perform lip-reading recognition on the preprocessed audio-visual data, outputting the lip-reading recognition result. This application can automatically recognize lip-reading in patient audio-visual data, improving the efficiency and accuracy of lip-reading recognition, solving the problem of communication difficulties for patients in a voiceless state, and improving the convenience of postoperative life for patients. It can be widely applied in the field of visual and speech recognition technology.

Description

Technical Field

[0001] This application relates to the field of visual speech recognition technology, and in particular to a model training method, a lip reading recognition method, a device, an electronic device, and a medium.

Background Technology

[0002] Every year, a large number of patients with mid-to-late stage laryngeal cancer and hypopharyngeal cancer undergo partial laryngectomy, total laryngectomy, or tracheotomy. These patients are often able to speak and communicate normally before surgery, but may experience sudden loss of voice after surgery, leading to the following negative effects:

[0003] 1) Inability to communicate normally with caregivers: This prevents patients from expressing their wishes in a timely manner. Whether it is physiological needs or physical discomfort, failure to address these issues promptly may affect the patient's postoperative recovery.

[0004] 2) Inability to communicate normally with medical staff: The decline in the efficiency of medical staff is one problem; in addition, if the patient's wishes are not conveyed in a timely and accurate manner, it may cause misunderstandings, affect the doctor-patient relationship and even medical decisions.

[0005] 3) Inability to communicate normally with family members: Patients will feel the inconvenience of postoperative life to the greatest extent, which is difficult to accept psychologically and is not conducive to mental health and postoperative recovery.

[0006] Therefore, the above problems urgently need to be solved.

Summary of the Invention

[0007] The main objective of this application is to propose a model training method, lip reading recognition method, device, electronic device, and medium that can automatically recognize the lip reading of a patient's audio-visual recordings, improve the efficiency and accuracy of lip reading recognition, and solve the problem of communication difficulties for patients in a voiceless state.

[0008] In one aspect, embodiments of this application propose a model training method, which includes the following steps:

[0009] Obtain a sample dataset from a multimodal corpus; the sample dataset includes multiple items of patient audio-visual data with text tags;

[0010] Divide the sample dataset into a training set and a test set;

[0011] Construct a lip-reading recognition model and train the lip-reading recognition model using the training set;

[0012] Evaluate the trained lip-reading recognition model using the test set to determine whether to continue training the lip-reading recognition model.

[0013] In some embodiments, obtaining the sample dataset from the multimodal corpus specifically includes:

[0014] Construct the multimodal corpus;

[0015] Collect pre- and post-operative audio-visual data for multiple medical surgical patients, and perform data preprocessing on the pre- and post-operative audio-visual data; the pre- and post-operative audio-visual data includes pre-operative audio-visual data and post-operative audio-visual data; the medical surgical patients include patients who have undergone partial laryngectomy, total laryngectomy, or tracheotomy.

[0016] Text recognition and annotation are performed on the preprocessed pre- and post-operative audio-visual data to obtain text tags corresponding to each pre- and post-operative audio-visual data, and the pre- and post-operative audio-visual data containing the text tags are stored in the multimodal corpus.

[0017] In some embodiments, the step of performing text recognition and annotation on the preprocessed pre- and post-operative audio-visual data to obtain text tags corresponding to each of the pre- and post-operative audio-visual data specifically includes:

[0018] A speech recognition model is used to perform text recognition on the pre- and post-operative audio-visual data, determining multiple text annotation contents and the video timestamp and dialect type corresponding to each text annotation content.

[0019] A sentiment computing model is used to perform sentiment recognition on each text annotation content to determine the sentiment feature label corresponding to each text annotation content.

[0020] The medical entity annotation model is used to perform medical entity annotation on each of the text annotation contents to determine the medical entity annotation corresponding to each of the text annotation contents.

[0021] The pre- and post-operative audio-visual data are divided into video frames to obtain multiple video frames and timestamps corresponding to each video frame.

[0022] Based on the dialect type, the sentiment feature label, and the medical entity annotation corresponding to each of the text annotation contents, determine the text description label corresponding to each of the text annotation contents;

[0023] Based on the video timestamp corresponding to each of the text annotation contents and the timestamp corresponding to each of the video frames, the text annotation contents and each video frame are paired to generate video frame text pairing tags corresponding to each of the text annotation contents;

[0024] The text tags are determined based on the video frame text pairing tags corresponding to each of the text annotation contents.

[0025] In some embodiments, the step of pairing text annotation content with each video frame based on the video timestamp corresponding to each text annotation content and the timestamp corresponding to each video frame, and generating video frame text pairing tags corresponding to each text annotation content, specifically includes:

[0026] Based on the video timestamp corresponding to each of the text annotation contents and the timestamp corresponding to each of the video frames, determine a number of target video frames corresponding to each of the text annotation contents from a plurality of video frames;

[0027] The text description tags corresponding to each of the text annotation contents are associated and bound with each of the target video frames to generate the video frame text pairing tags for each of the text annotation contents.

[0028] In some embodiments, the step of evaluating the trained lip-reading model using the test set to determine whether to continue training the lip-reading model specifically includes:

[0029] Obtain multiple model evaluation metrics and the threshold corresponding to each model evaluation metric;

[0030] Input the test set into the lip-reading recognition model, which outputs the corresponding lip-reading recognition results;

[0031] Evaluate the lip-reading recognition model based on the model evaluation metrics and the lip-reading recognition results to determine the model's current evaluation value for each model evaluation metric;

[0032] Determine the model evaluation result based on the threshold and the current evaluation value for each model evaluation metric, and determine, based on the model evaluation result, whether to continue training the lip-reading recognition model;

[0033] When the model evaluation result is that the current evaluation value for every model evaluation metric exceeds the corresponding threshold, training of the lip-reading recognition model stops; otherwise, training continues.
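The stop-or-continue decision in the preceding evaluation steps can be illustrated with a minimal sketch. The function name, the metric names, and the numeric values below are hypothetical and do not come from the application; the sketch also assumes metrics for which a higher value is better (such as accuracy), so an error-rate metric would have to be inverted first.

```python
from typing import Mapping


def should_stop_training(
    current: Mapping[str, float],
    thresholds: Mapping[str, float],
) -> bool:
    """Return True only if every metric's current value exceeds its threshold.

    Assumes higher-is-better metrics; this is an illustrative assumption.
    """
    missing = set(thresholds) - set(current)
    if missing:
        raise KeyError(f"no current value for metrics: {sorted(missing)}")
    return all(current[name] > thresholds[name] for name in thresholds)


# Hypothetical metric names and values, for illustration only.
thresholds = {"sentence_accuracy": 0.90, "char_accuracy": 0.95}
passing = {"sentence_accuracy": 0.93, "char_accuracy": 0.97}
failing = {"sentence_accuracy": 0.93, "char_accuracy": 0.94}

print(should_stop_training(passing, thresholds))  # True: stop training
print(should_stop_training(failing, thresholds))  # False: continue training
```

Requiring every metric to pass, rather than an average, matches the wording that training stops only when each evaluation value exceeds its own threshold.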

[0034] In another aspect, embodiments of this application propose a lip-reading recognition method, which includes the following steps:

[0035] Obtain the audio-visual data to be identified corresponding to the medical surgery patient, and perform data preprocessing on the audio-visual data to be identified;

[0036] The lip-reading recognition model is used to perform lip-reading recognition on the preprocessed audio-visual data to be recognized, and the lip-reading recognition result is output; the lip-reading recognition model is trained by the model training method described above.

[0037] In some embodiments, the step of using a lip-reading recognition model to perform lip-reading recognition on the preprocessed audio-visual data to be recognized and outputting the lip-reading recognition result specifically includes:

[0038] The lip-reading recognition model is used to perform video frame segmentation and lip-reading recognition on the preprocessed audio-visual data to be identified, and outputs multiple pieces of text recognition information corresponding to the audio-visual data to be identified, together with the multiple video frames to be paired that correspond to each piece of text recognition information.

[0039] The lip-reading recognition model is used to pair each piece of text recognition information with its corresponding video frames to be paired, and the text pairing result of each piece of text recognition information is determined. Based on these text pairing results, each piece of text recognition information is inserted into the audio-visual data to be identified, and the corresponding lip-reading recognition audio-visual data is output.
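One way to picture how recognized text could be inserted into the audio-visual data is as timed captions. The sketch below is an assumption for illustration only: the application does not name a caption format, and SubRip (SRT) is chosen here simply because it is widely supported. The function names, the frame rate, and the sample text are hypothetical.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(results: Iterable[Tuple[str, int, int]], fps: float) -> str:
    """Build SRT text from (text, first_frame_index, last_frame_index) tuples."""
    blocks = []
    for number, (text, first, last) in enumerate(results, start=1):
        start = first / fps
        end = (last + 1) / fps  # the cue lasts until the end of the last frame
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical recognition result: one sentence paired with frames 0 to 24.
print(to_srt([("I am thirsty", 0, 24)], fps=25))
```

A player can overlay the resulting caption file on the original recording, which leaves the source video unmodified.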

[0040] In another aspect, embodiments of this application propose a lip-reading recognition device, the device comprising:

[0041] The first module is used to acquire the audio-visual data to be identified corresponding to the medical surgery patient and to perform data preprocessing on the audio-visual data to be identified.

[0042] The second module is used to perform lip reading on the preprocessed audio-visual data to be identified using a lip reading recognition model and output the lip reading recognition result; the lip reading recognition model is trained using the model training method described above.

[0043] In another aspect, embodiments of this application propose an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the model training method or the lip-reading recognition method described above.

[0044] In another aspect, embodiments of this application propose a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned model training method or the aforementioned lip-reading recognition method.

[0045] The embodiments of this application include at least the following beneficial effects: This application provides a model training method, lip-reading recognition method, device, electronic device, and medium. It acquires audio-visual data corresponding to a surgical patient, preprocesses the audio-visual data, uses a trained lip-reading recognition model to perform lip-reading recognition on the preprocessed audio-visual data, and outputs the lip-reading recognition result. This application can automatically recognize lip-reading in patient audio-visual data, improving the efficiency and accuracy of lip-reading recognition, solving the problem of communication difficulties for patients in a voiceless state, and improving the convenience of postoperative life for patients.

Attached Figure Description

[0046] Figure 1 is a flowchart of a model training method provided in an embodiment of this application;

[0047] Figure 2 is a flowchart of a lip-reading recognition method provided in an embodiment of this application;

[0048] Figure 3 is a schematic diagram of the structure of a lip-reading recognition device provided in an embodiment of this application;

[0049] Figure 4 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.

[0051] It is understood that the terms “first,” “second,” etc., may be used in this application to describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word “if” as used herein may be interpreted as “at the time of…”, “when…”, or “in response to a determination.”

[0052] As used in this application, the terms "at least one", "multiple", "each", and "any" have the following meanings: "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding multiple items; and "any" refers to any one of the multiple items.

[0053] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0054] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirection to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments acquired.

[0055] Currently, a large number of patients with mid-to-late stage laryngeal cancer and hypopharyngeal cancer undergo partial laryngectomy, total laryngectomy, or tracheotomy every year. These patients are often able to speak and communicate normally before surgery, but may experience sudden loss of voice after surgery, leading to the following negative effects:

[0056] 1) Inability to communicate normally with caregivers: This prevents patients from expressing their wishes in a timely manner. Whether it is physiological needs or physical discomfort, failure to address these issues promptly may affect the patient's postoperative recovery.

[0057] 2) Inability to communicate normally with medical staff: The decline in the efficiency of medical staff is one problem; in addition, if the patient's wishes are not conveyed in a timely and accurate manner, it may cause misunderstandings, affect the doctor-patient relationship and even medical decisions.

[0058] 3) Inability to communicate normally with family members: Patients will feel the inconvenience of postoperative life to the greatest extent, which is difficult to accept psychologically and is not conducive to mental health and postoperative recovery.

[0059] Lip-reading recognition, also known as lip reading (LR), falls under the category of Visual Speech Recognition (VSR). It is a technology that uses visual information to infer and understand a speaker's language or pronunciation. As an emerging topic at the intersection of computer vision and natural language processing, lip reading has shown broad prospects in speech recognition, human-computer interaction, and public safety in recent years, thanks to the development of deep learning. In the healthcare field, lip reading systems are already being used for assistive communication for the hearing impaired.

[0060] Currently, the application of lip-reading technology in medical settings still faces many challenges, specifically in the following aspects:

[0061] 1) Lack of dialect-specific corpora. Most existing lip-reading technologies rely on standardized open-source corpora, which are mostly concentrated in standard Mandarin or English, resulting in a lack of adaptability to dialect features and poor recognition performance.

[0062] 2) Lack of corpora designed for specific scenarios. For example, hospital scenarios often involve a large number of medical terms, and existing models lack such personalized training data, resulting in limited recognition performance;

[0063] 3) Multimodal fusion problem. Lip reading recognition not only relies on visual information (lip shape, facial expressions, etc.), but also requires the support of audio information such as speech and intonation. Most existing lip reading models rely only on a single modality (such as video or speech), failing to effectively combine the multimodal features of video and audio data, which affects the accuracy and robustness of the model.

[0064] The above problems severely restrict the application of lip-reading technology in medical settings.

[0065] Based on this, this application proposes a model training method, lip-reading recognition method, device, electronic device, and medium. A multimodal corpus is constructed by combining the dialect characteristics of different regions (such as Cantonese, Hakka, and Minnan) with the otolaryngology medical scenario, with the aim of providing high-quality, diverse data support for lip-reading recognition technology applied to patients with postoperative aphonia. The corpus is used to train the lip-reading recognition model to perform lip-reading tasks, improve the efficiency and accuracy of lip-reading recognition, and solve the problem of communication difficulties for patients with aphonia.

[0066] Referring to Figure 1, Figure 1 is an optional flowchart of a model training method provided in an embodiment of this application. The method may include, but is not limited to, steps S101 to S104:

[0067] Step S101: Obtain a sample dataset from a multimodal corpus; the sample dataset includes multiple items of patient audio-visual data with text tags;

[0068] Step S102: Divide the sample dataset to obtain the training set and the test set;

[0069] Step S103: Construct a lip-reading recognition model and train the lip-reading recognition model using the training set;

[0070] Step S104: Use the test set to evaluate the trained lip-reading model and determine whether to continue training the lip-reading model.

[0071] In some embodiments, step S101 may include, but is not limited to, steps S201 to S203:

[0072] Step S201: Construct a multimodal corpus;

[0073] Step S202: Collect pre- and post-operative audio-visual data for multiple medical surgery patients and perform data preprocessing on the pre- and post-operative audio-visual data; the pre- and post-operative audio-visual data includes pre-operative audio-visual data and post-operative audio-visual data; medical surgery patients include patients undergoing partial laryngectomy, total laryngectomy, or tracheotomy.

[0074] Step S203: Perform text recognition and annotation on the preprocessed pre- and post-operative audio-visual data to obtain the text tags corresponding to each pre- and post-operative audio-visual data, and store the pre- and post-operative audio-visual data containing the text tags into a multimodal corpus.

[0075] In some embodiments, the pre- and post-operative audio-visual data includes audio-visual data recorded by the patient before surgery (the pre-operative audio-visual data) and audio-visual data or video-only (lip shape) data from each recovery stage after surgery (the post-operative audio-visual data). For example, pre-operative audio-visual data can be used for personalized small-sample training of the lip-reading recognition model, so that the model can be adjusted quickly when it is later trained on post-operative audio-visual data, thereby improving the accuracy and robustness of lip-reading recognition.

[0076] In some embodiments, a multimodal corpus is constructed, which combines the dialect characteristics of different regions (such as Cantonese, Hakka, and Minnan) with the otolaryngology medical scenario. After collecting pre- and post-operative audio-visual data, it is also stored in the multimodal corpus. Then, the pre- and post-operative audio-visual data that has not been text-annotated is obtained from the multimodal corpus and text-annotated.

[0077] Specifically, pre- and post-operative audio-visual data may include, but are not limited to:

[0078] 1) Preoperative and postoperative audio-visual data: During the patient's hospitalization, audio and video data are collected before and after surgery, including conversations between the patient and medical staff, family members, and caregivers. Recording requires prior consent from both parties, and the recording process must comply with ethical approval requirements. Note that audio and video data with unclear lip movements or pronunciation must not be excluded: postoperative aphonia patients may show some deformation and weakening of facial movements due to changes in laryngeal anatomy and wound swelling, and this data is needed in training so that the model can learn and adapt to lip-reading recognition in such situations.

[0079] 2) Postoperative outpatient follow-up data: After patients are discharged from the hospital, outpatient follow-up data can be collected to obtain feedback on the long-term recovery of patients after surgery;

[0080] 3) Online health consultation data: For internet-based healthcare, patients can record and upload videos;

[0081] 4) Postoperative simulation dialogue data: Design common problems that postoperative patients may encounter, and organize medical experts, doctors and volunteers to simulate different postoperative dialogue scenarios, such as postoperative consultation, health guidance, and discussion of subsequent treatment plans.

[0082] Meanwhile, the multimodal corpus also stores public resources and open datasets, including some public datasets related to medical care and Cantonese, which can be used as a supplement to the preliminary corpus to guide subsequent speech recognition models in recognizing dialect types in pre- and post-operative audio-visual data.

[0083] In some embodiments, optionally, how soon a surgical patient can speak again varies from person to person and depends on the patient's recovery status. Once the tracheostomy tube has been successfully closed (blocked), the patient can begin to attempt to speak. Based on this, the collected postoperative audio-visual data includes:

[0084] 1) Video data before tube closure after surgery: Patients cannot speak before tube closure after surgery, therefore, video data of patients during this recovery period was collected;

[0085] 2) Audio-visual data of successful tube closure after surgery: After successful tube closure, patients can try to speak. Therefore, audio-visual data of patients during this recovery period are collected, including video data and audio data.

[0086] The lip-reading recognition model described above can handle missing modalities in the dataset. Therefore, the absence of audio in the video data collected before postoperative tube closure does not interfere with the construction of the multimodal corpus or the training of the model.
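The application does not describe how missing modalities are handled inside the model. As a minimal sketch of the data side only, each sample can carry an explicit flag that states whether audio is present, so that video-only recordings made before tube closure sit in the same corpus as full audio-visual recordings. The class name, field names, and file names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class AudioVisualSample:
    """One corpus entry; audio is optional (hypothetical structure)."""
    video_path: str
    audio_path: Optional[str] = None  # None for video-only recordings

    @property
    def has_audio(self) -> bool:
        return self.audio_path is not None


def modality_mask(samples: Sequence[AudioVisualSample]) -> List[Tuple[bool, bool]]:
    """Return one (video_present, audio_present) pair per sample."""
    return [(True, sample.has_audio) for sample in samples]


# Hypothetical file names, for illustration only.
corpus = [
    AudioVisualSample("before_closure.mp4"),                     # video only
    AudioVisualSample("after_closure.mp4", "after_closure.wav"),  # video + audio
]
print(modality_mask(corpus))  # [(True, False), (True, True)]
```

A training loop could use such a mask to skip or zero out the audio branch for video-only samples; whether the described model does this is not stated in the source.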

[0087] In some embodiments, step S203 may include, but is not limited to, steps S301 to S307:

[0088] Step S301: Use a speech recognition model to perform text recognition on the audio and video data before and after surgery, and determine multiple text annotation contents and the video timestamp and dialect type corresponding to each text annotation content;

[0089] Step S302: Use the sentiment computing model to perform sentiment recognition on each text annotation content and determine the sentiment feature label corresponding to each text annotation content;

[0090] Step S303: Use the medical entity annotation model to perform medical entity annotation on each text annotation content, and determine the medical entity annotation corresponding to each text annotation content;

[0091] Step S304: Divide the pre- and post-operative audio-visual data into video frames to obtain multiple video frames and the timestamps corresponding to each video frame;

[0092] Step S305: Determine the text description label corresponding to each text annotation content based on the dialect type, sentiment feature label, and medical entity annotation corresponding to each text annotation content;

[0093] Step S306: Based on the video timestamp corresponding to each text annotation content and the timestamp corresponding to each video frame, pair the text annotation content with each video frame to generate video frame text pairing tags corresponding to each text annotation content.

[0094] Step S307: Determine the text tags based on the video frame text pairing tags corresponding to each text annotation content.
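To make the outputs of steps S301 to S305 concrete, the following minimal sketch models one text annotation content as a record and merges its dialect type, sentiment feature label, and medical entity annotation into a single text description label. The class name, field names, and example values are hypothetical; the application does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextAnnotation:
    """One transcribed utterance plus the labels from steps S301 to S303."""
    text: str
    start_s: float   # S301: video timestamp where the utterance starts
    end_s: float     # S301: video timestamp where the utterance ends
    dialect: str     # S301: dialect type
    sentiment: str   # S302: sentiment feature label
    medical_entities: List[str] = field(default_factory=list)  # S303

    def description_label(self) -> Dict[str, object]:
        """S305: merge the three labels into one text description label."""
        return {
            "dialect": self.dialect,
            "sentiment": self.sentiment,
            "medical_entities": sorted(self.medical_entities),
        }


# Hypothetical example values, not taken from the corpus.
annotation = TextAnnotation(
    text="My throat hurts after the laryngoscopy",
    start_s=3.2,
    end_s=5.0,
    dialect="Cantonese",
    sentiment="anxiety",
    medical_entities=["pain", "laryngoscopy"],
)
print(annotation.description_label())
```

Keeping the timestamps on the same record as the labels is what later allows the annotation to be paired with video frames in step S306.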

[0095] In some embodiments, data preprocessing of pre- and post-operative audio-visual data may include, but is not limited to, preliminary preprocessing and data augmentation. Preliminary preprocessing may include, but is not limited to, video editing, audio noise reduction, video frame extraction, and resampling to improve data quality and consistency. A deep learning-based pre-trained model is introduced to enhance and repair the image quality of pre- and post-operative audio-visual data, and to optimize segments with blurry image quality or insufficient lighting.

[0096] Data augmentation is then performed on the pre- and post-operative audio-visual data after preliminary preprocessing, including operations such as rotation, magnification (scaling up), and reduction (scaling down), to improve the comprehensiveness of the data and enhance the robustness of subsequent models.
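The augmentation operations named above (rotation, magnification, and reduction) can be sketched on a single frame represented as a nested list of pixel values. This is a simplified illustration using only the standard library: it assumes 90-degree rotation and nearest-neighbour scaling, whereas a real pipeline would more likely use an image library and small rotation angles. The function names are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of grayscale pixel values (illustrative)


def rotate90(frame: Frame) -> Frame:
    """Rotate a frame 90 degrees clockwise."""
    return [list(row) for row in zip(*frame[::-1])]


def scale_nearest(frame: Frame, factor: float) -> Frame:
    """Magnify (factor > 1) or reduce (factor < 1) with nearest-neighbour sampling."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    height, width = len(frame), len(frame[0])
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return [
        [
            frame[min(height - 1, int(r / factor))][min(width - 1, int(c / factor))]
            for c in range(new_w)
        ]
        for r in range(new_h)
    ]


frame = [[1, 2],
         [3, 4]]
print(rotate90(frame))            # [[3, 1], [4, 2]]
print(scale_nearest(frame, 2.0))  # each pixel becomes a 2x2 block
```

Applying the same transform to every frame of a clip keeps the lip motion temporally consistent, which matters more for lip reading than for single-image tasks.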

[0097] In some embodiments, a speech recognition model is used to transcribe the audio in the augmented pre- and post-operative audio-visual data. Specifically, the speech recognition model (such as Whisper or wav2vec 2.0) is pre-trained using the multimodal corpus. The pre-trained model transcribes the audio signal into text, determines multiple text annotation contents and their corresponding video timestamps, and annotates each text annotation content with dialect type and accent features to determine its dialect type. Manual proofreading is then performed to improve accuracy. Unclear speech in post-operative patient audio is processed with a dedicated audio enhancement system, such as a speech enhancement system based on generative adversarial networks.

[0098] Based on multiple preset dialect types (such as Cantonese, Hakka, and Minnan), dialect recognition is performed on the pre- and post-operative audio-visual data. The dialect type of each recording is determined and labeled, which provides new language-environment data for lip-reading technology and addresses the insufficient dialect adaptability of traditional lip-reading technology.

[0099] In some embodiments, medical entity annotation models (such as MedGPT or BioBERT) are used to annotate the text content obtained after transcribing audio data with medical entities. This includes information such as disease names (e.g., laryngeal cancer, hypopharyngeal cancer), drug names (e.g., anti-inflammatory drugs, analgesics, chemotherapy drugs, targeted drugs), medical procedures (e.g., surgery, laryngoscopy, pathology, follow-up visits), and symptom descriptions (e.g., pain, bleeding, nausea). The medical entity information contained in each text annotation content is determined and annotated to generate corresponding medical entity annotations. At the same time, manual proofreading is combined to improve the effectiveness of medical entity annotations, so that the lip reading recognition model can accurately capture relevant medical terms during subsequent training.

[0100] In some embodiments, a sentiment computing model adapted to medical scenarios (such as a Transformer-based affective analysis framework) is specifically trained and fine-tuned. The sentiment computing model performs sentiment recognition on each text annotation content, annotating both emotion (anxiety, concern, etc.) and tone (affirmation, question, request, etc.), in order to capture the patient's emotional features in different contexts and generate the sentiment feature label corresponding to each text annotation content.

[0101] In some embodiments, step S306 may include, but is not limited to, steps S401 to S402:

[0102] Step S401: Based on the video timestamps corresponding to each text annotation and the timestamps corresponding to each video frame, determine several target video frames corresponding to each text annotation from multiple video frames.

[0103] Step S402: Associate and bind the text description tags corresponding to each text annotation content with each target video frame to generate video frame text pairing tags corresponding to each text annotation content.

[0104] In some embodiments, video frames are aligned with their timestamps and corresponding text annotations to obtain corresponding video-audio-text tag pairs, namely the video frame text pairing tags mentioned above.

[0105] For example, assuming the video timestamps of the text annotation content are T1 to T2, based on the timestamps of each video frame, video frames X to Y are determined to be the multiple target video frames corresponding to the text annotation content. The text description tags of the text annotation content are associated and bound with video frames X to Y to generate video frame text pairing tags. Based on the video frame text pairing tags, the text annotation content containing the text description tags is inserted into the video segments corresponding to video frames X to Y. Here, the timestamp of video frame X is T1, the timestamp of video frame Y is T2, and there are multiple video frames between video frame X and video frame Y.
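The T1/T2 example can be sketched in a few lines of Python. This is a minimal illustration, assuming a constant frame rate and sorted frame timestamps. The names `TextAnnotation`, `frame_timestamps`, `target_frames`, and `pair_annotations`, and the dictionary layout of a pairing tag, are hypothetical and not taken from the application.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextAnnotation:
    text: str
    start: float            # T1 in seconds
    end: float              # T2 in seconds
    tags: Dict[str, str]    # text description tags (dialect, emotion, ...)


def frame_timestamps(num_frames: int, fps: float) -> List[float]:
    """Timestamps of frames 0..num_frames-1 at a constant frame rate (assumed)."""
    return [i / fps for i in range(num_frames)]


def target_frames(ann: TextAnnotation, timestamps: List[float]) -> List[int]:
    """Indices of frames whose timestamp lies in [T1, T2] (step S401)."""
    lo = bisect_left(timestamps, ann.start)
    hi = bisect_right(timestamps, ann.end)
    return list(range(lo, hi))


def pair_annotations(anns: List[TextAnnotation], timestamps: List[float]) -> List[dict]:
    """Bind each annotation's tags to its target frames (step S402)."""
    pairs = []
    for ann in anns:
        idx = target_frames(ann, timestamps)
        if not idx:
            continue  # no frame falls inside this annotation's time span
        pairs.append({
            "text": ann.text,
            "tags": ann.tags,
            "first_frame": idx[0],   # frame X
            "last_frame": idx[-1],   # frame Y
        })
    return pairs


ts = frame_timestamps(num_frames=100, fps=25.0)
ann = TextAnnotation("hello", start=1.0, end=2.0, tags={"dialect": "Cantonese"})
print(pair_annotations([ann], ts))
```

Binary search keeps the lookup logarithmic per annotation, which matters little for one clip but helps when an entire corpus is aligned.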

[0106] In some embodiments, the text tags of the video frames corresponding to each text annotation content constitute the text tags of the pre- and post-operative audio-visual data.

[0107] In some embodiments, the sample dataset is divided into a training set, a validation set, and a test set. The lip-reading recognition model is trained using the training set, validated using the validation set, and evaluated using the test set. The lip-reading recognition model that passes the evaluation is then used for subsequent lip-reading recognition of the audio-visual data to be recognized.
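A minimal sketch of the three-way split follows. The 80/10/10 ratio, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the application does not state how the split is made.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    train: float = 0.8,   # assumed ratio, not from the source
    val: float = 0.1,     # assumed ratio, not from the source
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into training, validation, and test sets."""
    if not (0 < train < 1 and 0 <= val < 1 and train + val < 1):
        raise ValueError("ratios must leave a non-empty share for the test set")
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

With recordings from the same patient in more than one subset, test scores may overstate how well the model generalizes, so one might instead shuffle and split by patient identifier; that choice is likewise not specified in the source.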

[0108] Lip reading recognition models may include, but are not limited to, deep neural network (DNN) models, end-to-end models, encoder-decoder models, and 3DCNN audiovisual synchronization auxiliary models.

[0109] In some embodiments, the pre- and post-operative audio-visual data containing text tags can be converted into a corresponding preset format and stored in the multimodal corpus. A sample dataset is obtained from the multimodal corpus, and the sample dataset includes patient audio-visual data with multiple text tags.

[0110] In some embodiments, step S104 may include, but is not limited to, steps S501 to S505:

[0111] Step S501: Obtain multiple model evaluation metrics and the corresponding model evaluation metric thresholds for each model evaluation metric;

[0112] Step S502: Input the test set into the lip reading recognition model and use the lip reading recognition model to output the corresponding lip reading recognition results;

[0113] Step S503: Based on each model evaluation metric and the lip reading recognition results, evaluate the lip reading recognition model and determine the current model evaluation value corresponding to each model evaluation metric.

[0114] Step S504: Determine the model evaluation result based on the model evaluation metric threshold and the current model evaluation value corresponding to each model evaluation metric, and determine whether to continue training the lip reading recognition model based on the model evaluation result.

[0115] Step S505: When the model evaluation result shows that the current model evaluation value corresponding to each model evaluation metric exceeds that metric's threshold, stop training the lip reading recognition model; otherwise, continue training the lip reading recognition model.

[0116] In some embodiments, the model evaluation metric can be precision, recall, F1 score or AUC score, etc., and each model evaluation metric has a corresponding model evaluation metric threshold.
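The stopping rule of steps S501 to S505 can be sketched as follows. The binary labels are a simplifying assumption, since the application does not state how precision or recall is computed for text output. The function names and the threshold values are hypothetical.

```python
from typing import Dict, Sequence


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def should_stop_training(current: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Stop only when every metric strictly exceeds its threshold (step S505)."""
    return all(current[name] > limit for name, limit in thresholds.items())


# Hypothetical test-set labels and thresholds.
current = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
thresholds = {"precision": 0.6, "recall": 0.6, "f1": 0.6}
print(current, should_stop_training(current, thresholds))
```

Requiring all metrics to pass, rather than any one, follows the wording of step S505: a single metric at or below its threshold keeps training going.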

[0117] Referring to Figure 2, Figure 2 is an optional flowchart of a lip-reading recognition method provided in an embodiment of this application. The method may include, but is not limited to, steps S1 to S2:

[0118] Step S1: Obtain the audio-visual data to be identified for the medical surgery patient and perform data preprocessing on the audio-visual data to be identified;

[0119] Step S2: Use the lip reading recognition model to perform lip reading recognition on the preprocessed audio-visual data to be recognized, and output the lip reading recognition result; the lip reading recognition model is trained using the model training method described above.

[0120] In some embodiments, data preprocessing is performed on the audio-visual data to be identified, including video editing, audio noise reduction, video frame extraction, resampling, and data augmentation, to improve the quality of the audio-visual data to be identified and to improve the accuracy and efficiency of subsequent lip reading recognition of the audio-visual data to be identified.

[0121] In some embodiments, step S2 may include, but is not limited to, steps S21 to S22:

[0122] Step S21: Use the lip reading recognition model to divide the preprocessed audio-visual data to be recognized into video frames and perform lip reading recognition, and output multiple pieces of text recognition information corresponding to the audio-visual data to be recognized and the multiple video frames to be paired corresponding to each piece of text recognition information.

[0123] Step S22: Use the lip reading recognition model to pair each text recognition information with multiple corresponding video frames to be paired, determine the text pairing result of each text recognition information, insert each text recognition information into the audio-visual data to be recognized according to the text pairing result of each video frame, and output the corresponding lip reading recognition audio-visual data.

[0124] In some embodiments, a lip-reading model is used to pair each text recognition information with multiple corresponding video frames to be paired, multiple target paired video frames are determined, the text recognition information is inserted into the video segments corresponding to the multiple target paired video frames, and the corresponding lip-reading audio-visual data is output to realize lip-reading recognition.
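One way to picture inserting recognized text into the matching video segments is to emit subtitle cues. The sketch below writes SubRip (SRT) cues from frame ranges. The choice of SRT, the constant frame rate, and the names `srt_time` and `to_srt` are assumptions; the application does not say how the text is embedded.

```python
from typing import List, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(results: List[Tuple[str, int, int]], fps: float) -> str:
    """Build SRT text from (text, first_frame, last_frame) tuples.

    Frame indices are inclusive, so a cue ends when its last frame
    stops being displayed.
    """
    cues = []
    for n, (text, first, last) in enumerate(results, start=1):
        start = first / fps
        end = (last + 1) / fps
        cues.append(f"{n}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Hypothetical recognition result: one phrase spanning frames 25 to 49.
print(to_srt([("I need water", 25, 49)], fps=25.0))
```

A sidecar subtitle file leaves the original recording untouched; burning the text into the frames would be an alternative reading of the same step.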

[0125] Referring to Figure 3, Figure 3 is an optional structural diagram of a lip-reading recognition device provided in an embodiment of this application. The device is used to implement the aforementioned lip-reading recognition method and may include:

[0126] The first module is used to acquire the audio-visual data to be identified for medical surgical patients and to perform data preprocessing on the audio-visual data to be identified.

[0127] The second module is used to perform lip reading on the preprocessed audio-visual data to be identified using a lip reading recognition model and output the lip reading recognition results; the lip reading recognition model is trained using the model training method described above.

[0128] It is understood that the content of the above method embodiments is applicable to the present device embodiments. The specific functions implemented by the present device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0129] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described model training method or lip-reading recognition method. This electronic device can be any smart terminal, including tablet computers.

[0130] It is understood that the content of the above method embodiments is applicable to this device embodiment. The specific functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0131] Referring to Figure 4, Figure 4 illustrates the hardware structure of an electronic device according to another embodiment. The electronic device includes:

[0132] The processor 901 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.

[0133] The memory 902 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 902 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and is called and executed by the processor 901 to execute the model training method or lip reading recognition method of the embodiments of this application.

[0134] The input / output interface 903 is used to implement information input and output;

[0135] The communication interface 904 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).

[0136] Bus 905 transmits information between various components of the device (e.g., processor 901, memory 902, input / output interface 903, and communication interface 904);

[0137] The processor 901, memory 902, input / output interface 903, and communication interface 904 are connected to each other within the device via bus 905.

[0138] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described model training method or lip reading recognition method.

[0139] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.

[0140] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0141] The embodiments of this application provide a model training method, lip reading recognition method, device, electronic device, and medium that can automatically recognize the lip reading of a patient's audio-visual recordings, improve the efficiency and accuracy of lip reading recognition, solve the problem of communication difficulties for patients in a voiceless state, and improve the convenience of patients' lives after surgery.

[0142] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0143] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.

[0144] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0145] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.

[0146] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0147] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes the relationship between related objects and indicates that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the related objects before and after it are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0148] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0149] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0150] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0151] It should be recognized that embodiments of the present invention can be implemented or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable storage medium. The method can be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with a computer program, wherein the storage medium is configured such that the computer operates in a specific and predefined manner—according to the methods and drawings described in the specific embodiments. Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with the computer system. However, if desired, the program can be implemented in assembly or machine language. In any case, the language can be a compiled or interpreted language. Furthermore, for this purpose, the program can run on a programmed application-specific integrated circuit (ASIC).

[0152] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0153] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.

Claims

1. A model training method, characterized in that the method includes the following steps: A sample dataset is obtained from a multimodal corpus; the sample dataset includes multiple pieces of patient audio-visual data with text tags; The sample dataset is divided into a training set and a test set; A lip-reading recognition model is constructed; after the lip-reading recognition model is trained with pre-operative audio-visual data, the lip-reading recognition model is trained with post-operative audio-visual data; The trained lip-reading recognition model is evaluated using the test set to determine whether to continue training the lip-reading recognition model; The process of obtaining the sample dataset from the multimodal corpus specifically includes: The multimodal corpus is constructed; Pre- and post-operative audio-visual data are collected from multiple medical surgical patients, and data preprocessing is performed on the data; the pre- and post-operative audio-visual data include pre-operative audio-visual data, post-operative audio-visual data, post-operative outpatient follow-up data, online health consultation data, and post-operative simulated dialogue data; the post-operative audio-visual data include video data before tube closure and audio-visual data showing successful tube closure; the medical surgical patients include those who have undergone partial laryngectomy, total laryngectomy, or tracheotomy; Text recognition and annotation are performed on the preprocessed pre- and post-operative audio-visual data to obtain the text tags corresponding to each piece of pre- and post-operative audio-visual data, and the pre- and post-operative audio-visual data containing the text tags are stored in the multimodal corpus.

2. The model training method according to claim 1, characterized in that the step of performing text recognition and annotation on the preprocessed pre- and post-operative audio-visual data to obtain text tags corresponding to each piece of pre- and post-operative audio-visual data specifically includes: The pre- and post-operative audio-visual data are analyzed using a speech recognition model to identify multiple text annotation contents and the video timestamp and dialect type corresponding to each text annotation content; Sentiment recognition is performed on each text annotation content using a sentiment computing model to determine the sentiment feature label corresponding to each text annotation content; A medical entity annotation model is used to perform medical entity annotation on each text annotation content to determine the medical entity annotation corresponding to each text annotation content; The pre- and post-operative audio-visual data are divided into video frames to obtain multiple video frames and the timestamp corresponding to each video frame; Based on the dialect type, the sentiment feature label, and the medical entity annotation corresponding to each text annotation content, the text description tag corresponding to each text annotation content is determined; Based on the video timestamp corresponding to each text annotation content and the timestamp corresponding to each video frame, the text annotation contents and the video frames are paired to generate the video frame text pairing tag corresponding to each text annotation content; The text tags are determined based on the video frame text pairing tags corresponding to the text annotation contents.

3. The model training method according to claim 2, characterized in that, The step of pairing text annotations with video frames based on the video timestamps corresponding to each text annotation and the timestamps corresponding to each video frame, and generating video frame text pairing tags corresponding to each text annotation, specifically includes: Based on the video timestamp corresponding to each of the text annotation contents and the timestamp corresponding to each of the video frames, determine a number of target video frames corresponding to each of the text annotation contents from a plurality of video frames; The text description tags corresponding to each of the text annotation contents are associated and bound with each of the target video frames to generate text pairing tags for each of the text annotation contents.

4. The model training method according to claim 1, characterized in that the step of evaluating the trained lip-reading recognition model using the test set to determine whether to continue training the lip-reading recognition model specifically includes: Multiple model evaluation metrics and the model evaluation metric threshold corresponding to each model evaluation metric are obtained; The test set is input into the lip-reading recognition model, and the lip-reading recognition model is used to output the corresponding lip-reading recognition results; Based on each model evaluation metric and the lip-reading recognition results, the lip-reading recognition model is evaluated to determine the current model evaluation value corresponding to each model evaluation metric; Based on the model evaluation metric threshold and the current model evaluation value corresponding to each model evaluation metric, the model evaluation result is determined, and based on the model evaluation result, it is determined whether to continue training the lip-reading recognition model; When the model evaluation result is that the current model evaluation value corresponding to each model evaluation metric exceeds that metric's threshold, the training of the lip-reading recognition model is stopped; otherwise, the training of the lip-reading recognition model continues.

5. A lip-reading recognition method, characterized in that, The method includes the following steps: Obtain the audio-visual data to be identified corresponding to the medical surgery patient, and perform data preprocessing on the audio-visual data to be identified; The lip-reading recognition model is used to perform lip-reading recognition on the preprocessed audio-visual data to be recognized, and the lip-reading recognition result is output; the lip-reading recognition model is trained by the model training method described in any one of claims 1 to 4.

6. The lip-reading recognition method according to claim 5, characterized in that, The process of using a lip-reading model to perform lip-reading on preprocessed audio-visual data and outputting lip-reading results specifically includes: The lip-reading recognition model is used to perform video frame segmentation and lip-reading recognition on the preprocessed audio-visual data to be identified, and outputs multiple text recognition information corresponding to the audio-visual data to be identified and multiple video frames to be paired corresponding to each text recognition information. The lip-reading recognition model is used to pair each text recognition information with the corresponding multiple video frames to be paired, and the text matching result of each text recognition information is determined. Based on the text matching result of each video frame, each text recognition information is inserted into the audio-visual data to be recognized, and the corresponding lip-reading recognition audio-visual data is output.

7. A lip-reading recognition device, characterized in that, The device includes: The first module is used to acquire the audio-visual data to be identified corresponding to the medical surgery patient and to perform data preprocessing on the audio-visual data to be identified. The second module is used to perform lip reading on the preprocessed audio-visual data to be identified using a lip reading recognition model and output the lip reading recognition result; the lip reading recognition model is trained by the model training method described in any one of claims 1 to 4.

8. An electronic device, characterized in that, The electronic device includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the model training method according to any one of claims 1 to 4 or the lip reading recognition method according to any one of claims 5 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the model training method according to any one of claims 1 to 4 or the lip reading recognition method according to any one of claims 5 to 6.