A medical speech recognition-based gastroscopy report generation system and method
Adaptive noise reduction and mask compensation are combined with the state of the endoscopic examination process to dynamically adjust noise processing and voice activity detection. This addresses noise interference and process mismatch in gastrointestinal endoscopy report generation and yields accurate, structured reports.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SHANGHAI HAOKANGYUN MEDICAL TECHNOLOGY DEVELOPMENT CO LTD
- Filing Date: 2026-03-18
- Publication Date: 2026-06-19
Description
Technical Field
[0001] This invention relates to the field of speech recognition technology, and more specifically, to a system and method for generating gastrointestinal endoscopy reports based on medical speech recognition.
Background Technology
[0002] Gastrointestinal endoscopy is an important diagnostic tool for digestive system diseases. During the examination, doctors typically record the examined sites, lesion morphology, and key procedures by describing them verbally in real time. After the examination, these descriptions are combined with the endoscopic images to generate a gastrointestinal endoscopy report. With the increase in examination volume and the growing demand for information technology, using voice recognition to assist in generating gastrointestinal endoscopy reports has become one of the current directions of technical development. Existing systems and methods for generating these reports have the main problems described below.
[0003] However, existing systems and methods for voice recording and report generation during gastroscopy and colonoscopy examinations still have many shortcomings in practical applications. First, these examinations are typically performed in an endoscopy room where multiple devices, such as suction devices, flushing pumps, monitoring equipment, and electrosurgical equipment, operate simultaneously. The background noise sources are complex, the noise types are diverse, and the noise varies significantly with the stage of the procedure. Existing voice processing methods often employ fixed noise models or single-threshold noise reduction strategies, making it difficult to distinguish noise from different devices and process it adaptively. As a result, noise reduction can easily weaken the doctor's effective speech, or leave a significant amount of residual noise when suppression is insufficient, thereby reducing the accuracy and stability of subsequent voice recognition.
[0004] Meanwhile, during gastroscopy and colonoscopy, doctors typically wear masks, which cause significant frequency-selective attenuation of the speech signal, especially affecting high-frequency components. Current technologies largely fail to model and compensate for the impact of mask occlusion on the speech spectrum, leading to decreased clarity of the acquired speech signal and further complicating medical speech recognition.
[0005] Furthermore, existing speech segmentation and speech activity detection methods typically rely on a single speech energy threshold or simple statistical features. In the presence of transient noise from the equipment, natural pauses by the doctor, or non-speech sounds, they struggle to consistently and accurately distinguish between valid spoken speech and invalid segments, easily leading to inaccurate start and end boundaries for speech segments. This not only affects the continuity of speech recognition results but also the temporal alignment accuracy between speech and endoscopic images, thereby reducing the structured nature and usability of the generated examination reports.
[0006] Existing technologies typically determine the presence of speech activity independently based on speech energy, spectral characteristics, or statistical models, without fully considering the objective differences in the probability of physician speech at different endoscopic stages during gastroscopy and colonoscopy. During the endoscopic advancement or retraction phases, equipment noise is high and physician utterances are infrequent, easily leading to false positives. Conversely, during critical stages such as observation or manipulation, physician utterances are dense, and fixed-threshold speech activity detection methods may miss detections. Current solutions often directly use speech activity detection results for subsequent speech recognition processing, lacking a mechanism to verify the logical consistency between speech segments and the endoscopic examination procedure. This results in a mismatch between the generated speech event sequence and the actual examination process, affecting the completeness and reliability of the examination record.
[0007] In view of this, the present invention proposes a system and method for generating gastrointestinal endoscopy reports based on medical voice recognition to solve the above problems.
Summary of the Invention
[0008] To overcome the aforementioned deficiencies of the prior art and to achieve the above objectives, the present invention provides the following technical solution: a gastrointestinal endoscopy report generation system based on medical voice recognition, comprising:
The voice acquisition and processing module, which acquires the doctor's real-time spoken voice during the gastrointestinal endoscopy examination, performs adaptive noise reduction and mask occlusion compensation on the real-time spoken voice, and outputs a sequence of voice segments.
The voice activity detection module, which receives the voice segment sequence, calculates the prior probability of voice activity by combining the transition relationships between the different operation states of the endoscopy examination process, and performs voice activity detection on the voice segment sequence based on that prior probability, so as to filter out a voice event sequence that is consistent with the examination process.
The speech-image alignment module, which performs medical speech transcription on the speech event sequence and acquires endoscopic images at the corresponding times. It uses the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generates speech description data.
The lesion video capture module, which automatically captures endoscopic images based on a preset knowledge base of gastrointestinal endoscopy standards, automatically captures lesion images when lesions are detected, and records the corresponding lesion video.
The report generation and interpretation module, which integrates the voice description data, lesion images, and lesion videos to automatically generate a gastrointestinal endoscopy report according to a preset report template. It also associates the report with the doctor's interpretation recording and generates a unique-identifier QR code.
[0009] Preferably, the method for collecting the doctor's real-time spoken voice includes: After detecting the start signal of the gastrointestinal endoscopy examination, the voice acquisition and processing module automatically turns on the voice acquisition function, and continuously acquires the real-time spoken voice produced by the doctor during the examination through the voice acquisition device worn by the doctor. During the voice acquisition process, no operational constraints are imposed on the doctor's dictation behavior, and the acquired real-time dictation is recorded and cached in real time in the form of raw voice signals; at the same time, the real-time dictation and the running time of the gastrointestinal endoscopy are time-stamped synchronously.
[0010] Preferably, the method for outputting a sequence of speech segments includes: A noise spectrum feature library of typical noise sources in the endoscopy room was pre-constructed based on the gastrointestinal endoscopy environment. Each noise source corresponds to the background noise characteristics generated by the suction device, flushing pump, monitoring equipment and electrosurgical equipment during the operation of the endoscopy process. In the real-time speech processing, the acquired real-time spoken speech is subjected to spectral analysis to obtain the spectral features of the speech signal; the similarity between the spectral features of the speech signal and various background noise features in the noise spectral feature library is calculated by cosine similarity to identify the type of background noise that has the greatest impact on the speech signal at the current moment. After identifying the background noise type with the greatest impact, the corresponding noise reduction parameters are selected based on the background noise type, and adaptive noise reduction processing associated with the background noise type is performed on the speech signal, thereby suppressing environmental noise while preserving the effective speech components in the real-time spoken speech. After completing the adaptive noise reduction process, a frequency-selective attenuation function caused by mask occlusion is constructed. The frequency-selective attenuation function divides the speech spectrum into low-frequency and high-frequency ranges according to a preset cutoff frequency, and uses different attenuation characteristics to characterize the effect of the mask on the speech signal. Based on the frequency-selective attenuation function, a corresponding spectrum compensation function is constructed, and regularization constraints are introduced in the compensation process. 
The speech signal after adaptive noise reduction is input into the spectrum compensation function to compensate for the frequency attenuation caused by mask occlusion, and the speech signal after mask occlusion compensation is obtained. Speech activity detection is performed on the speech signal after adaptive noise reduction and mask occlusion compensation. The speech signal is segmented according to the characteristics of speech energy change and temporal continuity. The speech signal is divided into different speech segments with start and end time markers, and then combined in chronological order to form a speech segment sequence.
[0011] Preferably, the method for calculating the prior probability of speech activity includes: The system receives audio segment sequences and defines a set of operational states of the endoscope during the examination process based on the standard operating procedures for gastrointestinal endoscopy. The set of operational states includes endoscope advancement, endoscope observation, endoscope manipulation, and endoscope retraction. Based on the set of operation states, a state transition probability matrix for endoscopic operation states is constructed. Using the state transition probability matrix, the probability of the endoscope being in each operation state at the current moment is estimated, and the statistical prior of the probability of the doctor's voice activity occurring in each operation state is obtained. Combining the statistical prior of the probability of the doctor's voice activity occurring in each operation state, the prior probability of the voice activity at the current moment is calculated.
[0012] Preferably, the method for filtering the voice event sequence includes: A state-modulated speech activity decision function is constructed, and the speech activity detection threshold of the speech segment sequence is dynamically modulated according to the prior probability of speech activity, so that the speech activity detection threshold is adaptively adjusted with the change of the endoscopic operation state. Based on the judgment result of the speech activity detection threshold, the speech segment sequence is screened and recombined. Speech segments that are inconsistent with the operational state logic of the endoscopy procedure are removed, and only speech segments that are consistent with the operational state of the endoscopy procedure in terms of time sequence and operation state are retained, forming a speech event sequence that matches the endoscopy procedure.
[0013] Preferably, the method for medical speech transcription of speech event sequences includes: The system receives a sequence of voice events, each corresponding to a voice signal with start and end time markers. According to the time order of the voice event sequence, the system extracts the voice signal segments corresponding to each voice event in sequence, and performs MFCC acoustic feature extraction processing on the voice signal segments to convert the time-domain voice signal into an acoustic feature sequence for speech recognition. The acoustic feature sequence is input into a speech recognition model pre-trained for medical scenarios. The speech recognition model decodes the acoustic feature sequence corresponding to each speech event and outputs a text transcription result that corresponds one-to-one with the speech event, forming different transcribed text fragments.
[0014] Preferably, the method for generating speech description data includes: Based on the start and end time markers corresponding to each voice event in the voice event sequence, endoscopic images corresponding to the time interval of the voice event are acquired from the endoscopic video stream generated during the gastrointestinal endoscopy examination, so that each voice event is associated with at least one frame of endoscopic image. Medical image analysis is performed on the acquired endoscopic images to extract visual feature information. Based on a pre-set knowledge base of gastrointestinal endoscopy standards, the examination site information and lesion feature information corresponding to the endoscopic images are identified to obtain image recognition results. Semantic analysis is performed on the image recognition results to identify vague verbal descriptions with unclear location references or incomplete lesion descriptions. The vague verbal descriptions are then associated with the visual feature information at the corresponding time. Using the examination site information and lesion feature information identified in the endoscopic images, the semantics of the vague verbal descriptions are completed to generate speech description data consistent with the actual situation of the endoscopic examination.
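The time-based association between voice events and endoscopic frames described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `frames_for_event`, the sorted list of frame timestamps, and the nearest-frame fallback (used so that every event is associated with at least one frame) are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def frames_for_event(frame_times, start, end):
    """Return indices of frames whose timestamps fall inside [start, end].

    frame_times must be a non-empty list sorted in ascending order
    (hypothetical layout). If no frame falls inside the interval, the
    single frame nearest to the interval midpoint is returned, so that
    every voice event is associated with at least one frame.
    """
    lo = bisect_left(frame_times, start)
    hi = bisect_right(frame_times, end)
    if lo < hi:
        return list(range(lo, hi))
    # Assumed fallback: pick the closer of the two neighbouring frames.
    mid = (start + end) / 2
    candidates = [i for i in (lo - 1, lo) if 0 <= i < len(frame_times)]
    return [min(candidates, key=lambda i: abs(frame_times[i] - mid))]

# Toy video stream sampled every 0.5 s and two voice events (seconds).
frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
inside = frames_for_event(frame_times, 0.4, 1.1)
between = frames_for_event(frame_times, 1.1, 1.2)
```

Binary search keeps the lookup cheap even for long video streams, since the frame timestamps are already ordered by acquisition time.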
[0015] Preferably, the method for automatically capturing an image of the lesion and recording a corresponding video of the lesion upon detection includes: Based on the examination site information and lesion feature information identified in the endoscopic images, and in accordance with the image retention rules set for different examination sites and lesion types in the pre-set gastrointestinal endoscopy standard knowledge base, the endoscopic images are automatically retained and managed. When the image recognition result meets the preset lesion determination conditions in the preset gastrointestinal endoscopy standard knowledge base, the lesion retention mechanism is automatically triggered. At least one frame of endoscopy image containing the lesion area is extracted from the endoscopy video stream and stored as the lesion image. At the same time, the lesion video recording process is triggered. Starting from the time point when the lesion is first identified, endoscopy video data of a preset duration is continuously collected to generate the corresponding lesion video.
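As a rough illustration of the lesion retention trigger described above, the following Python sketch locates the first frame flagged as a lesion and derives the recording window of preset duration that starts at that time point. The per-frame boolean flags stand in for the image recognition result, and the function name `lesion_clip` and the 10-second duration are hypothetical.

```python
def lesion_clip(detections, frame_times, duration_s):
    """Return (image_index, clip_start, clip_end) for the first detected lesion.

    detections: per-frame booleans standing in for the lesion determination
    result (hypothetical input); frame_times: matching timestamps in seconds.
    Returns None when no frame is flagged.
    """
    for i, hit in enumerate(detections):
        if hit:
            start = frame_times[i]
            # Recording starts when the lesion is first identified and
            # runs for the preset duration.
            return i, start, start + duration_s
    return None

clip = lesion_clip([False, False, True, True], [0.0, 0.5, 1.0, 1.5], duration_s=10.0)
```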
[0016] Preferably, the method for generating a unique identifier QR code includes: The voice description data, lesion images and lesion videos are integrated, with the voice event sequence as the time line. Based on the examination site information and lesion feature information in the voice description data, the lesion images and lesion videos stored in the same time interval are combined with the corresponding voice description data to form a multimodal examination record that includes examination site, lesion description and imaging evidence. According to the preset gastrointestinal endoscopy report template, the multimodal examination records are structured and arranged, and the voice description data is automatically filled into the corresponding examination process record and lesion description column in the gastrointestinal endoscopy report template. The corresponding lesion images or lesion video indexes are displayed in the corresponding positions to generate a gastrointestinal endoscopy report that conforms to the clinical gastrointestinal endoscopy examination standards. After generating the endoscopy report, the system receives a recorded audio interpretation of the examination results from the doctor and binds the audio interpretation with the corresponding endoscopy report for storage, thus creating an integrated examination file. A unique identifier is generated based on the endoscopy report and encoded as a QR code and embedded into the endoscopy report.
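The patent does not state how the unique identifier is derived from the report. One possible approach, shown here purely as an assumption, is to hash a canonical serialization of the report with SHA-256 and use the hex digest as the payload that a QR encoder would then render. The report fields below are invented for the example, and QR rendering itself would need a separate library that is not shown.

```python
import hashlib
import json

def report_identifier(report: dict) -> str:
    """Derive a deterministic identifier from report content (assumed scheme)."""
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(report, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical report content.
report = {"site": "gastric antrum", "finding": "erosion", "images": ["img_001"]}
identifier = report_identifier(report)
```

A content-derived identifier changes whenever the report changes, so a deployment that allows report amendments might prefer a separately issued identifier instead.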
[0017] A method for generating gastrointestinal endoscopy reports based on medical speech recognition includes:
S1. During the gastrointestinal endoscopy examination, collect the doctor's real-time spoken voice, perform adaptive noise reduction and mask occlusion compensation on the real-time spoken voice, and output a sequence of voice segments.
S2. Receive the speech segment sequence, combine the transition relationships between the different operation states of the endoscopy examination process to calculate the prior probability of speech activity, and perform speech activity detection on the speech segment sequence according to that prior probability, so as to filter out the speech event sequence consistent with the examination process.
S3. Perform medical speech transcription on the speech event sequence and collect endoscopic images at the corresponding times. Use the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generate speech description data.
S4. Based on the preset knowledge base of gastrointestinal endoscopy standards, automatically capture endoscopic images, automatically capture lesion images when lesions are detected, and record the corresponding lesion videos.
S5. Integrate the voice description data, lesion images, and lesion videos, automatically generate the gastrointestinal endoscopy report according to the preset report template, associate the report with the doctor's interpretation recording, and generate a unique-identifier QR code.
[0018] Compared with the prior art, the present invention has the following beneficial effects: By pre-constructing a typical noise spectrum feature library that matches the gastrointestinal endoscopy environment and introducing a noise type discrimination mechanism based on cosine similarity in real-time processing, dynamic identification of the dominant background noise type is achieved. This enables adaptive selection of noise reduction parameters according to the noise type, significantly improving the ability to preserve effective spoken speech in complex endoscopy environments. By constructing a frequency-selective attenuation function caused by mask occlusion and its regularized spectrum compensation model, targeted compensation is made for the non-uniform frequency attenuation caused by masks, effectively restoring the most critical high-frequency information in speech for recognition and improving the clarity and recognizability of medical speech in real examination scenarios. At the same time, by combining speech energy changes and temporal continuity features for speech activity detection, stable determination of the start and end positions of speech segments is achieved, avoiding missegmentation problems caused by transient noise and short pauses, forming a speech segment sequence consistent with the timeline of the examination process.
[0019] By incorporating the standard operating procedures of gastrointestinal endoscopy into the speech activity detection process, and through operational state modeling and prior probability calculation of speech activities, speech detection gains process awareness capabilities, significantly reducing false positives and false negatives caused by equipment noise and environmental interference. A dynamic speech activity decision mechanism based on state modulation enables the speech detection threshold to adaptively change with the endoscopic operation state, improving the robustness and stability of speech activity detection in complex medical acoustic environments. By performing process consistency screening and reorganization on speech segment sequences, the final speech event sequence effectively ensures a high degree of matching between the temporal structure and semantic flow and the gastrointestinal endoscopy process, providing a highly reliable speech input foundation for medical speech recognition and examination report generation.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of the structure of a gastrointestinal endoscopy report generation system based on medical voice recognition according to the present invention; Figure 2 is a schematic diagram of a method for generating gastrointestinal endoscopy reports based on medical voice recognition according to the present invention.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
Example
[0022] As shown in Figure 1, this embodiment provides a system for generating gastrointestinal endoscopy reports based on medical voice recognition, comprising the following modules:
The voice acquisition and processing module, which acquires the doctor's real-time spoken voice during the gastrointestinal endoscopy examination, performs adaptive noise reduction and mask occlusion compensation on the real-time spoken voice, and outputs a sequence of voice segments.
The voice activity detection module, which receives the voice segment sequence, calculates the prior probability of voice activity by combining the transition relationships between the different operation states of the endoscopy examination process, and performs voice activity detection on the voice segment sequence based on that prior probability, so as to filter out a voice event sequence that is consistent with the examination process.
The speech-image alignment module, which performs medical speech transcription on the speech event sequence and acquires endoscopic images at the corresponding times. It uses the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generates speech description data.
The lesion video capture module, which automatically captures endoscopic images based on a preset knowledge base of gastrointestinal endoscopy standards, automatically captures lesion images when lesions are detected, and records the corresponding lesion video.
The report generation and interpretation module, which integrates the voice description data, lesion images, and lesion videos to automatically generate a gastrointestinal endoscopy report according to a preset report template. It also associates the report with the doctor's interpretation recording and generates a unique-identifier QR code.
[0023] Methods for collecting doctors' real-time spoken audio include: After detecting the start signal of the gastrointestinal endoscopy examination, the voice acquisition and processing module automatically turns on the voice acquisition function, and continuously acquires the real-time spoken voice produced by the doctor during the examination through the voice acquisition device worn by the doctor. During the voice acquisition process, no operational constraints are imposed on the doctor's dictation behavior, and the acquired real-time dictation is recorded and cached in real time in the form of raw voice signals; at the same time, the real-time dictation and the running time of the gastrointestinal endoscopy are time-stamped synchronously.
[0024] Methods for outputting speech segment sequences include: A noise spectrum feature library of typical noise sources in the endoscopy room is pre-constructed based on the gastrointestinal endoscopy environment. Each noise source corresponds to the background noise characteristics generated by the suction device, flushing pump, monitoring equipment, or electrosurgical equipment while it operates during the endoscopy process. It should be noted that the construction process of the noise spectrum feature library includes: before system deployment, selecting an actual gastrointestinal endoscopy environment as the data acquisition scenario, and sampling and recording the background noise generated by the suction device, flushing pump, monitoring equipment, and electrosurgical equipment under different working conditions, either while the doctor is not narrating or while voice interference is deliberately avoided. The acquired noise signals are converted to a uniform format and divided into frames, and spectrum analysis is performed on each type of noise signal to extract its power distribution across the frequency dimension. Subsequently, the spectrum features obtained from the same noise source under different working conditions are statistically summarized and normalized to form a feature vector that stably characterizes the spectral shape of that noise source. The vectors are stored with the noise source type as the index, thereby constructing a noise spectrum feature library of typical endoscopy room noise sources that matches the gastrointestinal endoscopy environment.
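The library construction described above (statistical summary across working conditions, then normalization, indexed by noise source type) can be sketched in Python as follows. This is only an illustrative sketch: the patent does not specify the summary statistic or the normalization, so the per-bin mean and unit L2 norm used here are assumptions, as are the function name `build_noise_library`, the input layout, and the toy spectra.

```python
import math

def build_noise_library(recordings):
    """Build {noise_source: normalized feature vector}.

    recordings: {source_name: [power_spectrum, ...]} with one power spectrum
    per working condition, all of equal length (hypothetical layout).
    """
    library = {}
    for source, spectra in recordings.items():
        n_bins = len(spectra[0])
        # Assumed statistical summary: mean power per frequency bin.
        mean = [sum(s[i] for s in spectra) / len(spectra) for i in range(n_bins)]
        # Assumed normalization: unit L2 norm, so only spectral shape remains.
        norm = math.sqrt(sum(v * v for v in mean)) or 1.0
        library[source] = [v / norm for v in mean]
    return library

# Toy three-bin power spectra under two working conditions per device.
library = build_noise_library({
    "suction": [[4.0, 2.0, 0.0], [2.0, 2.0, 0.0]],
    "electrosurgical": [[0.0, 1.0, 3.0], [0.0, 1.0, 5.0]],
})
```

Unit-norm vectors are convenient here because a later cosine similarity comparison against the library then reduces to an inner product divided by the norm of the query vector.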
[0025] In real-time speech processing, spectral analysis is performed on the acquired real-time spoken speech to obtain the spectral features of the speech signal. Cosine similarity is then used to compare these spectral features with each background noise feature in the noise spectrum feature library, in order to identify the type of background noise that has the greatest impact on the speech signal at the current moment:
$$k^{*} = \arg\max_{k} \frac{N_k^{T} X}{\lVert N_k \rVert \, \lVert X \rVert}$$
where $k^{*}$ is the index of the noise type that best matches the current speech spectrum, as determined by cosine similarity; $k$ is the noise type index variable, used to distinguish the predefined noise categories; $N_k$ is the reference spectrum vector of the $k$-th noise type in the noise spectrum feature library, characterizing the frequency-domain statistical properties of noise from a specific device or environment; $X$ is the spectral feature vector of the real-time acquired speech signal within the current time frame, characterizing the overall spectral distribution of the current speech segment; and $T$ denotes vector transposition, which converts a spectral feature vector from column form to row form so that the inner product between spectral feature vectors can be computed.
After identifying the background noise type with the greatest impact, the corresponding noise reduction parameters are selected based on that type, and adaptive noise reduction associated with the background noise type is performed on the speech signal, thereby suppressing environmental noise while preserving the effective speech components in the real-time spoken speech. The adaptive noise reduction processing is:
$$\hat{S}(f) = \begin{cases} |Y(f)| - \alpha_{k^{*}} \hat{N}_{k^{*}}(f), & |Y(f)| - \alpha_{k^{*}} \hat{N}_{k^{*}}(f) > \delta \\ \beta \, |Y(f)|, & \text{otherwise} \end{cases}$$
where $\hat{S}(f)$ is the amplitude of the target speech spectrum after noise suppression; $|Y(f)|$ is the original spectral amplitude of the current speech signal at frequency $f$; $\alpha_{k^{*}}$ is the spectral subtraction intensity coefficient related to the noise type, used to adjust the noise suppression amplitude under different noise conditions; $\hat{N}_{k^{*}}(f)$ is the noise spectrum estimate corresponding to the current noise type $k^{*}$; $\delta$ is the threshold bias term for the lower limit of the spectral subtraction decision, used to avoid excessive spectral subtraction under low signal-to-noise ratio conditions; and $\beta$ is the residual speech preservation coefficient, which imposes a lower limit on speech energy when the spectral subtraction condition is not met, preventing the speech from disappearing completely.
After the adaptive noise reduction is complete, a frequency-selective attenuation function caused by mask occlusion is constructed, to address the attenuation that the speech signal undergoes in different frequency ranges because doctors wear masks during gastrointestinal endoscopy. The frequency-selective attenuation function divides the speech spectrum into a low-frequency range and a high-frequency range according to a preset cutoff frequency, and uses different attenuation characteristics in each range to characterize the impact of the mask on the speech signal. It is defined in terms of the following quantities: $H(f)$ is the frequency-selective attenuation function under mask occlusion, characterizing the attenuation ratio that the mask applies to the voice signal at different frequencies; $f$ is the frequency variable, describing the different frequency components of the speech signal in the frequency domain; $a_L$ is the low-frequency attenuation coefficient, controlling the degree to which the mask attenuates the energy of low-frequency speech components; $f_c$ is the preset cutoff frequency, the boundary between the low-frequency and high-frequency attenuation ranges; $a_H$ is the high-frequency attenuation coefficient, characterizing the maximum attenuation of high-frequency speech components by the mask; and $\sigma$ is the frequency spread parameter of the high-frequency attenuation distribution, controlling how broad the high-frequency attenuation curve is in the frequency domain.
Based on the frequency-selective attenuation function, a corresponding spectrum compensation function is constructed, and regularization constraints are introduced in the compensation process. The speech signal after adaptive noise reduction is input into the spectrum compensation function to compensate for the frequency attenuation caused by mask occlusion, yielding the speech signal after mask occlusion compensation. The spectrum compensation function is defined in terms of the following quantities: $G(f)$ is the spectral compensation function constructed for the frequency-selective attenuation of the mask, used to recover the energy of the occluded speech in the frequency domain; and $\lambda$ is the regularization parameter, used to prevent numerical divergence when $H(f)$ approaches zero and to limit the high-frequency compensation gain, ensuring the stability and feasibility of the compensation process.
Speech activity detection is performed on the speech signal after adaptive noise reduction and mask occlusion compensation. The speech signal is segmented according to the characteristics of speech energy change and temporal continuity. The speech signal is divided into different speech segments with start and end time markers, and then combined in chronological order to form a speech segment sequence.
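The noise matching, spectral subtraction, and mask compensation chain of this paragraph can be sketched in Python as below. Several parts are assumptions rather than the patent's own formulas: the high-band roll-off in `mask_attenuation` is an assumed Gaussian-shaped curve, the compensation gain `h / (h*h + lam)` is an assumed Tikhonov-style regularized inverse, and all function names, parameter names, and numeric values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dominant_noise(frame_spectrum, library):
    """Name of the library entry most similar to the current frame spectrum."""
    return max(library, key=lambda name: cosine(library[name], frame_spectrum))

def spectral_subtract(mag, noise_est, alpha, delta, beta):
    """Subtract scaled noise per bin; keep a floor of beta * mag otherwise."""
    out = []
    for y, n in zip(mag, noise_est):
        diff = y - alpha * n
        out.append(diff if diff > delta else beta * y)
    return out

def mask_attenuation(f_hz, a_low, a_high, f_cut, spread):
    """ASSUMED shape: flat gain a_low up to f_cut, then a smooth roll-off
    that approaches a_low * (1 - a_high) at high frequencies."""
    if f_hz <= f_cut:
        return a_low
    rolloff = 1.0 - math.exp(-((f_hz - f_cut) ** 2) / (2.0 * spread ** 2))
    return a_low * (1.0 - a_high * rolloff)

def compensate(mag, freqs_hz, lam, **mask_params):
    """ASSUMED regularized inverse: gain h / (h^2 + lam), bounded for small h."""
    out = []
    for m, f in zip(mag, freqs_hz):
        h = mask_attenuation(f, **mask_params)
        out.append(m * h / (h * h + lam))
    return out

# Toy three-bin example with made-up values.
library = {"suction": [0.9, 0.4, 0.1], "electrosurgical": [0.1, 0.3, 0.95]}
noise_params = {"suction": (2.0, 0.05, 0.1), "electrosurgical": (1.5, 0.05, 0.1)}
freqs = [500.0, 2000.0, 6000.0]
frame = [1.0, 0.5, 0.2]
noise_est = [0.2, 0.3, 0.02]

kind = dominant_noise(frame, library)
alpha, delta, beta = noise_params[kind]
denoised = spectral_subtract(frame, noise_est, alpha, delta, beta)
restored = compensate(denoised, freqs, lam=0.01,
                      a_low=0.9, a_high=0.6, f_cut=2000.0, spread=2000.0)
```

With this assumed gain, the compensation never exceeds `1 / (2 * sqrt(lam))`, which is one way to realize the stated goal of limiting high-frequency compensation gain when the attenuation function approaches zero.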
[0026] It should be noted that the process of dividing the speech signal into different speech segments with start and end time markers specifically includes: performing continuous frame-by-frame processing on the speech signal after noise reduction and mask occlusion compensation, calculating the speech energy frame by frame to characterize whether there is valid spoken speech in the current time slice; when the speech energy of several consecutive frames is higher than the background energy level, it is determined that the speech state has been entered, and the time point when the condition is first met is determined as the start time of the speech segment; when the speech energy of several consecutive frames is lower than the background energy level, it is determined that the speech state has ended, and the corresponding time point is determined as the end time of the speech segment; by introducing time continuity constraints in the process of determining the presence and disappearance of speech, missegmentation caused by instantaneous noise or short pauses is avoided, thereby stably dividing the continuous speech signal into multiple speech segments with clear start and end time markers.
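The frame-by-frame segmentation with time continuity constraints described above can be sketched in Python as follows. The numbers of consecutive frames required to enter and leave the speech state (`min_on`, `min_off`), the frame duration, and the choice to place the end time at the first frame of the quiet run are assumptions, since the paragraph only speaks of "several consecutive frames".

```python
def segment_speech(energies, background, frame_s=0.02, min_on=3, min_off=5):
    """Split per-frame energies into (start_time, end_time) speech segments.

    Speech starts once min_on consecutive frames exceed the background
    level, and ends once min_off consecutive frames fall at or below it.
    """
    segments = []
    in_speech = False
    run = 0       # consecutive frames supporting a state change
    start = 0
    for i, e in enumerate(energies):
        above = e > background
        if not in_speech:
            run = run + 1 if above else 0
            if run >= min_on:
                in_speech = True
                start = i - min_on + 1   # first frame of the loud run
                run = 0
        else:
            run = run + 1 if not above else 0
            if run >= min_off:
                in_speech = False
                end = i - min_off + 1    # first frame of the quiet run (assumed)
                segments.append((start * frame_s, end * frame_s))
                run = 0
    if in_speech:
        # Signal ended while still in the speech state.
        segments.append((start * frame_s, len(energies) * frame_s))
    return segments

# One-frame pause at index 5 and one-frame spike at index 13.
energies = [0, 0, 5, 5, 5, 0, 5, 5, 0, 0, 0, 0, 0, 9, 0, 0]
segments = segment_speech(energies, background=1, frame_s=1.0)
```

In this toy input the single quiet frame inside the utterance does not split the segment, and the isolated loud frame near the end does not open a new one, which is the behaviour the continuity constraint is meant to provide.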
[0027] Methods for calculating the prior probability of speech activity include: The system receives the speech segment sequence and defines a set of operational states of the endoscope during the examination process based on the standard operating procedures for gastrointestinal endoscopy. The set of operational states includes endoscope advancement, endoscope observation, endoscope manipulation, and endoscope retraction. Based on the set of operational states, a state transition probability matrix for the endoscopic operational states is constructed. The state transition probability matrix is $P = [p_{ij}]$, where $p_{ij}$ denotes the probability that the endoscope transitions from operational state $i$ to operational state $j$ within an adjacent time period, and $i$ and $j$ are indices of the operational states. Using the state transition probability matrix, the probability of the endoscope being in each operational state at the current moment is estimated, and the statistical prior of the probability of the doctor's voice activity occurring in each operational state is obtained. Combined with this statistical prior, the prior probability of the voice activity at the current moment is calculated.
[0028] The prior probability of voice activity is: $P_{\mathrm{prior}}(t) = \sum_{s \in S} P(V \mid s)\, P(s_t = s)$, where $P_{\mathrm{prior}}(t)$ denotes the prior probability of speech activity at time $t$; $P(V \mid s)$ denotes the conditional probability that the doctor engages in voice activity when the endoscope is in operational state $s$, obtained through historical statistics or training data; $P(s_t = s)$ denotes the probability that the endoscope is in operational state $s$ at time $t$, usually calculated from the previous operational state and the state transition probability matrix; $s$ represents any operational state in the set $S$ of endoscopic operational states; and $t$ is the time index.
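As a minimal sketch, the prior described above (the per-state speech probabilities weighted by the current state probabilities, which are propagated through the state transition matrix) could be computed as follows. All numeric values in `TRANSITION` and `SPEECH_GIVEN_STATE` are hypothetical placeholders, not values from this application, and the function names are illustrative.

```python
STATES = ["advancement", "observation", "manipulation", "retraction"]

# Hypothetical row-stochastic matrix: TRANSITION[i][j] is the probability of
# moving from state i to state j between adjacent time periods.
TRANSITION = [
    [0.70, 0.25, 0.04, 0.01],
    [0.05, 0.70, 0.15, 0.10],
    [0.02, 0.38, 0.55, 0.05],
    [0.00, 0.15, 0.05, 0.80],
]

# Hypothetical probability of doctor speech in each state (from historical statistics).
SPEECH_GIVEN_STATE = [0.10, 0.60, 0.30, 0.20]

def propagate(state_probs, transition):
    """One-step update of the state distribution using the transition matrix."""
    n = len(state_probs)
    return [sum(state_probs[i] * transition[i][j] for i in range(n)) for j in range(n)]

def speech_prior(state_probs, speech_given_state):
    """Weight each state's speech probability by the probability of that state."""
    return sum(q * p for q, p in zip(speech_given_state, state_probs))

# Example: the endoscope was known to be in the advancement state one step earlier.
current = propagate([1.0, 0.0, 0.0, 0.0], TRANSITION)
prior_now = speech_prior(current, SPEECH_GIVEN_STATE)
```

With these placeholder values, `prior_now` is low because the endoscope most likely remains in the advancement state, where speech is assumed to be rare.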
[0029] Methods for filtering speech event sequences include: A state-modulated speech activity decision function is constructed, and the speech activity detection threshold of the speech segment sequence is dynamically modulated according to the prior probability of speech activity, so that the threshold adapts to changes in the endoscopic operation state. The speech activity decision function is defined by the following quantities: the speech activity detection decision result at a given time, where a value of 1 indicates that speech activity was detected and a value of 0 indicates that it was not; an indicator function, which outputs 1 if the condition within its square brackets is true and 0 otherwise; the speech energy feature value at that time, usually calculated from the short-time energy or weighted energy of the speech signal in the current time frame; the speech activity detection threshold, used to distinguish between speech and non-speech signals; and the prior probability modulation coefficient, which controls the influence of the prior probability of speech activity on the speech activity detection threshold and usually takes a value from 0 to 1. Based on the decision result, the speech segment sequence is screened and recombined. Speech segments that are inconsistent with the operational state logic of the endoscopy procedure are removed, and only speech segments that are consistent with the endoscopy procedure in both time sequence and operational state are retained, forming a speech event sequence that matches the endoscopy procedure.
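A sketch of a state-modulated decision of the kind described above is given below. The exact decision formula is not reproduced in this text, so the modulation used here, in which the effective threshold is the base threshold scaled by (1 - alpha * prior), is an assumption chosen so that a higher speech prior lowers the threshold. The helper `calibrate_threshold` follows the statement elsewhere in this description that the threshold is the average of collected speech energy feature values. All function names are hypothetical.

```python
from statistics import mean

def calibrate_threshold(energy_samples):
    """Base threshold taken as the mean of collected speech energy feature values."""
    return mean(energy_samples)

def vad_decision(energy, threshold, prior, alpha=0.5):
    """Return 1 if speech activity is detected, else 0 (assumed modulation form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    # A high prior probability of speech lowers the effective threshold.
    effective = threshold * (1.0 - alpha * prior)
    return 1 if energy > effective else 0

def filter_segments(segments, threshold, alpha=0.5):
    """Keep segments (dicts with 'energy' and 'prior' keys) that pass the decision."""
    return [s for s in segments
            if vad_decision(s["energy"], threshold, s["prior"], alpha) == 1]
```

With `alpha = 0` the prior is ignored and the decision reduces to a fixed-threshold comparison; larger values of `alpha` let the operational state influence the result more strongly.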
[0030] Methods for medical speech transcription of speech event sequences include: The system receives a sequence of voice events, each corresponding to a voice signal with start and end time markers. According to the time order of the voice event sequence, the system extracts the voice signal segments corresponding to each voice event in sequence, and performs MFCC acoustic feature extraction processing on the voice signal segments to convert the time-domain voice signal into an acoustic feature sequence for speech recognition. The acoustic feature sequence is input into a speech recognition model pre-trained for medical scenarios. The speech recognition model decodes the acoustic feature sequence corresponding to each speech event and outputs a text transcription result that corresponds one-to-one with the speech event, forming different transcribed text fragments.
[0031] The training process of the speech recognition model includes: collecting and constructing a medical speech recognition dataset, which includes acoustic feature sequences and their corresponding text transcription results; and dividing the dataset into training set, validation set and test set according to a preset ratio. A medical speech recognition model is constructed by using the acoustic feature sequence obtained after acoustic feature extraction from speech samples as the model input and the corresponding text transcription result (text annotation or character sequence) as the model output. The speech recognition model includes an input layer, at least one hidden layer, and an output layer. The input layer is used to receive the acoustic feature sequence, the hidden layer is used to model the nonlinear mapping relationship between acoustic features and speech units, and the output layer is used to output the corresponding text transcription result. The speech recognition model is a multilayer perceptron model. During model training, the cross-entropy loss function is used to measure the difference between the model output and the real annotation, and the model parameters are updated through the backpropagation algorithm. Stochastic gradient descent or its improved algorithm is used as the optimizer to iteratively train the model based on the training set, and the recognition performance of the model is evaluated through the validation set. The model parameters are then tuned based on the validation set evaluation results. Training is stopped when the recognition performance on the validation set reaches a preset threshold or no longer improves in several consecutive iterations. Finally, the performance of the trained speech recognition model is evaluated using the test set, and the trained model is used to transcribe medical speech signals in speech event sequences.
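The stopping rule described above (stop when validation performance reaches a preset threshold or fails to improve over several consecutive iterations) can be sketched independently of the model itself. In this sketch, `train_step` and `evaluate` are hypothetical callables standing in for one training pass and one validation-set evaluation; the default `target` and `patience` values are illustrative assumptions.

```python
def train_with_early_stopping(train_step, evaluate, max_epochs=100,
                              target=0.95, patience=3):
    """Run training epochs until a stopping condition is met.

    Returns (epochs_run, reason, history), where history holds validation scores.
    """
    best = float("-inf")
    stale = 0        # consecutive epochs without improvement
    history = []
    for epoch in range(1, max_epochs + 1):
        train_step()                 # one pass of parameter updates on the training set
        score = evaluate()           # recognition performance on the validation set
        history.append(score)
        if score >= target:
            return epoch, "target_reached", history
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, "no_improvement", history
    return max_epochs, "max_epochs", history
```

The test set is deliberately not touched inside this loop; it would be used once, after training stops, to report final performance.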
[0032] Methods for generating speech description data include: Based on the start and end time markers corresponding to each voice event in the voice event sequence, endoscopic images corresponding to the time interval of the voice event are acquired from the endoscopic video stream generated during the gastrointestinal endoscopy examination, so that each voice event is associated with at least one frame of endoscopic image. Medical image analysis is performed on the acquired endoscopic images to extract visual feature information. Based on a pre-set knowledge base of gastrointestinal endoscopy standards, the examination site information and lesion feature information corresponding to the endoscopic images are identified to obtain image recognition results. It should be noted that after acquiring the endoscopic images corresponding to the time interval of the voice event, medical image analysis processing is first performed on the endoscopic images. This process includes brightness equalization, color correction, and noise suppression of the raw endoscopic images to eliminate interference caused by changes in lighting, lens angle, or body fluid reflection. Based on this, region analysis and feature extraction are performed on the endoscopic images to extract visual feature information that can characterize the morphology, texture distribution, color changes, and structural contours of the mucosal surface, so that the extracted visual features can reflect the typical medical imaging features within the current endoscopic field of view.
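The time-based association between voice events and endoscopic frames described above might be sketched as follows, assuming that frame timestamps are sorted and share a time base with the voice event markers. The function name `frames_for_event` and the nearest-frame fallback are assumptions added so that every event is associated with at least one frame.

```python
from bisect import bisect_left, bisect_right

def frames_for_event(frame_times, start, end):
    """Return indices of frames whose timestamps fall within [start, end].

    frame_times must be sorted in ascending order. If no frame lies inside
    the interval, the frame nearest to the interval midpoint is returned.
    """
    lo = bisect_left(frame_times, start)
    hi = bisect_right(frame_times, end)
    if lo < hi:
        return list(range(lo, hi))
    if not frame_times:
        return []
    mid = (start + end) / 2.0
    i = bisect_left(frame_times, mid)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
    return [min(candidates, key=lambda j: abs(frame_times[j] - mid))]
```

For example, with frames at 0.0, 0.5, 1.0, 1.5 and 2.0 seconds, a voice event spanning 0.4 to 1.1 seconds is associated with the frames at 0.5 and 1.0 seconds.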
[0033] The pre-built gastrointestinal endoscopy standard knowledge base was constructed before system deployment. Its content is organized based on clinical gastrointestinal endoscopy examination standards and standard reporting requirements, including anatomical descriptions of standard examination sites in the gastrointestinal tract, typical visual features of different sites in the endoscopic field of view, morphological features of common lesion types, and their corresponding medical description rules. The knowledge base is stored in a structured format, establishing a clear correspondence between examination sites and their visual feature patterns, and between lesion types and their imaging manifestations, thus providing a basis for subsequent image recognition. Based on this, the visual feature information extracted from the endoscopic image is matched with the site feature model and lesion feature model in the pre-built gastrointestinal endoscopy standard knowledge base for similarity. Based on the matching results, the examination site information corresponding to the current endoscopic image is identified, and the presence of abnormal structures or lesion features is determined, thereby obtaining image recognition results including the examination site type and lesion feature description.
[0034] Semantic analysis is performed on the image recognition results to identify vague verbal descriptions with unclear location references or incomplete lesion descriptions. The vague verbal descriptions are then associated with the visual feature information at the corresponding time. Using the examination site information and lesion feature information identified in the endoscopic images, the semantics of the vague verbal descriptions are completed to generate speech description data consistent with the actual situation of the endoscopic examination.
[0035] It should be noted that semantic parsing is performed on the image recognition results to convert the examination site information and lesion feature information obtained from image recognition into semantic units that can correspond to speech text; at the same time, semantic analysis is performed on the text transcription results corresponding to the speech events to determine whether there are vague spoken contents that only use pronouns, indicative words or ellipsis without clearly indicating the examination site or lesion features, thereby identifying speech description segments that need semantic completion.
[0036] After recognizing the ambiguous spoken content, based on the temporal correspondence between the speech event and the endoscopic image, the ambiguous spoken content is associated with the visual feature information at the corresponding moment. Using the clear examination site information and lesion feature information in the endoscopic image recognition results, the ambiguous spoken content is supplemented or semantically replaced, so that the speech description that originally depended on the context to understand is transformed into semantically complete and clearly directed medical description content, thereby generating speech description data consistent with the actual situation of the endoscopic examination.
[0037] It should be noted that, at the semantic level, natural language semantic analysis is performed on the speech-to-text to identify referential words, location words, or ellipsis descriptions, such as vague spoken content that only describes "here," "this area," or "something is abnormal," without containing specific anatomical locations or lesion names. Simultaneously, semantic mapping is performed on the endoscopic image recognition results within the corresponding time window, converting the examination site information and lesion feature information identified in the images into standardized semantic labels. By calculating the matching relationship between speech semantic labels and image semantic labels, the visual semantic information most likely corresponding to the missing semantic components in the speech is determined, thereby establishing semantic-level associations.
[0038] After completing both temporal and semantic matching, a confidence assessment mechanism is introduced to constrain the association results. The system comprehensively evaluates the association results based on the confidence of the image recognition results, the degree of semantic incompleteness of the speech, and the consistency of visual features within the time window. Only when the image recognition confidence meets a preset threshold is the visual feature information confirmed as usable for semantic completion of ambiguous spoken content; otherwise, the speech event is marked as requiring manual confirmation or the original spoken content is retained to avoid medical risks introduced by erroneous completion.
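The confidence-gated completion described above can be illustrated with a small sketch. The vague phrases are taken from the examples given in this description; the pattern list, the function name `complete_description`, the output format, and the default confidence threshold are assumptions made for illustration only.

```python
import re

# Vague expressions named in the description; a real system would use a richer analysis.
VAGUE_PATTERNS = [
    re.compile(r"\bhere\b"),
    re.compile(r"\bthis area\b"),
    re.compile(r"\bsomething is abnormal\b"),
]

def complete_description(text, site, lesion, confidence, min_confidence=0.8):
    """Complete vague spoken content only when image recognition is confident enough."""
    lowered = text.lower()
    if not any(p.search(lowered) for p in VAGUE_PATTERNS):
        return {"text": text, "status": "unchanged"}
    if confidence < min_confidence:
        # Keep the original wording and flag it, rather than risk a wrong completion.
        return {"text": text, "status": "needs_manual_confirmation"}
    completed = f"{text} [site: {site}; finding: {lesion}]"
    return {"text": completed, "status": "completed"}
```

Appending the site and finding, instead of rewriting the sentence, keeps the doctor's original words visible in the record; that design choice is also an assumption of this sketch.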
[0039] Methods for automatically capturing images of lesions and recording corresponding videos of lesions upon detection include: Based on the examination site information and lesion feature information identified in the endoscopic images, and in accordance with the image retention rules set for different examination sites and lesion types in the pre-set gastrointestinal endoscopy standard knowledge base, the endoscopic images are automatically retained and managed. It should be noted that the pre-defined knowledge base for gastrointestinal endoscopy standards structures the image retention requirements for different examination sites in the endoscopy guidelines into executable rules. Each image retention rule contains at least three types of constraint information: first, the examination site constraint, which limits the current image to specific anatomical sites such as the esophagus, gastric body, gastric antrum, and duodenum; second, the lesion type or state constraint, which distinguishes between normal mucosa, suspicious lesions, or clear lesions; and third, the image retention trigger condition, which limits the image retention to be performed only when the standard viewing angle, clarity, and coverage are met.
[0040] In practice, after obtaining the examination site information and lesion feature information corresponding to the endoscopic image, the system matches the site identifier and lesion type label of the current image with the image retention rules in the standardized knowledge base. When a matching rule is found and the current image meets the image quality requirements of that rule (e.g., the lesion area is in the center of the field of view, the boundary is clear, and there is no obstruction), the system automatically marks the frame image as an image retention object and completes storage, numbering, and archiving according to the rule requirements, thereby achieving automatic image retention management without manual intervention.
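A minimal sketch of the rule matching described above is shown below, with each rule carrying the three kinds of constraint named in the description (examination site, lesion state, and trigger condition). The class names, field names, and quality measures (`clarity`, `centered`, `occluded`) are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class RetentionRule:
    site: str               # e.g. "esophagus", "gastric body", "gastric antrum", "duodenum"
    lesion_state: str       # "normal", "suspicious" or "clear_lesion"
    min_clarity: float      # trigger condition: minimum image clarity score
    require_centered: bool  # trigger condition: region of interest must be centered

@dataclass(frozen=True)
class FrameInfo:
    site: str
    lesion_state: str
    clarity: float
    centered: bool
    occluded: bool

def matching_rule(frame: FrameInfo, rules: Sequence[RetentionRule]) -> Optional[RetentionRule]:
    """Return the first rule the frame satisfies, or None if it should not be retained."""
    for rule in rules:
        if rule.site != frame.site or rule.lesion_state != frame.lesion_state:
            continue
        if frame.occluded or frame.clarity < rule.min_clarity:
            continue
        if rule.require_centered and not frame.centered:
            continue
        return rule
    return None
```

A frame for which `matching_rule` returns a rule would then be stored, numbered, and archived according to that rule.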
[0041] When the image recognition result meets the preset lesion determination conditions in the preset gastrointestinal endoscopy standard knowledge base, the lesion retention mechanism is automatically triggered. At least one frame of endoscopy image containing the lesion area is extracted from the endoscopy video stream and stored as the lesion image. At the same time, the lesion video recording process is triggered. Starting from the time point when the lesion is first identified, endoscopy video data of a preset duration is continuously collected to generate the corresponding lesion video.
[0042] It should be noted that the preset lesion determination criteria also come from the gastrointestinal endoscopy standard knowledge base, but their function is not to record images, but to determine whether to proceed to the lesion-level processing flow. These criteria are usually composed of multiple features, including but not limited to: whether abnormal tissue regions are identified in the image, whether the morphological characteristics of the abnormal regions meet the determination threshold of a certain type of lesion, the stability of the abnormal regions in consecutive frames, and whether the abnormal regions are defined as lesion types that need to be recorded in the standard knowledge base.
[0043] During system operation, when image recognition results indicate the presence of an abnormal region in the current endoscopic image that meets the aforementioned lesion detection criteria, a lesion is detected. Once this determination is established, the system automatically triggers a lesion retention mechanism. This mechanism not only extracts keyframes containing the lesion region for storage as lesion images, but also initiates a lesion video recording process, using the time point when the lesion is first identified as the starting reference. This process continuously collects video data for a preset duration to fully record the dynamic manifestations of the lesion and the examination process.
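The lesion trigger described above, including the stability requirement over consecutive frames and the fixed-duration recording window, can be sketched as follows. The function name, the number of stable frames, and the rule that a new window cannot start while a previous one is still recording are assumptions.

```python
def lesion_recording_windows(abnormal_flags, frame_times, stable_frames=3, duration_s=10.0):
    """Return (start_s, end_s) recording windows triggered by stable lesion detections.

    abnormal_flags[i] is truthy when frame i meets the per-frame lesion criteria.
    A lesion is confirmed once the flag holds for `stable_frames` consecutive frames;
    the window starts at the first frame of that run.
    """
    windows = []
    run = 0
    busy_until = float("-inf")   # end time of the window currently being recorded
    for i, (flag, t) in enumerate(zip(abnormal_flags, frame_times)):
        run = run + 1 if flag else 0
        if run == stable_frames and t >= busy_until:
            start = frame_times[i - stable_frames + 1]
            windows.append((start, start + duration_s))
            busy_until = start + duration_s
    return windows
```

Requiring several consecutive abnormal frames prevents a single misclassified frame from starting a recording.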
[0044] Methods for generating unique QR codes include: The voice description data, lesion images and lesion videos are integrated, with the voice event sequence as the time line. Based on the examination site information and lesion feature information in the voice description data, the lesion images and lesion videos stored in the same time interval are combined with the corresponding voice description data to form a multimodal examination record that includes examination site, lesion description and imaging evidence. According to the preset gastrointestinal endoscopy report template, the multimodal examination records are structured and arranged, and the voice description data is automatically filled into the corresponding examination process record and lesion description column in the gastrointestinal endoscopy report template. The corresponding lesion images or lesion video indexes are displayed in the corresponding positions to generate a gastrointestinal endoscopy report that conforms to the clinical gastrointestinal endoscopy examination standards. After generating the endoscopy report, the system receives a recorded audio interpretation of the examination results from the doctor and binds the audio interpretation with the corresponding endoscopy report for storage, thus creating an integrated examination file. A unique identifier is generated based on the endoscopy report and encoded as a QR code and embedded into the endoscopy report.
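The description does not state how the unique identifier is derived from the report. One possible approach, shown here purely as an assumption, is to hash a canonical serialization of the structured report. The function name `report_identifier` and the report fields are hypothetical, and encoding the resulting string as a QR code would be delegated to a QR code library, which is not shown.

```python
import hashlib
import json

def report_identifier(report: dict) -> str:
    """Derive a deterministic identifier from the structured report content."""
    # Sorting keys makes the identifier independent of field ordering.
    canonical = json.dumps(report, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical report content for illustration.
example_report = {
    "exam_type": "gastroscopy",
    "findings": [{"site": "gastric antrum", "description": "erosion"}],
    "interpretation_audio": "interpretation_0001.wav",
}
example_id = report_identifier(example_report)
```

A content-derived identifier changes whenever the report changes. If the identifier must stay stable across report revisions, a randomly generated identifier stored alongside the report would be an alternative.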
[0045] It should be noted that the pre-set gastrointestinal endoscopy report template is based on clinical guidelines and actual writing habits for gastrointestinal endoscopy. Before system deployment, the core elements that a gastrointestinal endoscopy report should include were identified based on industry-published digestive endoscopy guidelines, hospital internal standard report formats, and commonly used writing templates by doctors. These core elements include basic patient information, type and scope of examination, description of the examination process, findings at each anatomical location, nature and size of lesions, diagnostic conclusions, and treatment recommendations. This process clarified which fields are mandatory, which are optional, and the rules for displaying or hiding fields in different examination scenarios.
[0046] The speech activity detection threshold is set by staff. It is obtained by collecting different speech energy feature values and taking the average of multiple speech energy feature values as the speech activity detection threshold.
[0047] In this embodiment, by pre-constructing a typical noise spectrum feature library that matches the gastrointestinal endoscopy environment, and introducing a noise type discrimination mechanism based on cosine similarity in real-time processing, dynamic identification of the dominant background noise type is achieved. This enables adaptive selection of noise reduction parameters according to the noise type, significantly improving the ability to preserve effective spoken speech in complex endoscopy environments. By constructing a frequency-selective attenuation function caused by mask occlusion and its regularized spectrum compensation model, targeted compensation is made for the non-uniform frequency attenuation caused by masks, effectively restoring the most critical high-frequency information in speech for recognition, and improving the clarity and recognizability of medical speech in real examination scenarios. At the same time, by combining speech energy changes and temporal continuity features for speech activity detection, stable determination of the start and end positions of speech segments is achieved, avoiding missegmentation problems caused by transient noise and short pauses, forming a speech segment sequence consistent with the timeline of the examination process.
[0048] By incorporating the standard operating procedures of gastrointestinal endoscopy into the speech activity detection process, and through operational state modeling and prior probability calculation of speech activities, speech detection becomes process-aware, significantly reducing false positives and false negatives caused by equipment noise and environmental interference. A dynamic speech activity decision mechanism based on state modulation enables the speech detection threshold to change adaptively with the endoscopic operation state, improving the robustness and stability of speech activity detection in complex medical acoustic environments. By performing process consistency screening and reorganization on speech segment sequences, the final speech event sequence closely matches the gastrointestinal endoscopy process in both temporal structure and semantic flow, providing a highly reliable speech input foundation for medical speech recognition and examination report generation.
Example
[0049] As shown in Figure 2, a method for generating gastrointestinal endoscopy reports based on medical voice recognition is provided. For parts not described in detail in this embodiment, refer to the description in Embodiment 1. The method includes:
S1. During the gastrointestinal endoscopy examination, collect the doctor's real-time spoken voice, perform adaptive noise reduction and mask occlusion compensation on the real-time spoken voice, and output a sequence of voice segments.
S2. Receive the speech segment sequence, calculate the prior probability of speech activity by combining the transition relationships between the different operation states in the endoscopy examination process, and perform speech activity detection on the speech segment sequence according to the prior probability of speech activity to filter out the speech event sequence consistent with the examination process.
S3. Perform medical speech transcription on the speech event sequence and collect endoscopic images at the corresponding time. Combine the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generate speech description data.
S4. Based on the preset knowledge base of gastrointestinal endoscopy standards, automatically capture endoscopic images, automatically capture lesion images when lesions are detected, and record the corresponding lesion videos.
S5. Integrate the voice description data, lesion images, and lesion videos, automatically generate the gastrointestinal endoscopy report according to the preset report template, associate the report with the doctor's interpretation recording, and generate a unique QR code.
[0050] Since the electronic device described in this embodiment is the electronic device used to implement the gastrointestinal endoscopy report generation system and method based on medical voice recognition described in this application embodiment, those skilled in the art can understand the specific implementation method and various variations of the electronic device in this embodiment based on the gastrointestinal endoscopy report generation system and method based on medical voice recognition described in this application embodiment. Therefore, how the electronic device implements the method in this application embodiment will not be described in detail here. As long as those skilled in the art implement the gastrointestinal endoscopy report generation system and method based on medical voice recognition described in this application embodiment, the electronic device used is within the scope of protection of this application.
[0051] The above formulas are all dimensionless calculations. The formulas are obtained by software simulation on a large amount of collected data so as to approximate real conditions as closely as possible. The preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.
[0052] The above description is merely a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the concept of the present invention are within its scope of protection. It should be noted that, for those of ordinary skill in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A system for generating gastrointestinal endoscopy reports based on medical voice recognition, characterized in that it comprises: a voice acquisition and processing module, used to acquire the doctor's real-time spoken voice during the gastrointestinal endoscopy examination, perform adaptive noise reduction and mask occlusion compensation on the real-time spoken voice, and output a sequence of voice segments; a voice activity detection module, used to receive the voice segment sequence, calculate the prior probability of voice activity by combining the transition relationships between the different operation states in the endoscopy examination process, and perform voice activity detection on the voice segment sequence based on the prior probability of voice activity to filter out the voice event sequence that is consistent with the examination process; a speech-image alignment module, used to perform medical speech transcription on the speech event sequence, acquire endoscopic images at the corresponding times, and combine the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generate speech description data; a lesion video capture module, used to automatically capture endoscopic images based on a preset knowledge base of gastrointestinal endoscopy standards, automatically capture lesion images when lesions are detected, and record the corresponding lesion videos; and a report generation and interpretation module, used to integrate the voice description data, lesion images, and lesion videos, automatically generate the gastrointestinal endoscopy report according to a preset report template, associate the report with the doctor's interpretation recording, and generate a unique QR code.
2. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 1, characterized in that, The method for collecting doctors' real-time spoken voice includes: After detecting the start signal of the gastrointestinal endoscopy examination, the voice acquisition and processing module automatically turns on the voice acquisition function, and continuously acquires the real-time spoken voice produced by the doctor during the examination through the voice acquisition device worn by the doctor. During the voice acquisition process, no operational constraints are imposed on the doctor's dictation behavior, and the acquired real-time dictation is recorded and cached in real time in the form of raw voice signals; at the same time, the real-time dictation and the running time of the gastrointestinal endoscopy are time-stamped synchronously.
3. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 2, characterized in that, The method for outputting the speech segment sequence includes: A noise spectrum feature library of typical noise sources in the endoscopy room was pre-constructed based on the gastrointestinal endoscopy environment. Each noise source corresponds to the background noise characteristics generated by the suction device, flushing pump, monitoring equipment and electrosurgical equipment during the operation of the endoscopy process. In the real-time speech processing, the acquired real-time spoken speech is subjected to spectral analysis to obtain the spectral features of the speech signal; the similarity between the spectral features of the speech signal and various background noise features in the noise spectral feature library is calculated by cosine similarity to identify the type of background noise that has the greatest impact on the speech signal at the current moment. After identifying the background noise type with the greatest impact, the corresponding noise reduction parameters are selected based on the background noise type, and adaptive noise reduction processing associated with the background noise type is performed on the speech signal, thereby suppressing environmental noise while preserving the effective speech components in the real-time spoken speech. After completing the adaptive noise reduction process, a frequency-selective attenuation function caused by mask occlusion is constructed. The frequency-selective attenuation function divides the speech spectrum into low-frequency and high-frequency ranges according to a preset cutoff frequency, and uses different attenuation characteristics to characterize the effect of the mask on the speech signal. 
Based on the frequency-selective attenuation function, a corresponding spectrum compensation function is constructed, and regularization constraints are introduced in the compensation process. The speech signal after adaptive noise reduction is input into the spectrum compensation function to compensate for the frequency attenuation caused by mask occlusion, and the speech signal after mask occlusion compensation is obtained. Speech activity detection is performed on the speech signal after adaptive noise reduction and mask occlusion compensation. The speech signal is segmented according to the characteristics of speech energy change and temporal continuity. The speech signal is divided into different speech segments with start and end time markers, and then combined in chronological order to form a speech segment sequence.
4. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 3, characterized in that, The method for calculating the prior probability of speech activity includes: The system receives the speech segment sequence and defines a set of operational states of the endoscope during the examination process based on the standard operating procedures for gastrointestinal endoscopy. The set of operational states includes endoscope advancement, endoscope observation, endoscope manipulation, and endoscope retraction. Based on the set of operational states, a state transition probability matrix for the endoscopic operational states is constructed. Using the state transition probability matrix, the probability of the endoscope being in each operational state at the current moment is estimated, and the statistical prior of the probability of the doctor's voice activity occurring in each operational state is obtained. Combined with this statistical prior, the prior probability of the voice activity at the current moment is calculated.
5. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 4, characterized in that, The method for filtering the voice event sequence includes: A state-modulated speech activity decision function is constructed, and the speech activity detection threshold of the speech segment sequence is dynamically modulated according to the prior probability of speech activity, so that the speech activity detection threshold is adaptively adjusted with the change of the endoscopic operation state. Based on the decision result obtained with the speech activity detection threshold, the speech segment sequence is screened and recombined. Speech segments that are inconsistent with the operational state logic of the endoscopy procedure are removed, and only speech segments that are consistent with the endoscopy procedure in both time sequence and operational state are retained, forming a speech event sequence that matches the endoscopy procedure.
6. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 5, characterized in that the method for medical speech transcription of the speech event sequence includes: the system receives a sequence of speech events, each corresponding to a speech signal with start and end time markers. Following the time order of the speech event sequence, the system extracts the speech signal segment corresponding to each speech event in turn and performs MFCC (Mel-frequency cepstral coefficient) acoustic feature extraction on each segment, converting the time-domain speech signal into an acoustic feature sequence for speech recognition. The acoustic feature sequence is input into a speech recognition model pre-trained for medical scenarios. The speech recognition model decodes the acoustic feature sequence corresponding to each speech event and outputs a text transcription result that corresponds one-to-one with the speech event, forming separate transcribed text fragments.
7. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 6, characterized in that the method for generating speech description data includes: based on the start and end time markers corresponding to each speech event in the speech event sequence, endoscopic images corresponding to the time interval of the speech event are acquired from the endoscopic video stream generated during the gastrointestinal endoscopy examination, so that each speech event is associated with at least one frame of endoscopic image. Medical image analysis is performed on the acquired endoscopic images to extract visual feature information. Based on a preset knowledge base of gastrointestinal endoscopy standards, the examination site information and lesion feature information corresponding to the endoscopic images are identified to obtain image recognition results. Semantic analysis is performed on the image recognition results to identify vague verbal descriptions with unclear location references or incomplete lesion descriptions. The vague verbal descriptions are then associated with the visual feature information at the corresponding time. Using the examination site information and lesion feature information identified in the endoscopic images, the semantics of the vague verbal descriptions are completed to generate speech description data consistent with the actual situation of the endoscopic examination.
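The first step of claim 7, associating each event with at least one endoscopic frame by its start and end time markers, can be sketched as a lookup over sorted frame timestamps. The function name and the nearest-frame fallback are assumptions; the claim only requires that every event ends up with at least one frame.

```python
from bisect import bisect_left, bisect_right


def frames_for_event(frame_times, start, end):
    """Return indices of frames whose timestamps fall within [start, end].

    frame_times must be sorted in ascending order. If no frame falls inside
    the interval, fall back (an assumption of this sketch) to the single
    frame nearest to the interval midpoint.
    """
    lo = bisect_left(frame_times, start)
    hi = bisect_right(frame_times, end)
    if lo < hi:
        return list(range(lo, hi))
    if not frame_times:
        return []
    mid = (start + end) / 2
    # Only the neighbours on either side of the gap can be nearest.
    candidates = [i for i in (lo - 1, lo) if 0 <= i < len(frame_times)]
    return [min(candidates, key=lambda i: abs(frame_times[i] - mid))]


# Usage: frames sampled every 0.5 s, event spoken from 0.4 s to 1.1 s.
frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
matched = frames_for_event(frame_times, 0.4, 1.1)
```

Binary search keeps the lookup cheap for long video streams, since frame timestamps are already in time order.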
8. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 7, characterized in that the method for automatically capturing a lesion image and recording a corresponding lesion video upon detection of a lesion includes: based on the examination site information and lesion feature information identified in the endoscopic images, and in accordance with the image retention rules set for different examination sites and lesion types in the preset knowledge base of gastrointestinal endoscopy standards, the endoscopic images are automatically retained and managed. When the image recognition result meets the lesion determination conditions preset in the knowledge base, the lesion retention mechanism is automatically triggered: at least one frame of endoscopic image containing the lesion area is extracted from the endoscopic video stream and stored as the lesion image. At the same time, the lesion video recording process is triggered: starting from the time point when the lesion is first identified, endoscopic video data of a preset duration is continuously collected to generate the corresponding lesion video.
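The trigger logic in claim 8 can be sketched as a single pass over time-stamped frames. The input format, the function name, and the default clip length are hypothetical; real lesion determination would come from the image recognition results and the knowledge base, which are reduced here to a boolean flag per frame.

```python
def capture_lesion(frames, clip_seconds=5.0):
    """Scan (timestamp, is_lesion) pairs in time order.

    Returns the timestamp of the first lesion frame (stored as the lesion
    image in this sketch) and the timestamps of the frames collected for
    the lesion video, which covers clip_seconds from first detection.
    """
    image_time = None
    clip = []
    for t, is_lesion in frames:
        if image_time is None:
            if not is_lesion:
                continue
            image_time = t  # first identification triggers both actions
        if t - image_time <= clip_seconds:
            clip.append(t)
        else:
            break  # preset duration reached
    return image_time, clip


# Usage: the lesion is first seen at t = 2 s; record a 2-second clip.
stream = [(0, False), (1, False), (2, True), (3, False), (4, True), (8, False)]
image_time, clip = capture_lesion(stream, clip_seconds=2.0)
```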
9. The gastrointestinal endoscopy report generation system based on medical voice recognition according to claim 8, characterized in that the method for generating a unique identifier QR code includes: the speech description data, lesion images, and lesion videos are integrated, with the speech event sequence as the timeline. Based on the examination site information and lesion feature information in the speech description data, the lesion images and lesion videos stored in the same time interval are combined with the corresponding speech description data to form a multimodal examination record that includes the examination site, lesion description, and imaging evidence. According to the preset gastrointestinal endoscopy report template, the multimodal examination records are structured and arranged, the speech description data is automatically filled into the corresponding examination process record and lesion description fields of the template, and the corresponding lesion images or lesion video indexes are displayed in the corresponding positions, generating a gastrointestinal endoscopy report that conforms to clinical gastrointestinal endoscopy examination standards. After the report is generated, the system receives the doctor's recorded audio interpretation of the examination results and binds the audio interpretation to the corresponding report for storage, creating an integrated examination file. A unique identifier is generated based on the gastrointestinal endoscopy report, encoded as a QR code, and embedded into the report.
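Claim 9 does not say how the unique identifier is derived from the report. As one possible reading, the sketch below hashes a canonical serialization of the report so that the same content always yields the same identifier. The use of SHA-256, the 32-character truncation, and the report fields are assumptions; rendering the identifier as a QR code would be left to a separate QR library and is not shown.

```python
import hashlib
import json


def report_identifier(report: dict) -> str:
    """Derive a deterministic identifier from the report content.

    Keys are sorted so that field order does not change the result.
    """
    canonical = json.dumps(
        report, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


# Usage with a hypothetical, heavily simplified report record.
report = {
    "exam_site": "gastric antrum",
    "lesion_description": "0.5 cm polyp",
    "lesion_images": ["img_0012.png"],
}
identifier = report_identifier(report)
```

A content-derived identifier changes whenever the report is edited; a deployment that needs a stable identifier across revisions might prefer a randomly generated one instead.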
10. A method for generating gastrointestinal endoscopy reports based on medical speech recognition, implemented using the gastrointestinal endoscopy report generation system based on medical speech recognition as described in any one of claims 1 to 9, characterized in that the method includes: S1. During the gastrointestinal endoscopy examination, collect the doctor's real-time spoken voice, and perform adaptive noise reduction and mask occlusion compensation on it to output a speech segment sequence. S2. Receive the speech segment sequence, calculate the prior probability of speech activity based on the transition relationships between the operation states of the endoscopy examination process, and perform speech activity detection on the speech segment sequence according to that prior probability to select the speech event sequence consistent with the examination process. S3. Perform medical speech transcription on the speech event sequence and collect endoscopic images at the corresponding times. Use the endoscopic images to complete the ambiguous spoken content in the speech event sequence and generate speech description data. S4. Based on the preset knowledge base of gastrointestinal endoscopy standards, automatically capture endoscopic images, automatically capture lesion images when lesions are detected, and record the corresponding lesion videos. S5. Integrate the speech description data, lesion images, and lesion videos, automatically generate the gastrointestinal endoscopy report according to the preset report template, associate the report with the doctor's interpretation recording, and generate a unique identifier QR code.