A medical image data processing method based on deep learning
By collecting multimodal data, establishing data relationship models, and predicting and identifying risky lesions, this approach addresses the problem of outdated imaging equipment in primary healthcare institutions, enabling efficient fusion of imaging data and intelligent diagnosis, thereby improving diagnostic accuracy and efficiency.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: BEIJING JINZHAO TONGHUI TECHNOLOGY CO LTD
- Filing Date: 2025-11-10
- Publication Date: 2026-06-26
AI Technical Summary
Outdated imaging equipment in primary healthcare institutions leads to inaccurate diagnostic results, low image quality, inconsistent operating standards, and insufficient diagnostic experience, all of which affect diagnostic accuracy and efficiency.
Multimodal data are collected and the physiological phase is tracked to determine steady state; a data relationship model is established; the image sequence is registered, compensated, and stabilized; and risky lesions are predicted and identified, with suggestions provided to guide the operator. The patient's physiological signals and ultrasound image data are fused through deep learning algorithms, and a hybrid network with a convolutional front end and a spatiotemporal Transformer backbone is used for prediction.
The method improves image processing quality, keeps images stable and clear, intelligently predicts lesion areas, and enhances the sensitivity and accuracy of early diagnosis.
Figure CN121481973B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical data processing technology, specifically to a deep learning-based method for processing medical image data.
Background Technology
[0002] Medical imaging data is information about the internal tissues of the human body stored in the form of a digital matrix through medical imaging technologies such as CT, MRI, PET, and ultrasound. It reflects the differences in X-ray absorption, electromagnetic wave characteristics, or metabolic activities of different tissues and is represented in a digital way. It is one of the most important sources of information in modern medicine.
[0003] The Chinese patent application (CN202510038224.X) discloses an artificial intelligence-based intelligent medical data processing system and method. This configuration method includes: employing deep learning-based image processing technology to extract multi-level features from CT scan images of chronic disease patients undergoing follow-up examinations, mining the texture and structural features of bronchiectasis regions in the CT scan images, and performing fine-grained semantic interaction fusion based on core correlation features to achieve a comprehensive understanding of the bronchiectasis state. This allows for intelligent identification of the bronchiectasis type, effectively improving the accuracy of bronchiectasis type identification and providing doctors with more accurate diagnostic information.
[0004] In the field of medical data processing, deep learning-based image processing techniques exist for extracting multi-level features from CT scans of chronic disease patients and mining the texture and structural features of bronchiectasis regions in those scans. In primary healthcare settings, however, outdated ultrasound imaging equipment is associated with problems such as low image quality, inconsistent operating standards, and insufficient diagnostic experience. These problems reduce diagnostic accuracy and efficiency and degrade the patient experience and the quality of medical care. There is therefore a need for a medical image data processing method that uses deep learning algorithms to fuse patient physiological signals with ultrasound image data to support auxiliary diagnosis.
Summary of the Invention
[0005] This invention provides a deep learning-based medical image data processing method, aiming to solve the problem of inaccurate diagnostic results caused by outdated imaging equipment in primary healthcare institutions.
[0006] The technical solution adopted by this invention to solve the above-mentioned technical problems is as follows: A deep learning-based medical image data processing method is provided, comprising:
[0007] Collecting multimodal data and tracking phase to determine steady state: data are acquired synchronously according to an acquisition template, pulse timing is extracted by adaptive-threshold peak detection, the instantaneous phase is calculated by linear interpolation between adjacent events, and statistics are calculated within a sliding window. These quantitative indicators are compared with preset thresholds, and steady state is determined by combining them with the peak loss rate and abrupt-change detection rules. The phase consistency between channels is also calculated to verify the result.
[0008] Establishing a data relationship model and registering and compensating the image sequence to stabilize it: the acquired data are split according to conceptual entities, and the acquisition records are saved to form the data relationship model. Fast global rigid estimation is performed on the image sequence, and an affine transformation is applied to complete coarse registration. A multi-scale pyramid dense optical flow network estimates the pixel-level displacement field, the displacement field is applied to the original pixels, and temporal fusion is performed with the optical flow confidence and registration residual as weights.
[0009] Predicting and identifying risky lesions and providing suggestions for auxiliary guidance: the registered and compensated short-time stabilized image sequence is cropped and normalized into an image tensor. The tensor is then mapped frame by frame to a conditional vector, which is aligned with the image tensor. A hybrid network with a convolutional front end and a spatiotemporal Transformer backbone is used as the prediction model to output a pixel-level risk heat map, a list of candidate lesions, and confidence intervals for each output. The structured metadata are then transmitted back to the operator's terminal in real time to provide the decision basis for real-time scanning guidance.
[0010] As a preferred implementation method, the specific steps for acquiring multimodal data and tracking phase to determine steady state are as follows:
[0011] Based on the patient's imaging locations, list the modalities to be acquired, including ultrasound, ECG, blood oxygenation, respiration, and IMU. Create an acquisition template by specifying the name, settings, corresponding view, and acquisition duration for each modality. Connect the sensors for each modality according to a predetermined topology and deploy hardware for unified triggering of acquisition. Share the time reference and trigger signal at the physical level. Set and lock the acquisition parameters on each acquisition device and write the parameters and device information to the session log to obtain a consistent raw data source and traceable parameter records. Before formal acquisition, perform a short-term synchronization test to calculate the relative delay and jitter statistics for each channel. Use the acquired timing deviation for correction and alignment. Acquire views one by one according to a predefined view sequence. Use an event logging tool to synchronously annotate key moments and bind and mark events with image frames.
[0012] Peak detection is applied to the collected data to extract the pulse peak time series. The event phase is defined by linear interpolation between adjacent peaks, as shown in the formula:
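[0013] $$\varphi(t) = 2\pi\,\frac{t - t_n}{t_{n+1} - t_n}, \qquad t_n \le t < t_{n+1},$$
[0014] where $t$ represents time, $t_n$ represents the time of the $n$th reference event, $t_{n+1}$ represents the time of the $(n+1)$th reference event, and $\varphi(t)$ indicates the phase at time $t$;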
[0015] By calculating statistics within a sliding window, a quantitative characterization of rhythm stability is obtained. The statistics are compared with preset thresholds, and a stability label is output by combining the peak loss rate and abrupt-change detection rules. The phase consistency index between different channels is calculated to achieve multimodal redundancy verification, which serves as an alternative basis for judgment when a single channel is affected by noise. The steady-state judgment result is used for real-time imaging strategy switching, and unstable periods are recorded for subsequent threshold tuning and model training.
[0016] The acquired raw time-domain signal is first subjected to bandpass filtering and baseline removal. The filtered signal is then normalized in amplitude and subjected to short-time energy extraction to obtain a preprocessed signal with concentrated spectrum, comparable amplitude, and easy peak identification. The preprocessed signal is then used to identify candidate peaks by a method of local maximum search combined with adaptive threshold discrimination. The threshold is adaptively set by the moving median of the short-time background energy. Physiological constraints and time-series verification are applied to the candidate peaks, with a minimum peak interval of ≥300ms set. False peaks with amplitudes lower than the noise baseline are removed. When multiple channels exist in parallel, a channel consistency verification and priority selection mechanism is used to fuse the peak time series. The final peak time is written into the metadata in a unified time-base timestamp format.
[0017] As a preferred implementation method, the specific steps for establishing a data relationship model and registering and compensating for the image-stabilized sequence are as follows:
[0018] Target use cases and access boundaries are identified in collaboration with clinical professionals, and a list of functional requirements is compiled to drive model design. The requirements are mapped to conceptual entities, and ER diagrams are drawn. The collected data are split according to conceptual entities, and each entity is assigned a globally unique identifier. A unified time base is used in all entities and metadata, and the sampling start time is written into each entity record. Key fields are validated. Foreign keys connect the different levels of the relational model, and indexes are created for frequently queried fields. A tiered storage strategy is adopted: object storage for large files, a relational database for structured metadata, a time-series database for signal indexing, and a graph database for complex traceability queries. Quality control and version management fields are embedded in the model, query views are designed for upper-level applications, and a standardized API is exposed using a REST architecture.
[0019] A phase correlation method is used to perform fast global rigid estimation between each frame of the image data and the reference frame, and an affine transformation is applied for initial alignment to eliminate large-scale translation and rotation differences. A multi-scale pyramid dense optical flow network then estimates the displacement field of the coarsely registered frame pairs, yielding a dense displacement field that describes pixel-level local displacement. Spatial smoothness regularization and temporal consistency penalty terms are added to the displacement field estimation, and the constrained optimization is solved iteratively. For each frame, an inverse mapping strategy applies the displacement field to the original pixels to achieve inverse transformation compensation. Within a short-time sliding window, the compensated frames are fused over time with weights based on the displacement field confidence and registration residual, improving the boundary sharpness and lesion visibility of the image sequence.
[0020] For each output stabilized frame, a quality index is calculated, and the displacement statistics and quality index results are written into the metadata. The quality index calculation formula is as follows:
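[0021] $$Q = \mathrm{SSIM}_{D}\!\left(I_{\mathrm{ref}}, I_{\mathrm{reg}}\right)\cdot\exp\!\left(-\lambda\,\frac{\bar{d}}{s}\right),$$
[0022] where $\mathrm{SSIM}_{D}(\cdot)$ represents the normalized weighted structural similarity, $D$ represents the organ mask region, $I_{\mathrm{ref}}$ indicates the reference image, $I_{\mathrm{reg}}$ represents the registered image, $\exp(\cdot)$ represents the exponential function, $\lambda$ represents the displacement penalty coefficient, $s$ represents the scale normalization constant, and $\bar{d}$ indicates the average displacement amplitude;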
[0023] When the quality index falls below the preset threshold, a fallback to a conservative strategy or a prompt for resampling is triggered.
[0024] As a preferred implementation method, the specific steps for predicting and identifying risky lesions and providing suggestions are as follows:
[0025] A hybrid network structure with a convolutional front-end and a spatiotemporal Transformer backbone is used as the prediction model. The registered and compensated short-time ultrasound image stabilization sequence is cropped, and the resulting image tensor is used as the model input. Pixel-level segmentation, region-level risk scoring, and case-level confidence estimation are trained in parallel to output a joint prediction result of pixel-level risk heat map, candidate lesion list, and confidence interval. Temperature calibration and Monte Carlo sampling strategies based on uncertainty quantification are introduced to correct the confidence of the prediction results. The pixel-level heat map is mapped into executable probe operation suggestions through regularization. Real-time scanning information guides the operator, improving lesion detection rate and acquisition efficiency.
[0026] The convolutional front end uses lightweight residual blocks to construct a local feature extractor. Each stage consists of a 3×3 convolution, a normalization layer, and SiLU activation. Spatial downsampling is achieved between adjacent stages using a convolution with a stride of 2, resulting in a high-dimensional local feature map that is downsampled step by step and retains details. A hierarchical spatiotemporal Transformer module is used as the backbone. Multi-head self-attention is performed in each layer within a fixed 7×7 window. The window is shifted layer by layer to achieve cross-window information interaction, resulting in a spatiotemporal representation that captures both fine-grained interactions within local windows and long-range dependencies across windows.
[0027] The beneficial effects of this invention are as follows:
[0028] 1. This invention integrates the patient's physiological signals with ultrasound image data, effectively improving image processing quality. When the patient's physiological state is unstable, the system can automatically adjust to ensure stable and clear images.
[0029] 2. Based on automatic image analysis combined with physiological signals, this invention can intelligently predict potential lesion areas, thereby improving the sensitivity and accuracy of early diagnosis.
Attached Figure Description
[0030] Figure 1 is a flowchart of the deep learning-based medical image data processing method.
[0031] Figure 2 is a comparison chart of the effects of the deep learning-based medical image data processing method.
Detailed Implementation
[0032] To make the technical means, creative features, objectives, and effects of this invention easier to understand, the invention is further described below with reference to specific embodiments. However, the following embodiments are merely preferred embodiments of this invention and not all of them. Other embodiments obtained by those skilled in the art based on the embodiments described herein without creative effort are all within the protection scope of this invention.
[0033] Example 1. As shown in Figure 1, a deep learning-based medical image data processing method includes the following steps:
[0034] Collect multimodal data and track the phase to determine steady state;
[0035] Establish a data relationship model and register and compensate for stable image sequences;
[0036] Predict and identify risky lesions, and provide suggestions to assist in guidance.
[0037] The specific steps for acquiring multimodal data and tracking phase to determine steady state are as follows:
[0038] List the modalities to be acquired based on the patient's imaging locations, including ultrasound, ECG, blood oxygenation, respiration, and IMU. Create an acquisition template by specifying the name, settings, corresponding view, and acquisition duration for each modality. Connect the sensors for each modality according to a predetermined topology and deploy hardware for unified triggering, such as a foot switch. Share the time reference and trigger signal at the physical level. Set and lock acquisition parameters on each acquisition device, such as 50–100fps for video, ≥500Hz for ECG, and 100–200Hz for IMU. Write the parameters and device information to the session log to obtain a consistent raw data source and traceable parameter records. Perform a short-term synchronization test before formal acquisition, calculate the relative delay and jitter statistics for each channel, and use the acquired timing deviation for correction and alignment. Acquire views one by one according to a predefined view sequence. Use an event logging tool to synchronously annotate key moments, such as changes in body position, breath-holding, and patient movements, and bind and mark events with image frames.
[0039] Peak detection is applied to the collected data to extract the pulse peak time series. The event phase is defined by linear interpolation between adjacent peaks, as shown in the formula:
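[0040] $$\varphi(t) = 2\pi\,\frac{t - t_n}{t_{n+1} - t_n}, \qquad t_n \le t < t_{n+1},$$
[0041] where $t$ represents time, $t_n$ represents the time of the $n$th reference event, $t_{n+1}$ represents the time of the $(n+1)$th reference event, and $\varphi(t)$ indicates the phase at time $t$;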
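The linear interpolation of event phase can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it assumes the phase runs from 0 to 2π across each interval between adjacent reference events, and the function name `event_phase` is hypothetical.

```python
from bisect import bisect_right
import math


def event_phase(t, event_times):
    """Phase at time t by linear interpolation between adjacent events.

    Assumption: phase grows linearly from 0 to 2*pi between event n and
    event n+1. Returns None when t is not inside [first event, last event).
    """
    i = bisect_right(event_times, t) - 1  # index of the latest event at or before t
    if i < 0 or i >= len(event_times) - 1:
        return None
    t_n, t_next = event_times[i], event_times[i + 1]
    return 2.0 * math.pi * (t - t_n) / (t_next - t_n)


# Hypothetical pulse peak times in seconds.
peaks = [0.0, 0.8, 1.7, 2.5]
halfway = event_phase(0.4, peaks)  # midpoint of the first interval
```

Times outside the detected peak range return `None` here, so a caller can treat them as "phase unknown" instead of extrapolating.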
[0042] By calculating statistics such as the median of instantaneous frequency, the circular variance of phase, and the coefficient of variation of frequency within a sliding window, a quantitative characterization of rhythm stability is obtained. The statistics are compared with preset thresholds, and a stability label is output by combining peak loss rate and abrupt change detection rules. The phase consistency index between different channels is calculated to realize multimodal redundancy verification, which serves as an alternative judgment basis when a single channel is affected by noise. The steady-state judgment result is used for real-time imaging strategy switching, and unstable periods are recorded for subsequent threshold tuning and model training.
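The window statistics named above (median instantaneous frequency, circular variance of phase, and coefficient of variation of frequency) can be illustrated with a short Python sketch. The function names, the threshold values, and the physiological frequency range are all assumptions made for illustration; the source gives no numbers, and the peak loss rate and abrupt-change rules are left out here.

```python
import cmath
import statistics


def rhythm_stats(peak_times, phase_samples):
    """Stability statistics for one sliding window.

    peak_times: peak timestamps (s) inside the window, ascending.
    phase_samples: phase values (rad) whose spread is measured,
    supplied by the caller.
    """
    intervals = [b - a for a, b in zip(peak_times, peak_times[1:])]
    freqs = [1.0 / dt for dt in intervals]  # instantaneous frequency per beat
    mean_vec = sum(cmath.exp(1j * p) for p in phase_samples) / len(phase_samples)
    return {
        "median_freq_hz": statistics.median(freqs),
        "freq_cv": statistics.pstdev(freqs) / statistics.fmean(freqs),
        "circular_variance": 1.0 - abs(mean_vec),
    }


def is_steady(stats, max_cv=0.1, max_circ_var=0.2, freq_range=(0.5, 3.0)):
    """Compare the statistics with (hypothetical) preset thresholds."""
    low, high = freq_range
    return (low <= stats["median_freq_hz"] <= high
            and stats["freq_cv"] <= max_cv
            and stats["circular_variance"] <= max_circ_var)


regular = rhythm_stats([0.0, 1.0, 2.0, 3.0, 4.0], [0.1, 0.1, 0.1])
irregular = rhythm_stats([0.0, 0.5, 2.0, 2.4, 4.0], [0.1, 0.1, 0.1])
```

In practice the thresholds would be tuned from the unstable periods that the method records for later threshold tuning.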
[0043] Specifically, the method of applying peak detection to extract pulse peak time series from the acquired data involves first performing bandpass filtering and baseline removal on the acquired raw time domain signal, then performing amplitude normalization and short-time energy extraction on the filtered signal to obtain a preprocessed signal with concentrated spectrum, comparable amplitude, and easy peak identification. The preprocessed signal is then used to identify candidate peaks by a method of local maximum search combined with adaptive threshold discrimination. The threshold is adaptively set by the moving median of the short-time background energy. Physiological constraints and time series verification are applied to the candidate peaks, with a minimum peak interval of ≥300ms set. False peaks with amplitudes lower than the noise baseline are removed. When multiple channels exist in parallel, a channel consistency verification and priority selection mechanism is used to fuse the peak time series. Finally, the peak time is written into the metadata in a unified time base timestamp format.
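A minimal Python sketch of the candidate-peak step follows, assuming the input is already the preprocessed short-time energy signal. The local maximum search, the moving-median threshold, and the 300 ms minimum interval follow the description; the function name, the threshold multiplier `k`, the window length, and the `noise_floor` value are hypothetical.

```python
import statistics


def detect_peaks(energy, fs, window_s=1.0, k=2.0, noise_floor=0.05,
                 min_interval_s=0.3):
    """Return peak times (s) from a preprocessed short-time energy signal.

    A sample is a candidate if it is a local maximum and exceeds an adaptive
    threshold: k times the moving median of the surrounding energy, but never
    less than noise_floor. Candidates closer than min_interval_s to the
    previously kept peak are merged, keeping the larger one.
    """
    half = max(1, int(window_s * fs / 2))
    min_gap = int(round(min_interval_s * fs))
    kept = []  # (sample index, amplitude)
    for i in range(1, len(energy) - 1):
        if not (energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]):
            continue  # not a local maximum
        background = statistics.median(energy[max(0, i - half): i + half + 1])
        if energy[i] <= max(k * background, noise_floor):
            continue  # below the adaptive threshold or the noise baseline
        if kept and i - kept[-1][0] < min_gap:
            if energy[i] > kept[-1][1]:
                kept[-1] = (i, energy[i])  # keep the stronger of two close peaks
            continue
        kept.append((i, energy[i]))
    return [i / fs for i, _ in kept]


# Synthetic example: three true peaks plus one spurious peak 100 ms after the first.
signal = [0.1] * 300
signal[50], signal[60], signal[150], signal[250] = 1.0, 0.9, 1.0, 1.0
peak_times = detect_peaks(signal, fs=100)
```

Multi-channel fusion and the write-back of timestamps to metadata are omitted from this sketch.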
[0044] The specific steps for establishing a data relationship model and registering and compensating stable image sequences are as follows:
[0045] Target use cases and access boundaries are identified in collaboration with clinical professionals, and a list of functional requirements is compiled to drive model design. The requirements are mapped to conceptual entities, and ER diagrams are drawn. The collected data are split according to conceptual entities, and each entity is assigned a globally unique identifier. A unified time base is used in all entities and metadata, and the sampling start time is written into each entity record. Key fields, such as acquisition parameters, equipment information, event streams, and quality inspection results, are validated. Foreign keys connect the different levels of the relational model, and indexes are created for frequently queried fields. A tiered storage strategy is adopted: object storage for large files, a relational database for structured metadata, a time-series database for signal indexing, and a graph database for complex traceability queries. Quality control and version management fields are embedded in the model, query views are designed for upper-level applications, and a standardized API is exposed using a REST architecture.
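The relational tier of such a model might look like the following SQLite sketch. It covers only the structured-metadata part of the storage strategy. All table names, column names, and the list of allowed modality codes are hypothetical; the source does not specify a schema.

```python
import sqlite3
import uuid

# Hypothetical schema: session -> stream -> event, linked by foreign keys,
# with every timestamp expressed on one time base (nanoseconds).
SCHEMA = """
CREATE TABLE session (
    session_id    TEXT PRIMARY KEY,
    start_time_ns INTEGER NOT NULL
);
CREATE TABLE stream (
    stream_id      TEXT PRIMARY KEY,
    session_id     TEXT NOT NULL REFERENCES session(session_id),
    modality       TEXT NOT NULL
        CHECK (modality IN ('ultrasound', 'ecg', 'spo2', 'respiration', 'imu')),
    sample_rate_hz REAL NOT NULL CHECK (sample_rate_hz > 0),
    start_time_ns  INTEGER NOT NULL,
    object_uri     TEXT  -- pointer to the large file in object storage
);
CREATE TABLE event (
    event_id    TEXT PRIMARY KEY,
    stream_id   TEXT NOT NULL REFERENCES stream(stream_id),
    time_ns     INTEGER NOT NULL,
    label       TEXT NOT NULL,
    frame_index INTEGER
);
CREATE INDEX idx_event_time ON event(stream_id, time_ns);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def add_session(conn, start_time_ns):
    session_id = str(uuid.uuid4())  # globally unique identifier
    conn.execute("INSERT INTO session VALUES (?, ?)", (session_id, start_time_ns))
    return session_id


def add_stream(conn, session_id, modality, sample_rate_hz, start_time_ns,
               object_uri=None):
    stream_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO stream VALUES (?, ?, ?, ?, ?, ?)",
        (stream_id, session_id, modality, sample_rate_hz, start_time_ns, object_uri),
    )
    return stream_id
```

The `CHECK` constraints stand in for key-field validation, and the composite index supports time-ordered event lookups within a stream.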
[0046] A phase correlation method is used to perform fast global rigid estimation between each frame of the image data and a reference frame, and an affine transformation is applied for initial alignment to eliminate large-scale translation and rotation differences. Specifically, the phase correlation method applies a Fourier transform to the two images, calculates the normalized cross-power spectrum, and applies an inverse transform. The position of the resulting spatial-domain peak gives the translational displacement between the images directly, providing fast rigid alignment with sub-pixel accuracy that is robust to intensity changes. A multi-scale pyramid dense optical flow network then estimates the displacement field of the coarsely registered frame pairs, yielding a dense displacement field that describes pixel-level local displacement. The dense optical flow network estimates and iteratively refines the displacement of each pixel at multiple scales of the image pyramid, from coarse to fine, and generates a high-precision dense displacement field from features extracted by deep neural networks together with optical flow constraints. Spatial smoothness regularization and temporal consistency penalty terms are added to the displacement field estimation, and the constrained optimization is solved by iterative minimization. For each frame, a reverse mapping strategy applies the displacement field to the original pixels to achieve inverse transformation compensation. Within a short-time sliding window, the compensated frames are fused over time with weights based on the displacement field confidence and registration residual, improving the boundary sharpness and lesion visibility of the image sequence.
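The phase correlation step can be illustrated with a small, self-contained Python sketch. It is not the patented implementation: it recovers integer circular translations only (no sub-pixel refinement and no rotation), and it uses a naive discrete Fourier transform so that it needs no third-party library. A practical version would use an FFT routine. The function names are hypothetical.

```python
import cmath


def _dft1(values, sign):
    """Naive 1-D DFT; sign=-1 for forward, +1 for inverse (unscaled)."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(sign * 2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def dft2(image, inverse=False):
    """Separable 2-D DFT over rows, then columns."""
    sign = 1 if inverse else -1
    h, w = len(image), len(image[0])
    rows = [_dft1(row, sign) for row in image]
    cols = [_dft1([rows[y][x] for y in range(h)], sign) for x in range(w)]
    scale = h * w if inverse else 1
    return [[cols[x][y] / scale for x in range(w)] for y in range(h)]


def phase_correlation_shift(ref, mov):
    """Estimate (dy, dx) such that mov is ref circularly shifted by (dy, dx)."""
    h, w = len(ref), len(ref[0])
    f_ref, f_mov = dft2(ref), dft2(mov)
    cross = []
    for y in range(h):
        row = []
        for x in range(w):
            c = f_mov[y][x] * f_ref[y][x].conjugate()
            row.append(c / abs(c) if abs(c) > 1e-12 else 0j)  # keep phase only
        cross.append(row)
    corr = dft2(cross, inverse=True)
    py, px = max(((y, x) for y in range(h) for x in range(w)),
                 key=lambda p: corr[p[0]][p[1]].real)
    # Peaks past the midpoint correspond to negative shifts.
    dy = py if py <= h // 2 else py - h
    dx = px if px <= w // 2 else px - w
    return dy, dx


def circular_shift(image, dy, dx):
    h, w = len(image), len(image[0])
    return [[image[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


ref = [[0.0] * 8 for _ in range(8)]
ref[2][3], ref[2][4], ref[3][3] = 1.0, 0.5, 0.25
mov = circular_shift(ref, 2, -3)
shift = phase_correlation_shift(ref, mov)
```

Because only the phase of the cross-power spectrum is kept, the estimate does not depend on a global brightness scaling between the two frames.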
[0047] For each output stabilized frame, a quality index is calculated, and the displacement statistics and quality index results are written into the metadata. The quality index calculation formula is as follows:
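[0048] $$Q = \mathrm{SSIM}_{D}\!\left(I_{\mathrm{ref}}, I_{\mathrm{reg}}\right)\cdot\exp\!\left(-\lambda\,\frac{\bar{d}}{s}\right),$$
[0049] where $\mathrm{SSIM}_{D}(\cdot)$ represents the normalized weighted structural similarity, $D$ represents the organ mask region, $I_{\mathrm{ref}}$ indicates the reference image, $I_{\mathrm{reg}}$ represents the registered image, $\exp(\cdot)$ represents the exponential function, $\lambda$ indicates the displacement penalty coefficient, $s$ represents the scale normalization constant, and $\bar{d}$ indicates the average displacement amplitude;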
[0050] When the quality index falls below the preset threshold, a fallback to a conservative strategy or a prompt for resampling will be triggered.
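The quality check and fallback can be sketched as below. The sketch assumes that the index is the product of a similarity term and an exponential penalty on the mean displacement, which matches the listed symbols but is an assumption about the exact form. The similarity value is taken as an input instead of being computed, and the default penalty, scale, and threshold values are hypothetical.

```python
import math


def mean_displacement(field):
    """Average magnitude of a displacement field given as (dx, dy) pairs."""
    return sum(math.hypot(dx, dy) for dx, dy in field) / len(field)


def quality_index(similarity, mean_disp, penalty=1.0, scale=10.0):
    """Similarity inside the organ mask, discounted by an exponential
    penalty on the scale-normalized mean displacement (assumed form)."""
    return similarity * math.exp(-penalty * mean_disp / scale)


def next_action(q, threshold=0.6):
    """Accept the stabilized frame, or fall back / prompt for re-acquisition."""
    return "accept" if q >= threshold else "fallback_or_reacquire"


small_motion = quality_index(0.9, mean_displacement([(3.0, 4.0), (0.0, 0.0)]))
large_motion = quality_index(0.9, 10.0)
```

With these illustrative defaults, a frame with high similarity but a large average displacement is still rejected, which is the behaviour the displacement penalty is meant to produce.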
[0051] The specific steps for predicting and identifying risky lesions and providing suggestions are as follows:
[0052] A hybrid network structure with a convolutional front-end and a spatiotemporal Transformer backbone is used as the prediction model. The short-time sequence of registered and compensated stable ultrasound images is cropped, and the resulting image tensor is used as the model input. Pixel-level segmentation, region-level risk scoring, and case-level confidence estimation are trained in parallel to output a joint prediction result of pixel-level risk heat map, candidate lesion list, and confidence interval. Temperature calibration and Monte Carlo sampling strategies based on uncertainty quantification are introduced to correct the confidence of the prediction results. The pixel-level heat map is regularized and mapped into actionable probe operation suggestions, such as movement direction, angle, and pressure prompts. Real-time scanning information guides the operator, improving lesion detection rate and acquisition efficiency.
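The confidence correction can be illustrated independently of the network. The sketch below shows temperature scaling of logits and an empirical interval computed from repeated stochastic forward passes. It assumes the Monte Carlo samples come from something like dropout kept active at inference, which the source does not state, and the function names are hypothetical.

```python
import math
import statistics


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mc_interval(sample_probs, lower_q=0.025, upper_q=0.975):
    """Mean and empirical percentile interval of Monte Carlo probability samples."""
    xs = sorted(sample_probs)

    def quantile(p):
        idx = p * (len(xs) - 1)
        lo = int(math.floor(idx))
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (idx - lo)  # linear interpolation

    return statistics.fmean(xs), (quantile(lower_q), quantile(upper_q))


raw = softmax([2.0, 0.0])                        # uncalibrated probabilities
calibrated = softmax([2.0, 0.0], temperature=2.0)  # softened by temperature
mean_p, (low_p, high_p) = mc_interval([0.6, 0.7, 0.8, 0.75, 0.65])
```

The temperature itself would normally be fitted on a held-out validation set; the value 2.0 here is only for illustration.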
[0053] Specifically, the hybrid network structure of the convolutional front end plus the spatiotemporal Transformer backbone uses lightweight residual blocks to construct a local feature extractor in the convolutional front end. Each level consists of a 3×3 convolution, a normalization layer, and SiLU activation. Spatial downsampling is achieved between two levels using convolution with a stride of 2, resulting in a high-dimensional local feature map that is downsampled step by step and retains details. A hierarchical spatiotemporal Transformer module is used as the backbone. Multi-head self-attention is performed in each layer with a fixed 7×7 window. Window movement is performed layer by layer to achieve cross-window information interaction, resulting in a spatiotemporal representation that takes into account both fine-grained interactions within local windows and long-range dependency modeling across windows.
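How shifting the window from layer to layer lets neighbouring windows exchange information can be shown with a small index-only sketch. It assigns each pixel to a 7×7 window, optionally after a cyclic shift. The shift of 3 pixels (half a window, rounded down) and the function names are assumptions, since the source says only that the window moves layer by layer. The attention computation itself and the masking of wrapped-around regions are omitted.

```python
def window_ids(height, width, window=7, shift=0):
    """Window id of every pixel after a cyclic shift of the feature map."""
    if height % window or width % window:
        raise ValueError("height and width must be multiples of the window size")
    per_row = width // window
    return [
        [((y + shift) % height) // window * per_row + ((x + shift) % width) // window
         for x in range(width)]
        for y in range(height)
    ]


def share_window(ids, a, b):
    """True if pixels a and b (row, col) attend to each other in this layer."""
    return ids[a[0]][a[1]] == ids[b[0]][b[1]]


plain = window_ids(14, 14, window=7, shift=0)    # e.g. an even-numbered layer
shifted = window_ids(14, 14, window=7, shift=3)  # e.g. the following layer
```

Pixels (6, 6) and (7, 7) sit on opposite sides of a window border in the unshifted layout but fall into the same window after the shift, so alternating the two layouts connects adjacent windows.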
[0054] Figure 2 is a comparison chart of the effects of the deep learning-based medical image data processing method. The horizontal axis lists key performance indicators, and the vertical axis shows illustrative performance scores ranging from 0 to 100, where higher values indicate better performance. The chart is intended to illustrate the expected improvement of the invention in key capabilities compared with typical prior art.
[0055] Example 2, based on Example 1 above, presents a deep learning-based medical image data processing method for primary healthcare image-assisted diagnosis scenarios, specifically as follows:
[0056] Step 1: At the start of acquisition, a cross-device consensus marker is generated by combining audio pulses with screen flashes, and a frame-level synchronization reference is obtained by software-level cross-correlation correction. A wavelet packet transform is applied to the ECG and photoelectric signals, and peaks are detected from abrupt energy changes to obtain the pulse peak time series. The instantaneous phase between adjacent peaks is calculated using the Hilbert analytic-signal method. Within a sliding window, a stability index is constructed from the spectral entropy and the median instantaneous frequency to quantify the steady state of the rhythm. The phase-locking value is calculated across the multi-channel phases for redundancy verification. The steady-state judgment serves as the state machine input that automatically adjusts imaging parameter settings, such as frame fusion intensity and denoising level, so that image quality is adaptively optimized at runtime. The timing deviation, peak times, phase curve, and judgment results are recorded in the sidecar metadata for offline calibration.
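The phase-locking value used for redundancy verification has a compact standard definition: the magnitude of the mean unit vector of the phase difference between two channels. A minimal Python sketch follows; the function name and the example phase values are hypothetical.

```python
import cmath
import math


def phase_locking_value(phase_a, phase_b):
    """PLV in [0, 1]: 1 means a constant phase difference between the
    two channels, values near 0 mean no consistent phase relation."""
    if len(phase_a) != len(phase_b) or not phase_a:
        raise ValueError("phase series must be non-empty and of equal length")
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phase_a, phase_b))
    return abs(total / len(phase_a))


# Two channels with a constant lag of 0.3 rad (for example ECG and PPG phase).
ecg_phase = [0.0, 1.0, 2.0, 3.0]
ppg_phase = [p - 0.3 for p in ecg_phase]
locked = phase_locking_value(ecg_phase, ppg_phase)

# Phase differences spread evenly around the circle.
unlocked = phase_locking_value(
    [0.0] * 4, [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
)
```

A constant lag between channels, such as the pulse transit delay, does not lower the value, because only the consistency of the difference matters.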
[0057] Step 2: Entity modeling is performed based on the established use cases, and verifiable metadata are formed by recording the sampling start time and acquisition parameters for each entity. A rigid registration scheme based on local feature descriptors is adopted: key points are extracted using SuperPoint, and global alignment is obtained by solving for the affine transformation with RANSAC. Dense displacement fields are estimated by mutual-information-based B-spline free-form deformation optimization, with temporal consistency constraints introduced into the energy term to obtain a smooth displacement sequence. Weighted median fusion and local contrast enhancement are applied to the compensated frames to obtain a stable image sequence. A comprehensive quality score is constructed using mutual information as the similarity term with an exponential penalty on the average displacement, and it is used to automatically determine the fallback conditions.
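Weighted median fusion of compensated frames can be sketched per pixel as follows. The frames are plain nested lists and the weights are supplied by the caller, for example from registration confidence. The function names and the sample values are hypothetical, and the local contrast enhancement step is not shown.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value
    return pairs[-1][0]  # only reached through floating-point round-off


def fuse_frames(frames, weights):
    """Fuse equally sized frames pixel by pixel with a weighted median."""
    h, w = len(frames[0]), len(frames[0][0])
    return [
        [weighted_median([f[y][x] for f in frames], weights) for x in range(w)]
        for y in range(h)
    ]


# Three compensated 1x2 frames; the third has an outlier and a low weight.
frames = [[[10, 5]], [[12, 5]], [[200, 5]]]
fused = fuse_frames(frames, weights=[1.0, 1.0, 0.2])
```

Unlike a weighted mean, the median is not pulled toward the outlier in the poorly registered frame, which is a plausible reason for choosing it in this embodiment.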
[0058] Step 3: The registered and compensated stable short sequence input is subjected to layer-by-layer 3D convolution to extract volumetric spatial features, resulting in a high-dimensional volumetric representation that takes into account both local texture and organ morphology. Causal convolutional layers are then concatenated on the spatial features to recursively model the time axis. Phase conditional gating is introduced in the temporal module to map the frame-level physiological phase to channel scale and offset factor and apply it to the temporal features. By setting two outputs, segmentation and regression, at the feature fusion point, pixel-level risk heatmaps and region-level risk scores are generated respectively. During inference, Monte Carlo sampling is used to estimate uncertainty and temperature scaling is used to calibrate the probability, resulting in calibrated confidence intervals for clinical threshold decision-making. The regularized heatmap is mapped to executable probe operation suggestions and written into session metadata to obtain outputs that can drive real-time scanning guidance.
[0059] The embodiments of the present invention described above are subject to modification and change of method by those skilled in the art without departing from the embodiments and broader aspects of the present invention. The appended claims are intended to include all such modifications and changes of method that do not depart from the present invention.
Claims
1. A deep learning-based method for processing medical image data, characterized in that it comprises: collecting multimodal data and tracking phase to determine steady state, wherein data are acquired synchronously according to an acquisition template, pulse timing is extracted by adaptive-threshold peak detection, the instantaneous phase is calculated by linear interpolation between adjacent events, statistics are calculated within a sliding window, the quantitative indicators are compared with preset thresholds, steady state is determined by combining the peak loss rate and abrupt-change detection rules, and the phase consistency between channels is calculated to verify the result; establishing a data relationship model and registering and compensating the image sequence to stabilize it, wherein the acquired data are split according to conceptual entities, the acquisition records are saved to form the data relationship model, fast global rigid estimation is performed on the image sequence, an affine transformation is applied to complete coarse registration, a multi-scale pyramid dense optical flow network is used to estimate the pixel-level displacement field, the displacement field is applied to the original pixels, and temporal fusion is performed with the optical flow confidence and registration residual as weights; and predicting and identifying risky lesions and providing suggestions for auxiliary guidance, wherein the registered and compensated short-time stabilized image sequence is cropped and normalized into an image tensor, the tensor is mapped frame by frame to a conditional vector that is aligned with the image tensor, a hybrid network with a convolutional front end and a spatiotemporal Transformer backbone is used as the prediction model to output a pixel-level risk heat map, a list of candidate lesions, and confidence intervals for each output, and the structured metadata are transmitted back to the operator's terminal in real time to provide the decision basis for real-time scanning guidance.
2. The medical image data processing method based on deep learning according to claim 1, characterized in that: The specific steps for acquiring multimodal data and tracking phase to determine steady state are as follows: Based on the patient's imaging locations, list the modalities to be acquired, including ultrasound, ECG, blood oxygenation, respiration, and IMU. Create an acquisition template by specifying the name, settings, corresponding view, and acquisition duration for each modality. Connect the sensors for each modality according to a predetermined topology and deploy hardware for unified triggering of acquisition. Share the time reference and trigger signal at the physical level. Set and lock the acquisition parameters on each acquisition device and write the parameters and device information to the session log to obtain a consistent raw data source and traceable parameter records. Before formal acquisition, perform a short-term synchronization test to calculate the relative delay and jitter statistics for each channel. Use the acquired timing deviation for correction and alignment. Acquire views one by one according to a predefined view sequence. Use an event logging tool to synchronously annotate key moments and bind and mark events with image frames.
3. The medical image data processing method based on deep learning according to claim 2, characterized in that the specific steps of collecting multimodal data and tracking phase to determine steady state further include: peak detection is applied to the collected data to extract the pulse peak time series, and the event phase is defined by linear interpolation between adjacent peaks according to the formula:
$$\varphi(t) = 2\pi \cdot \frac{t - t_n}{t_{n+1} - t_n}, \qquad t_n \le t < t_{n+1},$$
where $t$ denotes time, $t_n$ denotes the time of the $n$-th reference event, $t_{n+1}$ denotes the time of the $(n+1)$-th reference event, and $\varphi(t)$ denotes the phase at time $t$.
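The linear phase interpolation between adjacent reference events can be sketched as follows. This is an illustration only: it assumes the phase is expressed in radians on [0, 2π) and resets at every event, and the function name `event_phase` is hypothetical.

```python
import bisect
import math

def event_phase(t, event_times):
    """Phase in [0, 2*pi) at time t, by linear interpolation between the two
    reference events that bracket t. Returns None outside the covered range.
    event_times must be sorted in ascending order."""
    i = bisect.bisect_right(event_times, t) - 1   # index of the last event <= t
    if i < 0 or i + 1 >= len(event_times):
        return None                               # no bracketing pair of events
    t_n, t_next = event_times[i], event_times[i + 1]
    return 2.0 * math.pi * (t - t_n) / (t_next - t_n)

# Hypothetical pulse peak times in seconds.
peaks = [0.0, 0.8, 1.7]
phase_mid = event_phase(0.4, peaks)   # halfway through the first cycle
```

Because the phase is normalized per cycle, frames from cycles of different lengths can be compared at the same phase value.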
4. The medical image data processing method based on deep learning according to claim 2, characterized in that the specific steps of collecting multimodal data and tracking phase to determine steady state further include: statistics are calculated within a sliding window to obtain a quantitative characterization of rhythm stability; the statistics are compared with preset thresholds, and a stability label is output by combining the peak loss rate with abrupt-change detection rules; a phase consistency index between channels is calculated to provide multimodal redundancy verification, serving as an alternative basis for judgment when a single channel is corrupted by noise; and the steady-state determination is used to switch the imaging strategy in real time, with unstable periods recorded for subsequent threshold tuning and model training.
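A minimal sketch of the sliding-window steady-state decision is given below. The claim does not name the statistics or thresholds, so the choices here are assumptions: the coefficient of variation of inter-peak intervals as the stability statistic, the largest successive interval change as the abrupt-change rule, and intervals longer than 1.5 times the window median as a proxy for lost peaks. All names and default values are hypothetical.

```python
from statistics import mean, median, pstdev

def stability_labels(peak_times, window=5, cv_max=0.10, jump_max=0.20, loss_max=0.10):
    """Return one boolean per sliding window of inter-peak intervals:
    True when the rhythm in that window is judged stable."""
    intervals = [b - a for a, b in zip(peak_times, peak_times[1:])]
    labels = []
    for i in range(len(intervals) - window + 1):
        w = intervals[i:i + window]
        m = mean(w)
        cv = pstdev(w) / m                                   # rhythm variability
        jump = max(abs(b - a) for a, b in zip(w, w[1:])) / m  # abrupt change
        # An interval much longer than the median suggests a missed peak.
        lost = sum(1 for x in w if x > 1.5 * median(w)) / window
        labels.append(cv <= cv_max and jump <= jump_max and lost <= loss_max)
    return labels

# Hypothetical peak times: regular 0.8 s rhythm with one missed beat near 4.8 s.
peaks = [0.0, 0.8, 1.6, 2.4, 3.2, 4.0, 5.6, 6.4]
labels = stability_labels(peaks)
```

Windows that contain the doubled interval are labeled unstable, which is the point at which an imaging strategy switch would be triggered.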
5. The medical image data processing method based on deep learning according to claim 2, characterized in that the specific steps of collecting multimodal data and tracking phase to determine steady state further include: the acquired raw time-domain signal is first band-pass filtered and its baseline removed; the filtered signal is amplitude-normalized and its short-time energy is extracted, yielding a preprocessed signal with a concentrated spectrum, comparable amplitude, and easily identifiable peaks; candidate peaks are identified in the preprocessed signal by combining local-maximum search with adaptive-threshold discrimination, the threshold being set adaptively from the moving median of the short-time background energy; physiological constraints and temporal verification are applied to the candidate peaks, with a minimum peak interval of 300 ms, and false peaks whose amplitude is below the noise baseline are removed; when multiple channels are present in parallel, the peak time series are fused by a channel-consistency check and priority selection mechanism; and the peak timestamps are written to the metadata in a unified time-base timestamp format.
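The candidate-peak stage can be sketched as below, operating on an already preprocessed short-time energy signal. It is an illustration under assumptions: the threshold is taken as a multiple `k` of the moving median, a fixed `noise_floor` stands in for the noise baseline, and within the 300 ms minimum interval the stronger peak is kept. Filtering, normalization and multi-channel fusion are omitted, and all names and defaults are hypothetical.

```python
from statistics import median

def detect_peaks(energy, fs, bg_window=1.0, k=3.0, noise_floor=0.05, min_interval=0.3):
    """Return peak times in seconds from a short-time energy signal sampled at fs Hz."""
    half = max(1, int(bg_window * fs / 2))
    peaks = []                                   # list of (time, amplitude)
    for i in range(1, len(energy) - 1):
        if not (energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]):
            continue                             # not a local maximum
        background = median(energy[max(0, i - half):i + half + 1])
        if energy[i] <= max(k * background, noise_floor):
            continue                             # below the adaptive threshold
        t = i / fs
        if peaks and t - peaks[-1][0] < min_interval:
            if energy[i] > peaks[-1][1]:
                peaks[-1] = (t, energy[i])       # keep the stronger of the two
            continue                             # enforce the minimum interval
        peaks.append((t, energy[i]))
    return [t for t, _ in peaks]

# Hypothetical signal: flat background, three true pulses, one weaker
# false peak 100 ms after the second pulse.
fs = 100
energy = [0.1] * 300
for idx, amp in ((50, 1.0), (130, 1.0), (140, 0.6), (210, 1.0)):
    energy[idx] = amp
peak_times = detect_peaks(energy, fs)
```

The false peak at 1.4 s passes the amplitude threshold but is rejected by the 300 ms physiological constraint.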
6. The medical image data processing method based on deep learning according to claim 1, characterized in that the specific steps of establishing a data relationship model and registering and compensating to obtain a stabilized image sequence are as follows: target use cases and access boundaries are identified together with clinical professionals, and a list of functional requirements is compiled to drive the model design; the requirements are mapped to conceptual entities and ER diagrams are drawn; the collected data are split according to the conceptual entities, and each entity is assigned a globally unique identifier; a unified time-base representation is used across all entities and metadata, the sampling start time is written into each entity record, and key fields are validated; foreign keys connect the different levels of the relational model, indexes are created for frequently retrieved fields, and a tiered storage strategy is adopted, with object storage for large files, a relational database for structured metadata, a time-series database for signal indexing, and a graph database for complex tracing queries; and quality control and version management fields are embedded in the model, query views are designed for upper-layer applications, and a standardized API is encapsulated using a REST architecture.
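The relational tier of such a model can be sketched with an in-memory SQLite database. The two tables, their columns, the modality codes and the object URI are hypothetical placeholders chosen for illustration; the claim specifies only globally unique identifiers, a unified time base, foreign keys, indexes, and embedded quality control and version fields.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE session (
    session_id        TEXT PRIMARY KEY,            -- globally unique identifier
    patient_ref       TEXT NOT NULL,
    start_time_utc_ns INTEGER NOT NULL             -- unified time base
);
CREATE TABLE acquisition (
    acquisition_id    TEXT PRIMARY KEY,
    session_id        TEXT NOT NULL REFERENCES session(session_id),
    modality          TEXT NOT NULL
        CHECK (modality IN ('ultrasound', 'ecg', 'spo2', 'respiration', 'imu')),
    start_time_utc_ns INTEGER NOT NULL,            -- sampling start time
    object_uri        TEXT NOT NULL,               -- large file kept in object storage
    qc_status         TEXT NOT NULL DEFAULT 'pending',
    schema_version    INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_acq_session_modality ON acquisition(session_id, modality);
""")

def new_id():
    return str(uuid.uuid4())

session_id = new_id()
conn.execute("INSERT INTO session VALUES (?, ?, ?)",
             (session_id, "patient-001", 1_700_000_000_000_000_000))
conn.execute(
    "INSERT INTO acquisition (acquisition_id, session_id, modality, "
    "start_time_utc_ns, object_uri) VALUES (?, ?, ?, ?, ?)",
    (new_id(), session_id, "ultrasound", 1_700_000_000_500_000_000,
     "s3://example-bucket/clip-0001"))

rows = conn.execute(
    "SELECT modality, qc_status FROM acquisition WHERE session_id = ?",
    (session_id,)).fetchall()
```

With foreign keys enabled, an acquisition record that points at a nonexistent session is rejected, which is one form of the key-field validation mentioned above.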
7. The medical image data processing method based on deep learning according to claim 6, characterized in that the specific steps of establishing a data relationship model and registering and compensating to obtain a stabilized image sequence further include: a phase correlation method is used to perform fast global rigid estimation between each frame of the image data and the reference frame, and an affine transformation is applied for initial alignment to eliminate large-scale translation and rotation differences; a multi-scale pyramid dense optical flow network then estimates the displacement field of the coarsely registered frame pairs, yielding a dense displacement field that describes pixel-level local displacement; spatial smoothness regularization and temporal consistency penalty terms are added to the displacement field estimation, and the constrained optimization is solved iteratively; an inverse mapping strategy is used for each frame to apply the displacement field to the original pixels, achieving inverse-transform compensation; and, within a short-time sliding window, the compensated frames are fused over time with weights based on the displacement field confidence and the registration residual, improving the boundary sharpness and lesion visibility of the image sequence.
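The final weighted temporal fusion step can be sketched as follows. The claim states only that confidence and registration residual act as weights; the specific form used here, confidence multiplied by a decaying exponential of the residual, is an assumption, as is the simplification to one weight per frame instead of one per pixel. The names `fuse_window` and `beta` are hypothetical.

```python
import math

def fuse_window(frames, confidences, residuals, beta=1.0):
    """Fuse compensated frames from one sliding window.

    frames:      list of equally sized 2D lists of pixel values
    confidences: per-frame optical flow confidence in [0, 1]
    residuals:   per-frame registration residual (>= 0)
    Returns (fused_frame, normalized_weights).
    """
    raw = [c * math.exp(-beta * r) for c, r in zip(confidences, residuals)]
    total = sum(raw)
    if total == 0:
        raise ValueError("all frames have zero weight")
    weights = [w / total for w in raw]
    rows, cols = len(frames[0]), len(frames[0][0])
    fused = [[sum(w * f[y][x] for w, f in zip(weights, frames))
              for x in range(cols)]
             for y in range(rows)]
    return fused, weights

# Two hypothetical 1x2 frames with equal confidence and zero residual.
fused, weights = fuse_window([[[10.0, 0.0]], [[20.0, 4.0]]], [1.0, 1.0], [0.0, 0.0])
```

Frames with low flow confidence or a large residual contribute little, so a poorly registered frame blurs the fused result less.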
8. The medical image data processing method based on deep learning according to claim 6, characterized in that the specific steps of establishing a data relationship model and registering and compensating to obtain a stabilized image sequence further include: for each output stabilized frame, a quality index is calculated, and the displacement statistics and the quality index are written into the metadata, the quality index being calculated as:
$$Q = \mathrm{SSIM}_{D}\left(I_{\mathrm{ref}}, I_{\mathrm{reg}}\right) \cdot \exp\left(-\lambda \, \frac{\bar{d}}{s}\right),$$
where $\mathrm{SSIM}_{D}$ denotes the normalized weighted structural similarity computed over the organ mask region $D$, $I_{\mathrm{ref}}$ denotes the reference image, $I_{\mathrm{reg}}$ denotes the registered image, $\exp$ denotes the exponential function, $\lambda$ denotes the displacement penalty coefficient, $s$ denotes the scale normalization constant, and $\bar{d}$ denotes the average displacement amplitude; when the quality index falls below the preset threshold, a fallback to a conservative strategy or a prompt for re-acquisition is triggered.
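A small sketch of the quality gate follows. It assumes the quality index is the masked structural similarity multiplied by an exponential penalty on the average displacement amplitude, divided by a scale constant, which is one reading consistent with the symbols defined in the claim. The similarity value is taken as an input, and the coefficient values, threshold and function names are hypothetical.

```python
import math

def mean_displacement(field):
    """Average displacement amplitude of a list of (dx, dy) vectors."""
    return sum(math.hypot(dx, dy) for dx, dy in field) / len(field)

def quality_index(ssim_masked, mean_disp, lam=1.0, scale=10.0):
    """Similarity inside the organ mask, discounted by a displacement penalty."""
    return ssim_masked * math.exp(-lam * mean_disp / scale)

def decide(q, q_min=0.6):
    """Accept the stabilized frame, or fall back / prompt for re-acquisition."""
    return "accept" if q >= q_min else "fallback_or_resample"

# Hypothetical frame: good similarity and a small displacement field.
d_bar = mean_displacement([(3.0, 4.0), (0.0, 0.0)])
q = quality_index(0.9, d_bar)
action = decide(q)
```

A frame can therefore be rejected even at high similarity if it needed a large correction, since large displacements make the compensation itself less trustworthy.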
9. The medical image data processing method based on deep learning according to claim 1, characterized in that the specific steps of predicting and identifying risky lesions and providing guidance suggestions are as follows: a hybrid network with a convolutional front end and a spatiotemporal Transformer backbone is used as the prediction model; the registered and compensated short-time stabilized ultrasound image sequence is cropped, and the resulting image tensor is used as the model input; pixel-level segmentation, region-level risk scoring, and case-level confidence estimation are trained in parallel to output a joint prediction consisting of a pixel-level risk heat map, a candidate lesion list, and confidence intervals; temperature calibration and Monte Carlo sampling strategies based on uncertainty quantification are introduced to correct the confidence of the prediction results; and the pixel-level heat map is converted by rule-based mapping into actionable probe operation suggestions, providing the operator with real-time scanning guidance and improving the lesion detection rate and acquisition efficiency.
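The confidence correction can be illustrated with a minimal sketch that combines temperature scaling of a logit with a summary over Monte Carlo samples. It assumes a single binary risk logit per region, a temperature fitted beforehand on validation data, and a normal-approximation interval around the sample mean; these choices, the function names and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev

def calibrated_prob(logit, temperature):
    """Sigmoid of the temperature-scaled logit; temperature > 1 softens confidence."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def mc_summary(sample_logits, temperature=1.5, z=1.96):
    """Mean calibrated probability and an approximate interval from
    repeated stochastic forward passes (for example with dropout active)."""
    probs = [calibrated_prob(l, temperature) for l in sample_logits]
    m, s = mean(probs), pstdev(probs)
    return m, (max(0.0, m - z * s), min(1.0, m + z * s))

# Hypothetical logits for one candidate lesion from four sampled passes.
risk, (low, high) = mc_summary([1.0, 1.2, 0.8, 1.1])
```

A wide interval signals disagreement between passes and could be used to rank which candidate lesions deserve a rescan.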
10. The medical image data processing method based on deep learning according to claim 9, characterized in that the specific steps of predicting and identifying risky lesions and providing guidance suggestions further include: the convolutional front end uses lightweight residual blocks to construct a local feature extractor, each stage consisting of a 3×3 convolution, a normalization layer, and SiLU activation; spatial downsampling between adjacent stages is performed by convolution with a stride of 2, yielding high-dimensional local feature maps that are downsampled stage by stage while retaining detail; and a hierarchical spatiotemporal Transformer module is used as the backbone, in which multi-head self-attention is computed in each layer within a fixed 7×7 window and the windows are shifted layer by layer to enable cross-window information exchange, yielding a spatiotemporal representation that captures both fine-grained interactions within local windows and long-range dependencies across windows.
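How shifting the windows between layers lets information cross window borders can be shown with a small index sketch. It assumes a cyclic shift of the feature map before partitioning, in the style of shifted-window attention, and covers the spatial dimensions only; the attention computation, the masking of wrapped positions and the temporal axis are omitted. The function name and the shift of 3 are hypothetical.

```python
def window_ids(height, width, window=7, shift=0):
    """Assign every feature-map position to an attention window index
    after cyclically shifting the map by `shift` positions in y and x."""
    per_row = -(-width // window)            # number of windows per row (ceiling)
    ids = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = (y - shift) % height        # position after the cyclic shift
            xs = (x - shift) % width
            row.append((ys // window) * per_row + xs // window)
        ids.append(row)
    return ids

# On a 14x14 map, positions (6, 6) and (7, 7) are neighbors on opposite
# sides of a window border in the unshifted layer.
plain = window_ids(14, 14, window=7, shift=0)
shifted = window_ids(14, 14, window=7, shift=3)
```

In the unshifted layer the two positions never attend to each other; in the shifted layer they share a window, which is the cross-window interaction the claim describes.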