Medical image analysis method for early screening of cardiovascular disease

By using the HeartCare-Net hybrid architecture neural network, which combines multi-scale dilated convolution and axial attention modules, the problems of weak global modeling ability, heavy computational burden and insufficient multi-scale feature capture in early screening of cardiovascular diseases are solved, and efficient and accurate early lesion detection and diagnosis are achieved.

CN122289147A · Pending · Publication Date: 2026-06-26 · NANTONG COLLEGE OF SCIENCE & TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANTONG COLLEGE OF SCIENCE & TECHNOLOGY
Filing Date
2026-03-12
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Current early screening methods for cardiovascular diseases rely on manual interpretation by physicians, which is inefficient and highly subjective. Deep learning-based image analysis methods have insufficient global modeling capabilities, complex attention mechanisms, and inadequate multi-scale feature extraction, resulting in low early lesion detection rates and making them difficult to deploy efficiently in real-world clinical settings.

Method used

We employ a HeartCare-Net hybrid neural network architecture, combining convolutional neural networks and Transformers. We enhance feature extraction through multi-scale dilated convolutions and axial attention modules, reduce computational complexity using a proxy attention mechanism, and aggregate multi-scale features for disease probability prediction through a pyramid pooling strategy.

Benefits of technology

It significantly improves the detection rate and diagnostic accuracy of early lesions, achieves lightweight and efficient inference, has good clinical applicability and deployability, and is suitable for large-scale population screening and primary healthcare auxiliary diagnosis.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122289147A_ABST

Patent Text Reader

Abstract

This invention discloses a medical image analysis method for early screening of cardiovascular diseases, belonging to the field of medical image analysis technology. The method includes acquiring chest X-ray medical image data; inputting the preprocessed images into a hybrid architecture neural network model for feature extraction and analysis; employing a multi-scale dilated convolution structure and an axial attention module to enhance the extraction capability of pathological features at different scales and directions; aggregating multi-scale features through a pyramid pooling strategy; and outputting early screening results for cardiovascular diseases. This invention systematically solves the problems of existing methods in early screening of cardiovascular diseases, such as weak global modeling capabilities, heavy computational burden, insufficient multi-scale feature capture, and incomplete utilization of prior knowledge of medical image structures. Therefore, it significantly improves the detection rate and diagnostic accuracy of early lesions while achieving lightweight and efficient inference, providing reliable technical support for large-scale population screening and primary healthcare auxiliary diagnosis.

Description

Technical Field

[0001] This invention relates to the field of medical image analysis technology, and in particular to a medical image analysis method for early screening of cardiovascular diseases.

Background Technology

[0002] Cardiovascular disease is one of the major threats to human health, and early detection and diagnosis are crucial for reducing its mortality and disability rates. Chest X-ray is a non-invasive, low-cost, and widely used imaging technique that reflects cardiovascular-related pathological features such as heart size, heart shape, and pulmonary vascular texture, making it the preferred screening method for cardiovascular diseases. However, traditional chest X-ray diagnosis relies heavily on manual interpretation by professional physicians, which presents several problems. First, the training period for cardiovascular radiologists is long and their numbers are limited, making it difficult to meet the needs of large-scale screening. Second, manual interpretation is easily influenced by subjective factors, so diagnostic consistency is difficult to guarantee. Third, the X-ray manifestations of early cardiovascular disease are often subtle and difficult to assess accurately with the naked eye. Finally, manual interpretation is inefficient when faced with massive amounts of chest X-ray data. With the development of deep learning, medical image analysis methods based on convolutional neural networks have made significant progress in disease diagnosis in recent years. Nevertheless, existing methods still have shortcomings when applied to the early screening of cardiovascular diseases. Traditional convolutional neural networks focus mainly on local feature extraction and struggle to model global contextual relationships in an image, whereas conditions such as cardiac hypertrophy and pulmonary edema often require a comprehensive judgment based on the overall cardiac contour and the surrounding tissue structures. Existing attention mechanisms have high computational complexity and struggle to achieve efficient inference while maintaining accuracy, which hinders deployment in primary-care institutions with limited medical resources. In addition, cardiovascular disease presents very differently across patients and lesion sizes vary, yet existing methods lack effective multi-scale feature capture and have a low detection rate for small early lesions.

[0003] In summary, current common solutions have the following drawbacks: existing early screening for cardiovascular diseases relies mainly on physicians manually interpreting chest X-rays, which is inefficient and highly subjective, while existing deep learning-based image analysis methods suffer from low early lesion detection rates and are difficult to deploy efficiently in real clinical environments, owing to the insufficient global modeling capability of pure convolutional neural networks, computationally expensive attention mechanisms, insufficient multi-scale feature extraction, and failure to fully exploit the row and column structure of medical images.

Summary of the Invention

[0004] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0005] In view of the problems existing in the current medical image analysis methods for early screening of cardiovascular diseases, this invention is proposed.

[0006] Therefore, the purpose of this invention is to provide a medical image analysis method for early screening of cardiovascular diseases. This method is applicable to solving the problems of low efficiency and high subjectivity in existing early screening of cardiovascular diseases, which mainly relies on physicians to manually interpret chest X-rays. Existing deep learning-based image analysis methods often suffer from low early lesion detection rates and difficulty in efficient deployment in actual clinical environments due to insufficient global modeling capabilities of pure convolutional neural networks, complex attention mechanism calculations, insufficient multi-scale feature extraction, and failure to fully integrate the row and column structure characteristics of medical images.

[0007] To solve the above-mentioned technical problems, the present invention provides the following technical solution: In a first aspect, embodiments of the present invention provide a medical image analysis method for early screening of cardiovascular diseases, which includes acquiring chest X-ray medical image data and performing preprocessing; inputting the preprocessed image into a HeartCare-Net hybrid architecture neural network model for feature extraction and analysis; employing a multi-scale dilated convolution structure and an axial attention module to enhance the extraction capability of pathological features at different scales and directions; aggregating multi-scale features through a pyramid pooling strategy and inputting them into a classification head for disease probability prediction; and outputting early screening results for cardiovascular diseases.

[0008] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the preprocessing includes size scaling and pixel value normalization; the HeartCare-Net model adopts a hybrid encoder combining convolutional neural networks (CNN) and Transformers, and reduces computational complexity through a proxy attention mechanism.

[0009] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the HeartCare-Net model includes: a CNN embedding layer for preliminary feature extraction and downsampling; a hybrid encoder comprising multiple stages, each stage including a local feature block, a downsampling layer, and a Transformer block; a multi-scale attention enhancement module, including a multi-scale dilated attention module and an axial attention module; a pyramid pooling module for multi-scale feature aggregation; and a classification head for outputting disease prediction probabilities.

[0010] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the local feature blocks adopt a multi-scale dilated convolution structure, including a parallel 1×1 convolution, a 3×3 convolution with a dilation rate of 1, and a 3×3 convolution with a dilation rate of 2, and the features of each branch are concatenated along the channel dimension and then fused by a 1×1 convolution.

[0011] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the Transformer block adopts a proxy attention mechanism, which reduces the computational complexity of self-attention by introducing proxy tokens. Specifically, it includes: aggregating the query vectors into a fixed number of proxy tokens; calculating the attention weights between the proxy tokens and all key vectors; weighting and aggregating the value vectors based on these weights to generate proxy feature representations; and calculating the attention weights between the original queries and the proxy features and aggregating them again.

[0012] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the axial attention module calculates attention along the image height and width directions respectively, and then merges the outputs of the two directions by splicing them together in the channel dimension.

[0013] As a preferred embodiment of the medical image analysis method for early screening of cardiovascular diseases described in this invention, the pyramid pooling module pools the feature map into three resolutions of 1×1, 2×2 and 4×4 respectively through parallel adaptive average pooling, and the pooled features are flattened and concatenated along the channel dimension.

[0014] In a second aspect, to further address the aforementioned technical problems, the present invention provides a medical image analysis system for early screening of cardiovascular diseases, comprising: a data acquisition module for acquiring chest X-ray medical image data and performing preprocessing; a feature extraction module for inputting the preprocessed image into a HeartCare-Net hybrid architecture neural network model for feature extraction and analysis; an extraction enhancement module for enhancing the extraction capability of pathological features of different scales and directions by employing a multi-scale dilated convolution structure and an axial attention module; a prediction input module for aggregating multi-scale features through a pyramid pooling strategy and inputting them into a classification head for disease probability prediction; and a result output module for outputting early screening results for cardiovascular diseases.

[0015] In a third aspect, embodiments of the present invention provide a computer device, including a memory and a processor, wherein the memory stores a computer program, and wherein, when the computer program is executed by the processor, it implements any step of the medical image analysis method for early screening of cardiovascular diseases as described in the first aspect of the present invention.

[0016] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements any step of the medical image analysis method for early screening of cardiovascular diseases as described in the first aspect of the present invention.

[0017] The beneficial effects of this invention are as follows: By constructing a collaborative HeartCare-Net hybrid architecture and introducing a proxy attention mechanism to reduce computational complexity, this invention combines multi-scale dilated convolution and axial attention modules to achieve adaptive extraction of pathological features of different sizes and directions. Finally, by fusing multi-scale contextual information through a pyramid pooling strategy, this invention systematically solves the problems of existing methods in the early screening of cardiovascular diseases, namely weak global modeling ability, heavy computational burden, insufficient multi-scale feature capture, and incomplete utilization of prior knowledge of medical image structure. Thus, while significantly improving the detection rate and diagnostic accuracy of early lesions, this invention achieves lightweight and efficient inference, and has good clinical applicability, deployability, and generalization ability, providing reliable technical support for large-scale population screening and primary healthcare auxiliary diagnosis.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein: Figure 1 is a flowchart illustrating the implementation of the present invention in Example 1.

[0019] Figure 2 is a technical structure diagram of the present invention in Example 1.

Detailed Implementation

[0020] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0021] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0022] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0023] Example 1. Referring to Figure 1 and Figure 2, this is the first embodiment of the present invention, which provides a medical image analysis method for early screening of cardiovascular diseases, including the following steps: S1: Acquire chest X-ray medical image data and perform preprocessing.

[0024] Preferably, the preprocessing includes size scaling and pixel value normalization.

[0025] Specifically, the purpose of this step is to transform raw medical image data from different source devices into a standardized format suitable for deep learning model processing.

[0026] Data acquisition: The patient's digital chest X-ray is acquired through the medical image archiving and communication system interface or standard image file reading method. The image is usually a single-channel grayscale image.

[0027] Size scaling: The acquired original image is uniformly scaled to a preset fixed size using an image interpolation algorithm (such as bilinear interpolation). This operation aims to eliminate image resolution differences caused by different devices or shooting protocols, and ensure the consistency of model input. The specific value of the fixed size can be set according to the balance between computing resources and accuracy, and its selection does not affect the essence of the present invention.

[0028] Pixel value normalization: The pixel intensity values of the scaled image are standardized. Typically, pixel values are mapped from their original range (e.g., 0-255 or a device-specific DICOM range) to a specific numerical interval (e.g., [-1, 1] or [0, 1]) through a linear transformation. A common alternative is standardization, Normalized pixel value = (Original pixel value - Mean) / Standard deviation, which yields approximately zero mean and unit variance rather than a bounded interval. This operation helps accelerate convergence during model training and improves numerical stability.

[0029] After preprocessing, the image is converted into a three-dimensional tensor (number of channels, height, width) with a single channel, and used as input for subsequent steps.

[0030] For example, a batch of DICOM format chest X-ray images from the hospital's Picture Archiving and Communication System (PACS) is obtained. First, the image pixel array and window width and level information are read and initially adjusted. Then, all images are uniformly scaled to a fixed size of 512×512 pixels using bilinear interpolation. Next, the pixel mean and standard deviation of the entire training dataset are calculated, and each image is normalized based on (pixel value - mean) / standard deviation to make the pixel value distribution close to zero mean and unit variance. Finally, the processed single-channel image is converted into a tensor of dimension [1, 512, 512] as the standard input of the model.
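The preprocessing in step S1 can be sketched as follows. This is a minimal illustration only, assuming NumPy and an image that has already been decoded into a 2-D array; the function names `resize_bilinear` and `preprocess` are hypothetical, DICOM reading and window width/level adjustment are omitted, and the mean and standard deviation are computed from a single random image as a stand-in for the training-set statistics.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D grayscale array with bilinear interpolation.

    Sample positions are spread evenly from the first to the last source
    pixel (an 'align corners' convention, chosen here as an assumption).
    """
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical blend weights, shape [out_h, 1]
    wx = (xs - x0)[None, :]  # horizontal blend weights, shape [1, out_w]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def preprocess(img, mean, std, size=512):
    """Scale to size x size, standardize, and return a [1, size, size] tensor."""
    resized = resize_bilinear(np.asarray(img, dtype=np.float64), size, size)
    normalized = (resized - mean) / std
    return normalized[None, :, :]  # add the single channel axis in front

# Illustrative input: a random 300x400 "radiograph" with 8-bit intensities.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 400))
train_mean, train_std = raw.mean(), raw.std()  # stand-in for dataset statistics
tensor = preprocess(raw, train_mean, train_std)
print(tensor.shape)  # (1, 512, 512)
```

In practice the statistics would be computed once over the whole training set and reused unchanged at inference time, as the example above describes.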

[0031] S2: Input the preprocessed image into the HeartCare-Net hybrid architecture neural network model for feature extraction and analysis.

[0032] Preferably, the HeartCare-Net model employs a hybrid encoder combining a convolutional neural network (CNN) and a Transformer, and reduces computational complexity through a proxy attention mechanism.

[0033] Specifically, the HeartCare-Net model includes: a CNN embedding layer for initial feature extraction and downsampling; a hybrid encoder consisting of multiple stages, each stage including a local feature block, a downsampling layer, and a Transformer block in sequence; a multi-scale attention enhancement module, including a multi-scale dilated attention module and an axial attention module; a pyramid pooling module for multi-scale feature aggregation; and a classification head for outputting the disease prediction probability.

[0034] Furthermore, the local feature blocks adopt a multi-scale dilated convolution structure, including parallel 1×1 convolutions, 3×3 convolutions with a dilation rate of 1, and 3×3 convolutions with a dilation rate of 2. The features of each branch are concatenated in the channel dimension and then fused by 1×1 convolutions.

[0035] Furthermore, the Transformer block employs a proxy attention mechanism, which reduces the computational complexity of self-attention by introducing proxy tokens. Specifically, this includes: aggregating the query vectors into a fixed number of proxy tokens; calculating the attention weights between the proxy tokens and all key vectors; weighting and aggregating the value vectors based on these weights to generate proxy feature representations; and calculating the attention weights between the original queries and the proxy features and aggregating them again.

[0036] Specifically, the axial attention module calculates attention along the image height and width directions respectively, and then merges the outputs of the two directions by concatenating them in the channel dimension.

[0037] Furthermore, the pyramid pooling module pools the feature maps into three resolutions of 1×1, 2×2, and 4×4 through parallel adaptive average pooling. The pooled features are then flattened and stitched together along the channel dimension.

[0038] Specifically, the HeartCare-Net model performs the transformation from the original image to high-level semantic features. This model is a hierarchical, multi-stage deep network, and its main sub-modules and their functions are as follows: CNN embedding layers: Function: Performs preliminary feature extraction and spatial downsampling, quickly converting a high-resolution input image into a feature map with more feature channels but lower spatial resolution.

[0039] Implementation: It typically consists of several levels of convolutional operations. Each level may contain: a convolutional kernel (e.g., 3×3) with a stride greater than 1 to achieve downsampling; a batch normalization layer for stabilizing training; and a non-linear activation function (e.g., ReLU). Through several such operations, the spatial size of the image is significantly reduced while abstracting the initial local feature patterns.

[0040] Hybrid encoder (composed of multiple stages connected in series): Each stage is responsible for extracting features at a specific level of abstraction, typically containing three sequentially connected sub-modules: Local feature blocks: Function: Fuse multi-scale local contextual information within a given receptive field.

[0041] Implementation: A parallel branching structure is adopted, typically containing at least three branches: (a) a 1×1 standard convolution for cross-channel information integration and dimensionality reduction; (b) a 3×3 dilated convolution with a small dilation rate (e.g., r=1) for capturing basic local neighborhood features; and (c) a 3×3 dilated convolution with a larger dilation rate (e.g., r=2) for expanding the receptive field and capturing a wider range of context without increasing the number of parameters.

[0042] The output feature maps of the branches are concatenated along the channel dimension and then passed through a 1×1 convolution, which fuses them and adjusts the number of channels. This module usually introduces a residual connection that adds the input directly to the fused output in order to alleviate the vanishing gradient problem in deep networks.
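The local feature block described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the helper names `conv2d_same`, `local_feature_block`, and `init_params` are hypothetical, weights are random, every branch keeps the input channel count, and batch normalization and activation functions are left out for brevity.

```python
import numpy as np

def conv2d_same(x, w, dilation=1):
    """Stride-1 convolution with zero padding that preserves H and W.

    x: [C_in, H, W]; w: [C_out, C_in, k, k] with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Kernel tap (i, j) reads the input shifted by i*dilation, j*dilation.
            patch = xp[:, i * dilation:i * dilation + h,
                       j * dilation:j * dilation + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def local_feature_block(x, params):
    """Three parallel branches, channel concatenation, 1x1 fusion, residual add."""
    b1 = conv2d_same(x, params["w1x1"])                 # 1x1 branch
    b2 = conv2d_same(x, params["w3x3_r1"], dilation=1)  # 3x3, dilation rate 1
    b3 = conv2d_same(x, params["w3x3_r2"], dilation=2)  # 3x3, dilation rate 2
    cat = np.concatenate([b1, b2, b3], axis=0)          # [3C, H, W]
    fused = conv2d_same(cat, params["w_fuse"])          # back to [C, H, W]
    return x + fused                                    # residual connection

def init_params(channels, rng, scale=0.1):
    """Random illustrative weights; a trained model would learn these."""
    return {
        "w1x1": rng.normal(0, scale, (channels, channels, 1, 1)),
        "w3x3_r1": rng.normal(0, scale, (channels, channels, 3, 3)),
        "w3x3_r2": rng.normal(0, scale, (channels, channels, 3, 3)),
        "w_fuse": rng.normal(0, scale, (channels, 3 * channels, 1, 1)),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))
params = init_params(8, rng)
y = local_feature_block(x, params)
print(y.shape)  # (8, 16, 16)
```

Both 3×3 branches hold the same number of weights; the dilation rate of 2 only spaces the nine taps two pixels apart, so that branch covers a 5×5 neighborhood.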

[0043] Downsampling layer: Function: Further reduce the spatial resolution of the feature maps and increase the number of feature channels, enabling the network to focus on more semantic global information.

[0044] Implementation: Typically, max pooling with a stride of 2 is used, or convolution with a stride of 2 is used.

[0045] Transformer block (using proxy attention mechanism): Function: Models long-range dependencies between all spatial locations in the feature map and captures the global context.

[0046] Implementation (core is proxy attention): Projection: The input features are transformed linearly to generate three sets of vectors: query (Q), key (K), and value (V).

[0047] Proxy generation: The query vectors Q are compressed into a fixed number K of proxy tokens through an aggregation operation (such as adaptive average pooling), where K is much smaller than the number of feature positions N. Here K denotes the proxy-token count and is distinct from the key vectors, which this description also labels K.

[0048] Proxy-key interaction: Calculate the similarity between these K proxy tokens and all N key vectors (K) to obtain the attention weight matrix (size K×N).

[0049] First aggregation: The value vectors (V) are weighted and summed using the weights mentioned above to generate K proxy features.

[0050] Query-proxy interaction: Calculate the similarity between the original N query vectors (Q) and K proxy features to obtain the second attention weight matrix (size N×K).

[0051] Final aggregation: The proxy features are weighted and summed using the second weight matrix, and the enhanced feature representation of each original position is output.

[0052] Normalization and feedforward: The attention output above passes through layer normalization, is then transformed nonlinearly by a multilayer perceptron (usually containing two linear layers and an activation function), and is finally combined with its input through a residual connection.

[0053] Advantages: It reduces the computational complexity of standard self-attention from O(N²) to O(KN), significantly improving computational efficiency with almost no performance loss.
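The proxy attention steps listed above can be sketched as follows. This is a hedged NumPy illustration: the function name `proxy_attention` is hypothetical, the proxy count (K in the text) is called `num_proxies` to avoid clashing with the key matrix `k`, the 1/sqrt(d) scaling and softmax normalization are conventional choices assumed here rather than stated in the source, and the learned projections that would produce Q, K, and V are replaced by random matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def proxy_attention(q, k, v, num_proxies):
    """q, k, v: [N, d] arrays; returns the output and both weight matrices.

    Assumes num_proxies <= N and that v has the same width d as q.
    """
    n, d = q.shape
    # Proxy generation: average-pool the N queries into num_proxies groups.
    groups = np.array_split(np.arange(n), num_proxies)
    proxies = np.stack([q[idx].mean(axis=0) for idx in groups])  # [M, d]
    # Proxy-key interaction: one row of weights per proxy token.
    w1 = softmax(proxies @ k.T / np.sqrt(d))                     # [M, N]
    # First aggregation: weighted sum of the value vectors.
    proxy_feats = w1 @ v                                         # [M, d]
    # Query-proxy interaction, following the description above.
    w2 = softmax(q @ proxy_feats.T / np.sqrt(d))                 # [N, M]
    # Final aggregation: every original position mixes the proxy features.
    return w2 @ proxy_feats, w1, w2

rng = np.random.default_rng(0)
n_tokens, dim = 64 * 64, 64          # a 64x64 feature map with 64 channels
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))
out, w1, w2 = proxy_attention(q, k, v, num_proxies=16)
print(out.shape, w1.shape, w2.shape)  # (4096, 64) (16, 4096) (4096, 16)
```

The two weight matrices together hold 2 × 16 × 4096 entries, whereas a full self-attention matrix over the same sequence would hold 4096 × 4096 entries, which is where the reduction from O(N²) to O(KN) comes from.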

[0054] Preferably, by constructing a hierarchical CNN-Transformer hybrid encoder, a radiological diagnostic cognitive path from local detail perception to global relational reasoning is creatively simulated. This enables efficient collaborative modeling of local pathological signs (such as texture abnormalities) and overall anatomical structures (such as the spatial relationship between the heart outline and the lung background), thereby overcoming the inherent limitations of traditional single-architecture models in global context understanding or local detail preservation. This provides a more discriminative deep feature representation for comprehensive imaging assessment of cardiovascular diseases. By introducing a fixed number of proxy tokens as the hub for global information interaction, the computational complexity of standard self-attention is reduced from the quadratic level of the sequence length to the linear level. Without losing global modeling capabilities, the computational and memory overhead of the model is significantly reduced. This allows the Transformer module, which includes powerful global context modeling capabilities, to be efficiently integrated into models for high-resolution medical images, providing key technical feasibility for real-time or near-real-time analysis in resource-constrained clinical environments.

[0055] Preferably, by deploying convolutional kernels with different dilation rates in parallel in the early stages of the network, a multi-receptive-field feature extraction pathway is explicitly constructed, enabling the network to simultaneously capture pathological features at different spatial scales, from fine punctate calcifications to diffuse patchy shadows. This significantly enhances the model's robustness to changes in lesion size, which directly improves the sensitivity and detection rate for early, atypical, or small cardiovascular lesions, and solves the shortcomings of single-scale convolutional kernels in dealing with lesion diversity.

[0056] For example, an input image of size [1, 512, 512] first passes through the CNN embedding layer (such as two 3×3 convolutional layers with a stride of 2), which outputs a feature map of size [64, 128, 128]. It then enters the first stage of the hybrid encoder: the feature map first passes through a local feature block, whose three parallel branches (1×1 convolution, 3×3 convolution with a dilation rate of 1, and 3×3 convolution with a dilation rate of 2) extract features separately; these are concatenated along the channel dimension and fused through a 1×1 convolution to output features of size [64, 128, 128]. The feature map is then downsampled to [64, 64, 64] through max pooling. Finally, it is reshaped into a sequence and input into the Transformer block, where the proxy attention mechanism aggregates the query vectors of the sequence into 16 proxy tokens. After two efficient attention-weighted aggregations, a feature map of the same size that incorporates global context information is output.
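The spatial sizes quoted in this example can be checked with the standard convolution output-size formula. The short script below is only a worked calculation; the padding values and the 2×2 pooling window are assumptions, since the description gives only kernel sizes, strides, and dilation rates.

```python
def conv_out(size, kernel, stride, padding, dilation=1):
    """Output length of a convolution or pooling window along one axis."""
    effective = dilation * (kernel - 1) + 1  # span covered by the kernel taps
    return (size + 2 * padding - effective) // stride + 1

# CNN embedding: two 3x3 convolutions with stride 2 (padding 1 assumed).
size = 512
for _ in range(2):
    size = conv_out(size, kernel=3, stride=2, padding=1)
embed_size = size                                            # 128

# Local feature block branches keep the resolution when padding = dilation.
branch_r1 = conv_out(embed_size, 3, 1, padding=1, dilation=1)  # 128
branch_r2 = conv_out(embed_size, 3, 1, padding=2, dilation=2)  # 128

# Downsampling by max pooling (2x2 window, stride 2 assumed).
pooled = conv_out(embed_size, kernel=2, stride=2, padding=0)   # 64

# Sequence length seen by the Transformer block and attention matrix sizes.
num_tokens = pooled * pooled                                  # 4096
num_proxies = 16
full_attention_entries = num_tokens ** 2                      # 16777216
proxy_attention_entries = 2 * num_proxies * num_tokens        # 131072

print(embed_size, branch_r1, branch_r2, pooled, num_tokens)
print(full_attention_entries // proxy_attention_entries)      # 128
```

Under these assumptions the two attention weight matrices of the proxy mechanism are 128 times smaller than a full self-attention matrix for the 64×64 feature map.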

[0057] S3: Employ a multi-scale dilated convolution structure and an axial attention module to enhance the extraction capability of pathological features at different scales and directions.

[0058] Specifically, the multi-scale dilated attention module: Function: Adaptively focuses on suspected lesion areas of different spatial extents, addressing the variation in lesion scale.

[0059] Implementation: A multi-head attention mechanism is adopted, which divides the input features into multiple "heads" along the channel dimension. The key design is that each head embeds a dilated convolution with its own dilation rate in the depthwise separable convolution or self-attention calculation it uses. Each head therefore has a different effective receptive field when calculating the correlation between features, and focuses on small, medium, or large area features respectively. The outputs of the heads are finally concatenated along the channel dimension and fused.

[0060] Axial attention module: Function: Effectively captures structural features in medical images that extend along specific anatomical directions (such as rib orientation and vascular texture).

[0061] Implementation: Decompose the two-dimensional global attention into two one-dimensional sequence attention calculations.

[0062] Height-direction attention: Flatten the input feature map in the width dimension, treat each row of pixels as a sequence, and calculate attention along the height direction (between rows).

[0063] Width-direction attention: Flatten the input feature map in the height dimension, treat each column of pixels as a sequence, and calculate attention along the width direction (between columns).

[0064] Fusion: The two feature maps, which have been enhanced with height and width attention respectively, are concatenated along the channel dimension and fused through a linear projection layer or a 1×1 convolution.

[0065] Preferably, this module decouples the two-dimensional global attention calculation into two one-dimensional sequential attentions in the height and width directions. This is a dedicated design for image grid structure and medical anatomical priors, which can efficiently and specifically model long-range dependencies along specific anatomical directions (such as blood vessel orientation, rib arrangement, and diaphragm position). This enhances the model's ability to analyze structural anomalies (such as aortic tortuosity and pulmonary blood redistribution), thereby increasing the domain relevance of feature representation and improving the interpretability and specificity of diagnostic decisions.

[0066] For example, the high-level feature map output by the encoder (e.g., size [256, 16, 16]) is first input into a multi-scale dilated attention module: the feature is divided into four heads along the channel. Two heads use depthwise convolutions with a dilation rate of 1 to focus on the neighboring region when calculating attention, while the other two heads use depthwise convolutions with a dilation rate of 3 to focus on a wider range of context. The outputs of the four heads are fused after channel concatenation. Subsequently, the feature map is input into an axial attention module: in the height direction attention, the feature map is flattened into 16 sequences of length 256 by rows for calculation; in the width direction attention, it is flattened into 16 sequences by columns for calculation. Finally, the enhanced features in the two directions are concatenated along the channel dimension and fused through a linear layer to output enhanced features that can simultaneously reflect the direction of multi-scale lesions and anatomical structures.
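The axial attention step described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the function names are hypothetical, the feature vectors themselves serve as queries, keys, and values (the learned projections are omitted), and the fusion layer is a random 1×1 convolution written as a matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(seq):
    """Self-attention over a batch of sequences; seq: [B, L, C]."""
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(seq.shape[-1])
    return softmax(scores, axis=-1) @ seq

def height_attention(x):
    """Attend along the height axis, independently for every column."""
    seq = x.transpose(2, 1, 0)                  # [W, H, C]: one sequence per column
    return self_attend(seq).transpose(2, 1, 0)  # back to [C, H, W]

def width_attention(x):
    """Attend along the width axis, independently for every row."""
    seq = x.transpose(1, 2, 0)                  # [H, W, C]: one sequence per row
    return self_attend(seq).transpose(2, 0, 1)  # back to [C, H, W]

def axial_attention(x, w_fuse):
    """Concatenate both directions on the channel axis and fuse with a 1x1 conv."""
    both = np.concatenate([height_attention(x), width_attention(x)], axis=0)
    return np.einsum("oc,chw->ohw", w_fuse, both)  # w_fuse: [C, 2C]

rng = np.random.default_rng(0)
feat = rng.normal(size=(256, 16, 16))           # encoder output [C, H, W]
w_fuse = rng.normal(0, 0.05, size=(256, 512))   # illustrative fusion weights
enhanced = axial_attention(feat, w_fuse)
print(enhanced.shape)  # (256, 16, 16)
```

Each direction attends over sequences of only 16 positions, so the attention matrices are 16×16 per row or column instead of one 256×256 matrix over all spatial positions.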

[0067] S4: Aggregate multi-scale features using a pyramid pooling strategy and input them into a classification head for disease probability prediction.

[0068] Specifically, the function of the pyramid pooling module is to aggregate global contextual information at different spatial scales into a unified feature representation that includes both fine details and overall semantics.

[0069] Implementation: For the enhanced feature map in step S3, perform multiple adaptive average pooling operations in parallel. The target output size of these pooling operations is set to multiple different, preset small resolutions (e.g., 1×1, 2×2, 4×4). Adaptive pooling ensures that a fixed-size feature grid can be output regardless of the size of the input feature map. Subsequently, all pooling results are flattened into one-dimensional feature vectors, and these vectors are sequentially concatenated along the channel dimension to form a long feature vector.
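A minimal sketch of this pooling step is given below, using only the Python standard library. It assumes the floor/ceil bin-edge rule that deep learning libraries commonly use for adaptive pooling, since the source does not specify how bins are computed. The names `adaptive_avg_pool2d` and `pyramid_pool` are illustrative.

```python
import math


def adaptive_avg_pool2d(fmap, out_size):
    """Average-pool a [C][H][W] nested list down to [C][out_size][out_size].

    Bin edges use a floor/ceil rule (an assumption), so any input size works.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    pooled = []
    for plane in fmap:
        grid = []
        for i in range(out_size):
            top = (i * height) // out_size
            bottom = math.ceil((i + 1) * height / out_size)
            row = []
            for j in range(out_size):
                left = (j * width) // out_size
                right = math.ceil((j + 1) * width / out_size)
                window = [plane[h][w]
                          for h in range(top, bottom)
                          for w in range(left, right)]
                row.append(sum(window) / len(window))
            grid.append(row)
        pooled.append(grid)
    return pooled


def pyramid_pool(fmap, levels=(1, 2, 4)):
    """Pool at every level, flatten each result, and concatenate into one vector."""
    vector = []
    for size in levels:
        for grid in adaptive_avg_pool2d(fmap, size):
            for row in grid:
                vector.extend(row)
    return vector


# Toy input: 2 channels of 8x8, where channel c holds h * 8 + w + 100 * c.
toy_map = [[[h * 8 + w + 100 * c for w in range(8)] for h in range(8)]
           for c in range(2)]
pooled_vector = pyramid_pool(toy_map)
print(len(pooled_vector))  # 2 * (1 + 4 + 16) = 42
```

Because the bin edges are derived from the input size, the output length depends only on the channel count and the chosen levels, not on the spatial size of the incoming feature map.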

[0070] Classification head. Function: maps the aggregated feature vector to the probability of each specific cardiovascular disease category.

[0071] Implementation: The classification head typically consists of a multi-layer fully connected network, which includes one or more hidden layers, each containing a linear transformation, a nonlinear activation function (such as ReLU), and optional batch normalization or Dropout layers (to prevent overfitting).

[0072] The network ends with an output layer whose number of neurons equals the number of cardiovascular disease categories to be screened.

[0073] For multi-label classification tasks, the output layer typically uses the Sigmoid activation function to map the output of each neuron to a probability value in the interval [0,1], representing the likelihood of the corresponding disease.
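The forward pass of such a head at inference time can be sketched as follows. This is an illustration under stated assumptions, not the trained model. The weights are random placeholders, the layer widths (32, 16, 8, 14) are scaled down for brevity, and Dropout and batch normalization are omitted (Dropout is inactive at inference). The names `init_layer` and `classification_head` are hypothetical.

```python
import math
import random


def linear(vector, weights, bias):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def sigmoid(value):
    """Numerically stable logistic function."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def init_layer(rng, n_in, n_out):
    """Random placeholder parameters; a real system would load trained ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classification_head(features, layers):
    """Hidden layers use ReLU; the output layer applies Sigmoid per neuron."""
    hidden = features
    for weights, bias in layers[:-1]:
        hidden = relu(linear(hidden, weights, bias))
    weights, bias = layers[-1]
    return [sigmoid(z) for z in linear(hidden, weights, bias)]


rng = random.Random(0)
layers = [init_layer(rng, 32, 16), init_layer(rng, 16, 8), init_layer(rng, 8, 14)]
features = [rng.random() for _ in range(32)]
probabilities = classification_head(features, layers)
print(len(probabilities))
```

Each output passes through its own Sigmoid, so the probabilities are independent and need not sum to 1. This is what allows several diseases to be flagged for the same image.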

[0074] Preferably, this strategy aggregates feature maps into multiple fixed grid representations of different spatial scales through parallel adaptive pooling operations. At the same time, it retains global contextual information containing overall semantics, medium-range regional correlation information, and fine local location information, forming a robust feature vector with multi-resolution fusion. This aggregation method effectively alleviates the problem of spatial information loss before final classification, enhances the model's generalization ability for target location and morphological changes, and provides a more comprehensive and stable decision basis for the final classifier.

[0075] For example, the enhanced feature map (e.g., [512, 8, 8]) is input into the pyramid pooling module, where three adaptive average pooling operations are performed in parallel, outputting features of size [512, 1, 1], [512, 2, 2], and [512, 4, 4], respectively. These features are flattened into vectors of 512, 2048, and 8192 dimensions and then concatenated to form a 10752-dimensional comprehensive feature vector. This vector is input into the classification head: first through a 1024-dimensional fully connected layer with ReLU activation and Dropout regularization; then through a 256-dimensional fully connected layer for nonlinear transformation; and finally through an output layer whose number of neurons equals the number of disease categories (e.g., 14 categories), activated by the Sigmoid function, which outputs an independent predicted probability for each category.
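The dimensions in this example can be checked with a few lines of arithmetic. The parameter counts below are an added illustration, not figures from the source, and they assume that every fully connected layer carries a bias term.

```python
channels = 512
pool_sizes = [1, 2, 4]

# Flattened size of each pooled branch: channels * grid_height * grid_width.
branch_dims = [channels * size * size for size in pool_sizes]  # [512, 2048, 8192]
concat_dim = sum(branch_dims)                                  # 10752

# Layer widths of the classification head in the example.
layer_widths = [concat_dim, 1024, 256, 14]

# Parameters per fully connected layer, assuming weights plus one bias per output.
param_counts = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_widths, layer_widths[1:])]

print(branch_dims, concat_dim)
print(param_counts, sum(param_counts))
```

Under these assumptions, the first fully connected layer accounts for most of the head's parameters, so the choice of pyramid levels directly sets the size of the head.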

[0076] S5: Output early screening results for cardiovascular diseases.

[0077] Specifically, the probability vector output by the classification head is used as the final result. This result can be presented to the physician directly in the form of a list (e.g., "cardiac hypertrophy: 92% probability, pulmonary edema: 15% probability..."), or it can be binarized after setting a threshold (e.g., if the probability is >50%, it indicates a positive result) and integrated into the diagnostic report.

[0078] For example, the model calculates a probability vector from an input X-ray image, such as [cardiac hypertrophy: 0.87, pulmonary edema: 0.12, aortic calcification: 0.95,...]. The system compares this vector with a preset threshold (e.g., 0.5) and generates a structured JSON report: {"Positive Findings": ["Cardiac Hypertrophy", "Aortic Calcification"], "Detailed Probabilities": {...}, "Recommendation": "Please pay close attention to cardiac morphology and aortic calcification in conjunction with clinical findings."}. This report can be automatically embedded into the diagnostic report draft of the radiology information system, highlighting positive results to assist physicians in quickly reviewing and issuing the final report.
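The thresholding and report assembly described here can be sketched as follows. The function name `build_report` and the recommendation template are hypothetical, because the source does not say how the recommendation text is generated. The JSON keys and the example probabilities are taken from the paragraph above.

```python
import json


def build_report(probabilities, threshold=0.5):
    """Binarize per-disease probabilities and assemble a structured report.

    A finding counts as positive when its probability is strictly above
    the threshold.
    """
    positives = [name for name, p in probabilities.items() if p > threshold]
    if positives:
        # Hypothetical template; the source does not define this wording.
        recommendation = ("Please review the following findings in conjunction "
                          "with clinical findings: " + ", ".join(positives) + ".")
    else:
        recommendation = "No finding exceeded the screening threshold."
    return {
        "Positive Findings": positives,
        "Detailed Probabilities": probabilities,
        "Recommendation": recommendation,
    }


example = {
    "Cardiac Hypertrophy": 0.87,
    "Pulmonary Edema": 0.12,
    "Aortic Calcification": 0.95,
}
report = build_report(example)
print(json.dumps(report, indent=2))
```

Keeping the threshold as a parameter lets a deployment trade sensitivity against specificity without retraining the model.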

[0079] In summary, this invention constructs a collaborative HeartCare-Net hybrid architecture and innovatively introduces a proxy attention mechanism to reduce computational complexity. It combines multi-scale dilated convolution with an axial attention module to achieve adaptive extraction of pathological features of different sizes and directions. Finally, it integrates multi-scale contextual information through a pyramid pooling strategy. This systematically solves the problems of existing methods in early screening of cardiovascular diseases, such as weak global modeling ability, heavy computational burden, insufficient multi-scale feature capture, and incomplete utilization of prior knowledge of medical image structures. As a result, it significantly improves the detection rate and diagnostic accuracy of early lesions while achieving lightweight and efficient inference. It has good clinical applicability, deployability, and generalization ability, providing reliable technical support for large-scale population screening and primary healthcare auxiliary diagnosis.

[0080] Example 2, an embodiment of the present invention, provides a medical image analysis system for early screening of cardiovascular diseases, comprising: a data acquisition module for acquiring chest X-ray medical image data and performing preprocessing; a feature extraction module for inputting the preprocessed image into a HeartCare-Net hybrid architecture neural network model for feature extraction and analysis; an extraction enhancement module for enhancing the extraction capability of pathological features of different scales and directions by employing a multi-scale dilated convolution structure and an axial attention module; a prediction input module for aggregating multi-scale features through a pyramid pooling strategy and inputting them into a classification head for disease probability prediction; and a result output module for outputting early screening results for cardiovascular diseases.

[0081] Example 3 is an embodiment of the present invention, which differs from the previous embodiment in the following respect. If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention in essence, or the part that contributes over the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0082] The logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.

[0083] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, a computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.

[0084] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0085] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A medical image analysis method for early screening of cardiovascular diseases, characterized by comprising: acquiring chest X-ray medical image data and performing preprocessing; inputting the preprocessed image into a HeartCare-Net hybrid architecture neural network model for feature extraction and analysis; employing a multi-scale dilated convolution structure and an axial attention module to enhance the extraction capability for pathological features of different scales and directions; aggregating multi-scale features through a pyramid pooling strategy and inputting them into a classification head for disease probability prediction; and outputting early screening results for cardiovascular diseases.

2. The medical image analysis method for early screening of cardiovascular diseases as described in claim 1, characterized in that: the preprocessing includes size scaling and pixel value normalization; and the HeartCare-Net model employs a hybrid encoder combining convolutional neural networks (CNN) and Transformers, and reduces computational complexity through a proxy attention mechanism.

3. The medical image analysis method for early screening of cardiovascular diseases as described in claim 1, characterized in that the HeartCare-Net model includes: CNN embedding layers for initial feature extraction and downsampling; a hybrid encoder consisting of multiple stages, each stage including, in sequence, a local feature block, a downsampling layer, and a Transformer block; a multi-scale attention enhancement module including a multi-scale dilated attention module and an axial attention module; a pyramid pooling module for multi-scale feature aggregation; and a classification head for outputting the disease prediction probability.

4. The medical image analysis method for early screening of cardiovascular diseases as described in claim 3, characterized in that: The local feature blocks adopt a multi-scale dilated convolution structure, including parallel 1×1 convolutions, 3×3 convolutions with a dilation rate of 1, and 3×3 convolutions with a dilation rate of 2. The features of each branch are concatenated in the channel dimension and then fused by 1×1 convolutions.

5. The medical image analysis method for early screening of cardiovascular diseases as described in claim 3, characterized in that: the Transformer block employs a proxy attention mechanism, which reduces the computational complexity of self-attention by introducing proxy tokens, and which specifically includes: aggregating the query vectors into a fixed number of proxy tokens; calculating the attention weights between the proxy tokens and all key vectors; weighting and aggregating the value vectors based on these weights to generate a proxy feature representation; and calculating the attention weights between the original queries and the proxy features and aggregating again.

6. The medical image analysis method for early screening of cardiovascular diseases as described in claim 3, characterized in that: The axial attention module calculates attention along the image height and width directions respectively, and then merges the outputs of the two directions by concatenating them in the channel dimension.

7. The medical image analysis method for early screening of cardiovascular diseases as described in claim 3, characterized in that: the pyramid pooling module pools the feature map into three resolutions, 1×1, 2×2, and 4×4, through parallel adaptive average pooling, and the pooled features are then flattened and concatenated along the channel dimension.

8. A medical image analysis system for early screening of cardiovascular diseases, based on the medical image analysis method for early screening of cardiovascular diseases according to any one of claims 1 to 7, characterized by comprising: a data acquisition module for acquiring chest X-ray medical image data and performing preprocessing; a feature extraction module for inputting the preprocessed image into the HeartCare-Net hybrid architecture neural network model for feature extraction and analysis; an extraction enhancement module for enhancing the extraction capability for pathological features of different scales and directions by employing a multi-scale dilated convolution structure and an axial attention module; a prediction input module for aggregating multi-scale features through a pyramid pooling strategy and inputting them into the classification head for disease probability prediction; and a result output module for outputting early screening results for cardiovascular diseases.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: When the processor executes the computer program, it implements the steps of the medical image analysis method for early screening of cardiovascular diseases as described in any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that: when the computer program is executed by a processor, it implements the steps of the medical image analysis method for early screening of cardiovascular diseases as described in any one of claims 1 to 7.