Intelligent auxiliary diagnosis device and equipment for common diseases of knee

By employing a partially shared and independent branch design in the YOLO model, combined with bidirectional cross-attention feature fusion and a feature pyramid network, the problem of insufficient visual feature extraction in existing models for knee disease diagnosis is solved, thereby improving diagnostic efficiency and accuracy.

CN122245698A · Pending · Publication Date: 2026-06-19 · THE FIRST AFFILIATED HOSPITAL OF SUN YAT SEN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: THE FIRST AFFILIATED HOSPITAL OF SUN YAT SEN UNIV
Filing Date: 2026-02-04
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing computer vision target detection models are unable to fully exploit the semantic correspondence and complementary information between two perspectives in the diagnosis of knee diseases, resulting in parameter redundancy or loss of perspective features, which affects diagnostic efficiency and accuracy.

Method used

The YOLO model is adopted. Partially shared and independent branches in the backbone network extract the common features and the view-specific high-level semantic features of the anteroposterior and lateral views. The neck network then fuses these features using bidirectional cross-attention feature fusion, a feature pyramid network, and a path aggregation network to improve detection performance.

Benefits of technology

It enables the establishment of two-perspective semantic correspondences at multiple scales, improving the efficiency and accuracy of knee disease diagnosis and assisting doctors in making quick and accurate decisions.



Abstract

This application proposes an intelligent auxiliary diagnostic device and equipment for common knee diseases. It constructs a YOLO detection model comprising a backbone network, a neck network, and a detection head. The backbone network includes an anteroposterior (AP) view processing branch and a lateral view processing branch. These branches share a common portion for extracting common features from both AP and lateral views, as well as an independent portion for extracting view-related semantically specific features. This satisfies both the network design requirement of efficiently utilizing shared parameters while preserving view-specific features, and the design requirement of a fusion module that establishes semantic correspondence between two viewpoints at multiple scales to improve detection performance. This ensures the efficiency and accuracy of AI-assisted diagnosis, enabling doctors to make quick and accurate decisions.


Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to an intelligent auxiliary diagnostic device and equipment for common knee diseases.

Background Technology

[0002] Knee diseases are diverse, including degenerative joint diseases, fractures, osteonecrosis, and bone tumors. Early detection and accurate diagnosis are crucial for patient treatment and prognosis. Traditional X-ray examinations are widely used in primary healthcare institutions due to their low cost and speed. However, image interpretation relies heavily on the radiologist's experience, leading to subjectivity and inconsistent diagnoses, especially in grassroots hospitals lacking experienced physicians, where misdiagnosis rates are relatively high.

[0003] Artificial intelligence (AI) has been widely applied in the medical field to solve many challenging medical problems. Deep learning, as an emerging machine learning technology, has the advantage of mining high-level semantic features of images and is currently being used in the diagnosis of various clinical diseases.

[0004] Existing computer vision object detection models mostly use single-view input or employ simple multi-channel stitching / fusion strategies, making it difficult to fully explore the semantic correspondence and complementary information between the two viewpoints. Furthermore, directly extracting features from the two inputs through independent or fully shared backbones often leads to either parameter redundancy (independent backbones) or loss of viewpoint features (fully shared backbones). Therefore, the efficiency and accuracy of current AI-assisted diagnosis cannot meet the needs of assisting doctors in making rapid and accurate decisions.

Summary of the Invention

[0005] This application proposes an intelligent auxiliary diagnostic device and equipment for common knee diseases, which can solve one of the problems existing in the background art.

[0006] To achieve the above objectives, this application adopts the following technical solution. In a first aspect, an intelligent auxiliary diagnostic device for common knee diseases is provided, including: a preprocessing module, used to preprocess the obtained original anteroposterior (AP) and lateral radiographs to obtain the AP and lateral radiographs to be detected; and a detection network, employing the YOLO model, used to process the AP and lateral radiographs to be detected to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes an AP view processing branch that takes the AP radiograph to be detected as input and a lateral view processing branch that takes the lateral radiograph to be detected as input. The two branches have a shared part for extracting features common to the AP and lateral radiographs to be detected, and an independent part for extracting the view-specific high-level semantic features of each. The shared part performs primary feature extraction on the AP and lateral radiographs to be detected to obtain primary AP view features and primary lateral view features. The independent part performs secondary feature extraction on the primary AP view features to obtain secondary AP view features, and on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

[0007] Based on the above technical solution, a YOLO detection model is constructed, comprising a backbone network, a neck network, and a detection head. The backbone network includes an anteroposterior (AP) view processing branch and a lateral view processing branch. The AP view processing branch and the lateral view processing branch share a common part for extracting common features from AP and lateral view images, as well as an independent part for extracting view-related high semantic features from AP and lateral view images. This satisfies the network design requirements of efficiently utilizing shared parameters while preserving view-specific features, and also meets the design requirements of a fusion module that establishes semantic correspondence between two viewpoints at multiple scales and improves detection performance. This ensures the efficiency and accuracy of AI-assisted diagnosis and can assist doctors in making quick and accurate decisions.

[0008] In one possible design of the first aspect, the shared portion comprises, in sequence: a Focus layer; a first CBS module consisting of a convolutional layer, a batch normalization layer, and a SiLU activation function layer; a first C3T_X layer consisting of a combination of CBS submodules, cross-layer residual connections, and a splicing sublayer; and a second CBS module, where X represents the number of residual layers. The independent portion comprises, in sequence: a second C3T_X layer, a third CBS module, a third C3T_X layer, a fourth CBS module, and an SPP layer consisting of max-pooling layers and CBS submodules. The outputs of the second C3T_X layer, the third C3T_X layer, and the SPP layer serve as the inputs to the neck network.

[0009] In one possible design of the first aspect, the neck network includes: a bidirectional cross-attention feature fusion (FFM) module, a feature pyramid network (FPN), and a path aggregation network (PAN).

The FFM module includes: a first convolutional block attention (CBAM) module, a second CBAM module, a third CBAM module, a fourth CBAM module, a first query, key, and value (QKV) matrix, a second QKV matrix, a third QKV matrix, a fourth QKV matrix, a first splicing layer, a fifth CBS module, a sixth CBS module, a seventh CBS module, a first squeeze-and-excitation (SE) module, a second SE module, and a third SE module. Features from the AP view processing branch serve as inputs to the first CBAM module and the second CBAM module. Features from the lateral view processing branch serve as inputs to the third CBAM module and the fourth CBAM module. The outputs of the first CBAM module and the first SE module serve as inputs to the third QKV matrix. The outputs of the second and third CBAM modules serve as inputs to the first QKV matrix and also as inputs to the second QKV matrix. The outputs of the fourth CBAM module and the first SE module serve as inputs to the fourth QKV matrix. The outputs of the first and second QKV matrices serve as inputs to the first splicing layer. The first splicing layer, the fifth CBS module, and the first SE module are connected sequentially. The third QKV matrix, the sixth CBS module, and the second SE module are connected sequentially. The fourth QKV matrix, the seventh CBS module, and the third SE module are connected sequentially.

The feature pyramid network comprises: a first C3F_X layer consisting of a combination of CBS submodules and a splicing sublayer, an eighth CBS module, a second C3F_X layer, a ninth CBS module, a first upsampling layer, a second splicing layer, a third C3F_X layer, a tenth CBS module, a second upsampling layer, a third splicing layer, and a fourth C3F_X layer.

The path aggregation network includes: an eleventh CBS module, a fourth splicing layer, a fifth C3F_X layer, a twelfth CBS module, a fifth splicing layer, and a sixth C3F_X layer.

The inputs of the first C3F_X layer and the second C3F_X layer are the outputs of the SPP layer. The first C3F_X layer is connected to the eighth CBS module, and the second C3F_X layer is connected to the ninth CBS module. The outputs of the eighth and ninth CBS modules serve as the inputs of the first FFM module. The output of the first FFM module serves as the input of the first upsampling layer and the fifth splicing layer. The outputs of the two second C3T_X layers in the independent part serve as the inputs of the second FFM module. The outputs of the two third C3T_X layers in the independent part serve as the inputs of the third FFM module. The outputs of the first upsampling layer and the third FFM module serve as the inputs of the second splicing layer. The second splicing layer, the third C3F_X layer, and the tenth CBS module are connected sequentially. The output of the tenth CBS module serves as the input to the second upsampling layer and the fourth splicing layer. The outputs of the second FFM module and the second upsampling layer serve as the inputs to the third splicing layer. The third splicing layer is connected to the fourth C3F_X layer. The output of the fourth C3F_X layer serves as the input to the first detection head and the eleventh CBS module. The outputs of the tenth CBS module and the eleventh CBS module serve as the inputs to the fourth splicing layer. The fourth splicing layer is connected to the fifth C3F_X layer. The output of the fifth C3F_X layer serves as the input to the second detection head and the twelfth CBS module. The outputs of the twelfth CBS module and the first FFM module serve as the inputs to the fifth splicing layer. The fifth splicing layer is connected to the sixth C3F_X layer. The output of the sixth C3F_X layer serves as the input to the third detection head.

[0010] In one possible design of the first aspect, the C3T_X layer includes: a first branch consisting of a first CBS submodule and cross-layer residual connections; a second branch consisting of a second CBS submodule; the first branch and the second branch output to a first splicing sublayer; the output of the first splicing sublayer serves as the input to a third CBS submodule; and the output of the third CBS submodule serves as the output of the C3T_X layer. The SPP layer includes: a fourth CBS submodule, several max-pooling layers of different sizes, a second splicing sublayer, and a fifth CBS submodule. The output of the fourth CBS submodule serves as the input to the several max-pooling layers and the second splicing sublayer, and the output of the several max-pooling layers also serves as the input to the second splicing sublayer. The second splicing sublayer is connected to the fifth CBS submodule, and the output of the fifth CBS submodule serves as the output of the SPP layer. The C3F_X layer includes: a third branch consisting of a sixth CBS submodule and two CBS submodules, and a fourth branch consisting of a seventh CBS submodule. The third branch and the fourth branch are output to the third splicing sublayer. The output of the third splicing sublayer is used as the input of the eighth CBS submodule, and the output of the eighth CBS submodule is used as the output of the C3F_X layer.

[0011] In one possible design of the first aspect, the detection network employs a predicted box location loss, a classification loss, and a confidence loss.

[0012] In one possible design of the first aspect, the predicted box position loss is computed from the following quantities: $B$ denotes the predicted box; $B^{gt}$ denotes the gold-standard box; $w$, $h$, $w^{gt}$, and $h^{gt}$ are the widths and heights of the predicted box and the gold-standard box, respectively; $b$ and $b^{gt}$ are the center point coordinates of the predicted box and the gold-standard box, respectively; $\rho(\cdot)$ is the Euclidean distance; and $c$ is the length of the diagonal of the smallest rectangle enclosing $B$ and $B^{gt}$.
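The quantities listed for the box position loss (box widths and heights, center-point distance, and the diagonal of the smallest enclosing rectangle) are the terms of the widely used CIoU loss. Assuming that this is the intended form, which the text here does not confirm, a minimal Python sketch is given below. The function name `ciou_loss` and the `(cx, cy, w, h)` box layout are illustrative choices, not taken from the application.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h).

    Assumed form: 1 - IoU + rho^2(b, b_gt) / c^2 + alpha * v,
    with v = (4 / pi^2) * (atan(w_gt / h_gt) - atan(w / h))^2
    and alpha = v / ((1 - IoU) + v).
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared Euclidean distance between the two center points.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # Squared diagonal of the smallest rectangle enclosing both boxes.
    c2 = (max(p_x2, g_x2) - min(p_x1, g_x1)) ** 2 + (max(p_y2, g_y2) - min(p_y1, g_y1)) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / (c2 + eps) + alpha * v
```

Under this form the loss is close to zero for identical boxes and exceeds 1 for boxes that do not overlap, because the center-distance term still provides a signal when the IoU is zero.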

[0013] In one possible design of the first aspect, the classification loss is computed from the following quantities: $y$ denotes the category of the predicted box; $x$ is the model's predicted value; $\sigma(\cdot)$ is the sigmoid function; $n$ is the number of predicted boxes; and $\lambda$ is the weighting coefficient.
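The quantities named for the classification loss (a category label, a raw predicted value passed through a sigmoid, an average over n predicted boxes, and a weighting coefficient) are consistent with a weighted binary cross-entropy. The sketch below assumes that form; where the weighting coefficient actually enters the formula is not recoverable from the text, so applying it as a global multiplier is an assumption, as are the names `weighted_bce` and `weight`.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def weighted_bce(labels, logits, weight=1.0, eps=1e-12):
    """Weighted binary cross-entropy averaged over n predicted boxes.

    labels: 0/1 category indicators, one per predicted box.
    logits: raw model outputs before the sigmoid.
    weight: global weighting coefficient (placement is an assumption).
    """
    n = len(labels)
    total = 0.0
    for y, x in zip(labels, logits):
        # Clamp the probability so that log() never receives 0.
        p = min(max(sigmoid(x), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -weight * total / n
```

With a logit of 0 for every box the predicted probability is 0.5, so the unweighted loss equals ln 2 regardless of the labels.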

[0014] In one possible design of the first aspect, the confidence loss is computed from the following quantities: $y$ is the true label of the sample; $p_t$ is the predicted probability corresponding to the true category; and $\alpha$ and $\gamma$ are the category balance factor and the focusing parameter, respectively.
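A true label, the predicted probability of the true category, a category balance factor, and a focusing parameter are the ingredients of the focal loss. Assuming the confidence loss takes the standard alpha-balanced focal form, a minimal sketch follows. The default values `alpha=0.25` and `gamma=2.0` are common choices from the literature, not values stated in the application.

```python
import math

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    """Alpha-balanced focal loss for one sample (assumed form).

    y: true label, 0 or 1.
    p: predicted probability of the positive class.
    alpha: category balance factor; gamma: focusing parameter.
    """
    # Probability assigned to the true category, and its balance weight.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    # The (1 - p_t)^gamma factor down-weights samples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` and `alpha=1` for a positive sample reduces this to the plain cross-entropy, which is a quick way to check an implementation.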

[0015] In a second aspect, an electronic device is provided, comprising: a processor, and a memory coupled to the processor, the memory being used to store a computer program; the processor being used to execute the computer program stored in the memory, such that the electronic device performs the following process: the obtained original anteroposterior (AP) and lateral radiographs are preprocessed to obtain the AP and lateral radiographs to be detected; and the YOLO model is used to process the AP and lateral radiographs to be detected to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes an AP view processing branch that takes the AP radiograph to be detected as input and a lateral view processing branch that takes the lateral radiograph to be detected as input. The two branches have a shared part for extracting features common to the AP and lateral radiographs to be detected, and an independent part for extracting the view-specific high-level semantic features of each. The shared part performs primary feature extraction on the AP and lateral radiographs to be detected to obtain primary AP view features and primary lateral view features. The independent part performs secondary feature extraction on the primary AP view features to obtain secondary AP view features, and on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

[0016] In a third aspect, a computer-readable storage medium is provided, including a computer program or instructions that, when executed on a computer, cause the computer to perform the following process: the obtained original anteroposterior (AP) and lateral radiographs are preprocessed to obtain the AP and lateral radiographs to be detected; and the YOLO model is used to process the AP and lateral radiographs to be detected to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes an AP view processing branch that takes the AP radiograph to be detected as input and a lateral view processing branch that takes the lateral radiograph to be detected as input. The two branches have a shared part for extracting features common to the AP and lateral radiographs to be detected, and an independent part for extracting the view-specific high-level semantic features of each. The shared part performs primary feature extraction on the AP and lateral radiographs to be detected to obtain primary AP view features and primary lateral view features. The independent part performs secondary feature extraction on the primary AP view features to obtain secondary AP view features, and on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of the structure of the intelligent auxiliary diagnostic device for common knee diseases provided in the embodiments of this application; Figure 2 is a schematic diagram of the YOLO model structure provided in the embodiments of this application; Figure 3 is a schematic diagram illustrating the localization and classification process for knee diseases in this application.

Detailed Description

[0019] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0020] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0021] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0022] Before introducing the embodiments of this application, a brief description of the background research of this application will be given: Knee diseases are diverse, with some relatively rare in clinical practice. Few doctors possess comprehensive and specialized diagnostic knowledge. Furthermore, plain radiographs, being overlapping 2D images, can mask certain abnormalities and have low density resolution. Therefore, for most inexperienced junior doctors, when lesions are not obvious or they are unfamiliar with the disease, misdiagnosis and missed diagnoses are common, causing irreparable harm to patients. In addition, the variety of knee diseases makes it difficult for researchers to collect sufficient imaging data for each disease; current AI research on the knee focuses on single diseases.

[0023] Radiologists often combine anteroposterior and lateral radiographs when interpreting images to diagnose knee joint lesions. These two perspectives complement each other, improving lesion detection rate and localization accuracy. Existing computer vision target detection models mostly use single-view input or employ simple multi-channel stitching / fusion strategies, making it difficult to fully exploit the semantic correspondence and complementary information between the two perspectives. Furthermore, directly extracting features from the two inputs through independent or fully shared backbones often leads to either parameter redundancy (independent backbones) or loss of viewpoint features (fully shared backbones). Therefore, there is a need for a network design that can efficiently utilize shared parameters while preserving viewpoint-specific features, and a fusion module design that can establish semantic correspondences between the two perspectives at multiple scales to improve detection performance.

[0024] This application aims to provide a one-stop AI-assisted diagnostic system for knee diseases using X-ray plain films. Utilizing a deep learning model, it performs image recognition and intelligent analysis on standard X-ray plain films, enabling automatic identification and labeling of six common and clinically important diseases: degenerative joint diseases, fractures, bone infarction, benign bone tumors, intermediate bone tumors, and malignant bone tumors. By improving diagnostic efficiency and accuracy, it assists doctors in making rapid decisions. It has high research and application value in clinical practice and is particularly suitable for primary healthcare institutions and telemedicine systems.

[0025] As shown in Figure 1, this application provides an intelligent auxiliary diagnostic device for common knee diseases, including: a preprocessing module 101, used to preprocess the obtained original anteroposterior (AP) and lateral radiographs to obtain the AP and lateral radiographs to be detected, where preprocessing may include adjusting image size, normalization, and enhancement (such as rotation, cropping, and flipping); and a detection network 102, employing the YOLO model, used to process the AP and lateral radiographs to be detected to obtain detection results. The YOLO model comprises, in sequence: a backbone network 1021 for feature extraction, a neck network 1022 for fusing the extracted features, and a detection head 1023 for processing the fused features to obtain the detection result. The backbone network 1021 includes an AP view processing branch 10211 that takes the AP radiograph to be detected as input and a lateral view processing branch 10212 that takes the lateral radiograph to be detected as input. The AP view processing branch 10211 and the lateral view processing branch 10212 have a shared portion 103 for extracting features common to the AP and lateral radiographs to be detected, and an independent portion 104 for extracting the view-specific high-level semantic features of each. The shared portion 103 performs primary feature extraction on the AP and lateral radiographs to be detected to obtain primary AP view features and primary lateral view features, respectively. The independent portion 104 performs secondary feature extraction on the primary AP view features to obtain secondary AP view features, and on the primary lateral view features to obtain secondary lateral view features. The output of the independent portion 104 is used as the input of the neck network 1022.

[0026] A YOLO detection model is constructed, comprising a backbone network, a neck network, and a detection head. The backbone network includes an anteroposterior (AP) view processing branch and a lateral view processing branch. The two branches have a shared part for extracting features common to the AP and lateral view images, as well as an independent part for extracting the view-specific high-level semantic features of each. This satisfies the network design requirement of efficiently utilizing shared parameters while preserving view-specific features, and also meets the design requirement of a fusion module that establishes semantic correspondence between the two viewpoints at multiple scales to improve detection performance. This ensures the efficiency and accuracy of AI-assisted diagnosis and can assist doctors in making quick and accurate decisions.

[0027] Figure 2 shows the specific structure of the YOLO model in an embodiment of this application.

[0028] The YOLO model includes: a backbone network, a neck network, and a detection head.

[0029] (i) Backbone network

The shared portion of the backbone network comprises, in sequence: a Focus layer; a first CBS module consisting of a convolutional layer, a batch normalization layer, and a SiLU activation function layer; a first C3T_X layer consisting of CBS submodules, cross-layer residual connections, and a splicing sublayer; and a second CBS module, where X represents the number of residual layers. The independent portion comprises, in sequence: a second C3T_X layer, a third CBS module, a third C3T_X layer, a fourth CBS module, and an SPP layer consisting of max-pooling layers and CBS submodules. The outputs of the second C3T_X layer, the third C3T_X layer, and the SPP layer serve as the inputs to the neck network.

[0030] The partially shared backbone network can extract multi-scale, high-level semantic feature representations from anteroposterior and lateral knee joint images, respectively. It can learn both viewpoint-invariant low-level texture / edge features and features that differ and complement each other from different viewpoints.

[0031] Specifically, the CBS module / CBS submodule is used to extract features, standardize feature distribution, and introduce nonlinearity. It includes: Conv layer, Batch Normalization (BN), and SiLU activation layer. The Conv layer is a convolution operation used to extract local spatial features, the BN layer performs batch normalization, which can stabilize training, and the SiLU activation layer is used for nonlinear transformation to enhance expressive power.
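The convolution, normalization, and activation sequence of a CBS module can be sketched in plain Python as follows. This is a simplified, single-channel illustration: the normalization uses the statistics of one feature map rather than a training batch, and the learned scale and shift parameters of batch normalization are omitted. The function names are hypothetical.

```python
import math

def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution on a single channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def batch_norm(fmap, eps=1e-5):
    """Normalize a feature map to zero mean and unit variance (simplified BN)."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in fmap]

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def cbs(image, kernel):
    """Conv -> BN -> SiLU, the order described for the CBS module."""
    return [[silu(v) for v in row] for row in batch_norm(conv2d(image, kernel))]
```

Applying `cbs` to a 4×4 input with a 3×3 kernel yields a 2×2 output, since no padding is used in this sketch.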

[0032] The Focus layer is used to downsample the spatial dimensions of the input image (halving the width and height) while increasing the number of channels (multiplying by 4) and preserving local pixel information. It includes a slice operation layer, a concatenate sub-layer, and a CBS sub-module. The slice operation layer slices the input image into four smaller images according to the parity (even or odd) of the row and column indices. The concatenate sub-layer concatenates the four smaller images along the channel dimension. The CBS sub-module performs preliminary processing on the concatenated features.
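The slicing and concatenation performed by the Focus layer can be illustrated with a short sketch. It handles a single-channel image stored as a list of rows with even height and width; the order of the four slices in the output is an assumption (it follows the order commonly used in YOLOv5-style implementations), and the trailing CBS sub-module is not included.

```python
def focus_slice(image):
    """Split an H x W image into four H/2 x W/2 slices by row/column parity.

    Returns a list of four channels, i.e. the result of concatenating the
    slices along the channel dimension.
    """
    return [
        [row[0::2] for row in image[0::2]],  # even rows, even columns
        [row[0::2] for row in image[1::2]],  # odd rows, even columns
        [row[1::2] for row in image[0::2]],  # even rows, odd columns
        [row[1::2] for row in image[1::2]],  # odd rows, odd columns
    ]

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
channels = focus_slice(image)
```

Every input pixel appears in exactly one of the four output channels, which is why the operation halves the resolution without discarding information.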

[0033] The C3T_X layer is used to further extract deep features and enhance information flow capabilities, while maintaining lightweight design and enhancing feature extraction capabilities. The C3T_X layer includes: a first branch consisting of a first CBS submodule and a cross-layer residual connection (Res Unit), and a second branch consisting of a second CBS submodule. The first branch and the second branch output to a first splicing sublayer. The output of the first splicing sublayer serves as the input to a third CBS submodule, and the output of the third CBS submodule serves as the output of the C3T_X layer.

[0034] The Res Unit performs cross-layer residual connections between different CBS submodules and finally adds them together for output.

[0035] The SPP layer is used to capture contextual information at different scales, enhancing the model's adaptability to large, medium, and small targets. The SPP layer includes: a fourth CBS submodule, several max pooling layers of different sizes (such as 5×5, 9×9, and 13×13), a second splicing sublayer, and a fifth CBS submodule. The output of the fourth CBS submodule serves as the input to several of the max pooling layers and the second splicing sublayer, and the output of several of the max pooling layers also serves as the input to the second splicing sublayer. The second splicing sublayer is connected to the fifth CBS submodule, and the output of the fifth CBS submodule serves as the output of the SPP layer.
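The pooling and splicing core of the SPP layer can be sketched as below for a single-channel feature map. The sketch assumes stride-1 pooling with border handling that keeps the output the same size as the input, which is what allows the pooled maps to be spliced with the unpooled one; the text does not state the stride or padding. The CBS submodules before and after the pooling are omitted, and the function names are hypothetical.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window, clipped at the borders,
    so the output has the same height and width as the input."""
    h, w, r = len(fmap), len(fmap[0]), k // 2
    return [[max(fmap[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]

def spp(fmap, kernel_sizes=(5, 9, 13)):
    """Splice the input with its max-pooled versions along the channel axis."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernel_sizes]

# A 6 x 6 map whose value at (i, j) is 6 * i + j.
fmap = [[6 * i + j for j in range(6)] for i in range(6)]
pyramid = spp(fmap)
```

Larger windows summarize wider context around each position, so the spliced output carries small-, medium-, and large-scale information at every location.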

[0036] Regarding the shared portion: the Focus layer (patch rearrangement) only rearranges pixels and is unaffected by viewpoint differences, so it can be fully shared; the first CBS module extracts basic textures, which are consistent between the two viewpoints; C3T_2 (a shallow residual-style block) extracts low-level contours and soft-tissue grayscale changes, which barely differ between the AP and lateral views, so sharing it is reasonable; the second CBS module is still a shallow convolution, so sharing it between the two viewpoints is the most stable choice.

[0037] Regarding the independent portion: the first C3T_6 module enters the spatial structure representation stage. At this point, the differences between the anteroposterior (AP) and lateral (LAT) structures are very obvious, and forced sharing would confuse the spatial structure representation. The second C3T_6 module extracts deeper features and has entered the high semantic level, so it cannot be shared. SPP (Spatial Pyramid Pooling) extracts "region-level" structures through multi-scale pooling. The skeletal spatial relationships of AP and LAT are different, so they must be learned separately. The subsequent C3F_2 / fourth CBS module (before entering the neck network) has entered the semantic recognition stage and cannot be shared at all; otherwise the subsequent dual detection head would have difficulty working.

[0038] Under the proposed "partial sharing" idea, the AP and lateral branches share parameters in designated lower layers of the backbone network, while the subsequent higher layers remain independent so that each branch learns the features specific to its own view. This achieves collaborative feature extraction and complementary expression at the structural level. It differs from the traditional "complete sharing" and "complete independence" schemes and combines parameter efficiency with view-specific expressive capability.
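The control flow of a partially shared backbone can be sketched with toy stand-ins for the network stages. Each "stage" below is a simple affine map rather than a convolutional block, and all names and numbers are hypothetical; the point is only that one set of weights processes both views first, after which each view has its own weights.

```python
def make_stage(scale, bias):
    """Toy stand-in for a network stage: an element-wise affine map."""
    return lambda feats: [scale * v + bias for v in feats]

class PartiallySharedBackbone:
    def __init__(self):
        # One set of weights applied to both views (shared portion).
        self.shared = make_stage(2.0, 0.0)
        # Separate weights per view (independent portion).
        self.ap_branch = make_stage(1.0, 1.0)
        self.lat_branch = make_stage(1.0, -1.0)

    def forward(self, ap_image, lat_image):
        # Primary feature extraction with the shared weights.
        ap_primary = self.shared(ap_image)
        lat_primary = self.shared(lat_image)
        # Secondary, view-specific feature extraction.
        return self.ap_branch(ap_primary), self.lat_branch(lat_primary)

backbone = PartiallySharedBackbone()
ap_feats, lat_feats = backbone.forward([1.0, 2.0], [1.0, 2.0])
```

If the shared stage holds S parameters and each independent stage holds I, this layout stores S + 2I parameters, compared with 2(S + I) for two fully independent backbones and S + I for a fully shared one.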

[0039] (ii) Neck network The neck network is used to fuse and enhance the multi-scale features output by the backbone network in order to better capture information about targets of different sizes and obtain features suitable for the detection head.

[0040] The neck network includes: a bidirectional cross-attention feature fusion (FFM) module, a feature pyramid network (FPN), and a path aggregation network (PAN).

[0041] Specifically, the FFM module includes: a first convolutional block attention (CBAM) module, a second CBAM module, a third CBAM module, a fourth CBAM module, a first query, key, and value (QKV) matrix, a second QKV matrix, a third QKV matrix, a fourth QKV matrix, a first splicing layer, a fifth CBS module, a sixth CBS module, a seventh CBS module, a first squeeze-and-excitation (SE) module, a second SE module, and a third SE module. Features from the anteroposterior view processing branch serve as inputs to the first CBAM module and the second CBAM module, and features from the lateral view processing branch serve as inputs to the third CBAM module and the fourth CBAM module. The outputs of the first CBAM module and the first SE module serve as the inputs of the third QKV matrix; the outputs of the second CBAM module and the third CBAM module serve as the inputs of the first QKV matrix and of the second QKV matrix; and the outputs of the fourth CBAM module and the first SE module serve as the inputs of the fourth QKV matrix. The outputs of the first QKV matrix and the second QKV matrix serve as the inputs of the first splicing layer. The first splicing layer, the fifth CBS module, and the first SE module are connected in sequence; the third QKV matrix, the sixth CBS module, and the second SE module are connected in sequence; and the fourth QKV matrix, the seventh CBS module, and the third SE module are connected in sequence.

[0042] By cascading CBAM, bidirectional cross-attention (QKV), splicing, and SE modules within the FFM module, a semantic alignment and channel reweighting process for the two views is formed, which differs from conventional simple splicing / weighting or unidirectional attention. The first-level fusion lets the two views complement each other and improves the stability of lesion identification; the second-level fusion lets the anteroposterior and lateral views each "look back at themselves", improving the view-specific expressive ability. This two-level fusion structure mirrors the doctor's cognitive process of "first combining the anteroposterior and lateral views, and then observing them separately".

[0043] The feature pyramid network includes: a first C3F_X layer combining CBS sub-modules and splicing sub-layers, an eighth CBS module, a second C3F_X layer, a ninth CBS module, a first upsampling (UPS) layer, a second splicing layer, a third C3F_X layer, a tenth CBS module, a second upsampling layer, a third splicing layer, and a fourth C3F_X layer.

[0044] The path aggregation network includes: the eleventh CBS module, the fourth splicing layer, the fifth C3F_X layer, the twelfth CBS module, the fifth splicing layer, and the sixth C3F_X layer.

[0045] The inputs of the first C3F_X layer and the second C3F_X layer are the outputs of the SPP layer. The first C3F_X layer is connected to the eighth CBS module, and the second C3F_X layer is connected to the ninth CBS module. The outputs of the eighth and ninth CBS modules serve as the inputs of the first FFM module. The output of the first FFM module serves as the input of the first upsampling layer and the fifth splicing layer. The outputs of the two second C3T_X layers in the independent part serve as the inputs of the second FFM module, and the outputs of the two third C3T_X layers in the independent part serve as the inputs of the third FFM module. The outputs of the first upsampling layer and the third FFM module serve as the inputs of the second splicing layer. The second splicing layer, the third C3F_X layer, and the tenth CBS module are connected sequentially. The output of the tenth CBS module serves as the input of the second upsampling layer and the fourth splicing layer. The outputs of the second FFM module and the second upsampling layer serve as the inputs of the third splicing layer. The third splicing layer is connected to the fourth C3F_X layer. The output of the fourth C3F_X layer serves as the input of the first detection head (Detect) and the eleventh CBS module. The outputs of the tenth CBS module and the eleventh CBS module serve as the inputs of the fourth splicing layer. The fourth splicing layer is connected to the fifth C3F_X layer. The output of the fifth C3F_X layer serves as the input of the second detection head and the twelfth CBS module. The outputs of the twelfth CBS module and the first FFM module serve as the inputs of the fifth splicing layer. The fifth splicing layer is connected to the sixth C3F_X layer. The output of the sixth C3F_X layer serves as the input of the third detection head.

[0046] The C3F_X layer includes: a third branch consisting of a sixth CBS submodule and two CBS submodules, and a fourth branch consisting of a seventh CBS submodule. The third branch and the fourth branch are output to the third splicing sublayer. The output of the third splicing sublayer is used as the input of the eighth CBS submodule, and the output of the eighth CBS submodule is used as the output of the C3F_X layer.

[0047] The FFM module is a fusion module located between the backbone network and the detection head. It includes: four CBAM attention sub-modules, two sets of cross-view QKV mapping units (initial fusion), a set of feature concatenation units, CBS convolutional units, SE channel attention units, and a second QKV cross-view enhancement unit for anterior and lateral views, along with its subsequent CBS and SE modules. It performs bidirectional multi-head cross-attention feature fusion (AP→LAT, LAT→AP) on AP and LAT features at multiple scales, establishing semantic correspondences across views. The main fusion process includes: CBAM (attention enhancement of features in spatial and channel dimensions to highlight key structural regions) pre-enhancement → Q / K / V projection → bidirectional multi-head cross-attention feature fusion calculation → bidirectional output stitching → passing through CBS and SE attention modules (to achieve channel-level adaptive weighting and fusion optimization, outputting feature maps that fuse structural and semantic information from both anteroposterior and lateral views) → lateral features and fused features are subjected to Q / K / V calculation to obtain lateral enhancement features for lateral image detection; anteroposterior features and fused features are subjected to Q / K / V calculation to obtain anteroposterior enhancement features for anteroposterior image detection → fed to multi-scale aggregation (FPN / PAN) and detection head.

[0048] Feature Pyramid Network (FPN) can fuse high-level semantic features with low-level spatial resolution features from top to bottom, improving the ability to detect small targets.

[0049] Path aggregation networks (PANs) can enhance the semantic information of low-level features from the bottom up, ensuring that gradients and information flow fully between multi-scale features.

[0050] The multi-scale feature map output by the neck network ensures that the detection head can simultaneously predict large, medium, and small targets.

[0051] (III) Detection Head The detection head predicts the multi-scale feature map output by the neck network to obtain the target category probability and bounding box information.

[0052] The detection head mainly includes: a convolutional prediction layer, which maps multi-scale feature maps to the prediction space and outputs the class probability and bounding box offset of each anchor point; an anchor mechanism, which provides predefined reference boxes to facilitate the prediction of targets of different sizes; and a non-maximum suppression (NMS) post-processing layer, which removes duplicate predictions and retains the best detection results.

[0053] The detection head ultimately outputs the category, confidence score, and bounding box position for each target, thus achieving complete target detection functionality.

[0054] (iv) Loss Function The detection network employs bounding box location loss, classification loss, and confidence loss.

[0055] Among them, the bounding box location loss $L_{box}$ is:

$$L_{box} = 1 - \mathrm{IoU}(B, B^{gt}) + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}(B, B^{gt})) + v}$$

where $B$ indicates the prediction box and $B^{gt}$ indicates the gold standard box; $w$, $h$, $w^{gt}$, and $h^{gt}$ are the widths and heights of the prediction box and the gold standard box, respectively; $b$ and $b^{gt}$ are the center point coordinates of the prediction box and the gold standard box, respectively; $\rho(\cdot)$ is the Euclidean distance; and $c$ is the length of the diagonal of the smallest rectangle enclosing $B$ and $B^{gt}$.
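
The quantities defined in this paragraph (box widths and heights, center-point distance, and the diagonal of the smallest enclosing rectangle) match the Complete-IoU (CIoU) form of box loss. The sketch below assumes that form and assumes boxes given as `(cx, cy, w, h)`; the function name `ciou_loss` and the `eps` stabilizer are illustrative choices, not part of this application.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU-style box loss for two boxes in (cx, cy, w, h) format."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Convert to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # Squared Euclidean distance between the two center points.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    # Squared diagonal of the smallest rectangle enclosing both boxes.
    cw = max(p_x2, g_x2) - min(p_x1, g_x1)
    ch = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((5, 5, 2, 2), (5, 5, 2, 2))   # close to 0 for identical boxes
apart = ciou_loss((0, 0, 2, 2), (10, 0, 2, 2))  # above 1 for disjoint boxes
```

Unlike plain IoU, the center-distance term still provides a useful signal when the two boxes do not overlap at all.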

[0056] The classification loss $L_{cls}$ is calculated using the binary cross-entropy function with logits:

$$L_{cls} = -\frac{1}{n}\sum_{i=1}^{n} w\left[\,y_i \log \sigma(x_i) + (1 - y_i)\log\bigl(1 - \sigma(x_i)\bigr)\right]$$

where $y_i$ indicates the category of the $i$-th predicted box, $x_i$ is the model's predicted value, $\sigma(\cdot)$ is the sigmoid function, $n$ is the number of predicted boxes, and $w$ is the weighting factor for the loss (default is 1).
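
A minimal sketch of binary cross-entropy computed directly on logits is given below. It uses the standard numerically stable rewriting `max(x, 0) - x*y + log(1 + exp(-|x|))`, which equals the sigmoid-plus-log form; the function name and the scalar `weight` argument are assumptions for illustration.

```python
import math


def bce_with_logits(logits, targets, weight=1.0):
    """Mean binary cross-entropy over n predictions, taking raw logits as input."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        total += weight * (max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return total / len(logits)


uncertain = bce_with_logits([0.0], [1])        # log(2): the model is undecided
confident = bce_with_logits([8.0, -8.0], [1, 0])  # small loss: confident and correct
```

Working on logits rather than probabilities avoids taking the logarithm of values that have been rounded to exactly 0 or 1.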

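[0057] The confidence loss $L_{conf}$ is calculated using the focal loss function:

$$L_{conf} = -\alpha\,(1 - p_t)^{\gamma}\,\log(p_t), \qquad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$$

where $p$ is the probability that the model predicts the sample as positive, $y$ is the true label of the sample, and $p_t$ is the predicted probability corresponding to the true category; $\alpha$ and $\gamma$ are the category balance factor and the focusing parameter, set to 0.25 and 2 respectively.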
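
The focal loss described here can be sketched for a single prediction as follows. The sketch applies the balance factor α uniformly, as the paragraph states it; many implementations instead use α for positives and 1 − α for negatives, so that detail is an assumption. The name `focal_loss` and the `eps` clamp are illustrative.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample; p is the predicted positive probability."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: dominates the total loss
mid = focal_loss(0.5, 1)   # 0.25 * 0.25 * log(2)
```

With γ = 2, a sample predicted at 0.9 contributes only 1% of the weight it would have under plain cross-entropy, which keeps the many easy background boxes from swamping the few hard ones.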

[0058] (v) YOLO Model Training. Step 1: Data Preparation. In this step, image data of the target to be detected is acquired (the data for each knee joint includes one anteroposterior view and one lateral view), and the image data is labeled to generate a corresponding txt-format annotation file. The annotation content includes the target category and bounding box coordinates. At the same time, the image data is preprocessed, for example by resizing, normalization, and augmentation (e.g., rotation, cropping, flipping), to improve the robustness and generalization ability of the trained model.

[0059] Step 2: Construct a partially shared dual-view fusion network model. 1. A dual-branch backbone network structure is constructed to process anteroposterior and lateral images of the knee joint. The backbone network includes multi-level convolutional feature extraction modules. To account for both the common features and the view-specific features of the two views, this embodiment adopts a partially shared structure. Specifically: ① The first few layers of the backbone network (the Focus-CBS-C3T_2-CBS structure in Figure 2) employ a shared-parameter convolutional structure to extract common low-level features such as texture, edges, and brightness gradients from both anteroposterior and lateral views. ② The high-level feature extraction modules of the backbone network (the remaining, non-shared layers) use an independent-parameter structure, one for the anteroposterior branch and one for the lateral branch, to extract high-semantic features related to the viewing direction, such as bony structure morphology, projection differences, and lesion shadow manifestations. In this way, images from both views share shallow features in the initial stage, which helps reduce redundant computation and improves representation stability, while the independent structures in deeper stages ensure that the model can accurately learn view-specific features, improving the reliability of medical structure recognition.

[0060] 2. The dual-view feature enhancement and fusion module in this application embodiment includes a two-level fusion structure. First-level fusion: bidirectional cross-fusion of anteroposterior (AP) and lateral (LAT) features.

[0061] Second-level fusion: The results of the first-level fusion are combined with the original AP / LAT features again for view-specific enhancement.

[0062] This two-level structure can simulate the diagnostic logic of "first comprehensive observation, then reviewing the two perspectives separately" in clinical image reading, enabling the model to simultaneously possess cross-perspective consistency learning ability and perspective-specific enhancement ability.

[0063] First-level fusion: attention-guided fusion of the original anteroposterior and lateral features + bidirectional cross-fusion. ① Input features. The anteroposterior features from the backbone network are denoted as $F_{ap}$, and the lateral features are denoted as $F_{lat}$.

[0064] ② View-wise attention enhancement through CBAM. To enhance the key regions of the two views separately, the embodiments of this application first pass $F_{ap}$ and $F_{lat}$ through separate CBAM modules, and the enhanced features are denoted as $\hat{F}_{ap} = \mathrm{CBAM}(F_{ap})$ and $\hat{F}_{lat} = \mathrm{CBAM}(F_{lat})$. Through the combined action of channel attention and spatial attention, CBAM first enhances the salient regions of AP and LAT respectively, providing a stable feature base for subsequent cross-view fusion.

[0065] ③ Generate dual-view query / key / value features using QKV units. The "QKV" unit in Figure 2 includes: convolution + flattening, linear transformation, reshaping, and generation of the Q, K, and V tensors. For the CBAM-enhanced features $\hat{F}_{ap}$ and $\hat{F}_{lat}$, the QKV transformation yields $(Q_{ap}, K_{ap}, V_{ap}) = \mathrm{QKV}(\hat{F}_{ap})$ and $(Q_{lat}, K_{lat}, V_{lat}) = \mathrm{QKV}(\hat{F}_{lat})$. The generated Q, K, and V are used to perform cross-view information interaction.

[0066] ④ Bidirectional cross-attention fusion. This step applies attention in two directions simultaneously. A) Query LAT from AP: $F_{ap \to lat} = \mathrm{Softmax}\!\left(Q_{ap} K_{lat}^{T} / \sqrt{d}\right) V_{lat}$. B) Query AP from LAT: $F_{lat \to ap} = \mathrm{Softmax}\!\left(Q_{lat} K_{ap}^{T} / \sqrt{d}\right) V_{ap}$. This step enables "mutual referencing" between the two views, simulating the process a clinician uses when interpreting images: "viewing from the anteroposterior to the lateral view, and vice versa." Here, $d$ represents the channel dimension of the query vector Q and the key vector K. Scaling the attention scores by $\sqrt{d}$ keeps the Softmax inputs in a moderate range and thereby mitigates vanishing or exploding gradients in the Softmax operation.
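
The two attention directions can be illustrated with a small single-head sketch on plain Python lists. It assumes the usual scaled dot-product form, with queries from one view and keys/values from the other, and assumes both views have the same number of tokens so the outputs can be concatenated channel-wise; all function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(Q, K, V):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def bidirectional_cross_attention(ap, lat):
    """ap and lat are (Q, K, V) triples; returns the channel-wise concatenation."""
    q_ap, k_ap, v_ap = ap
    q_lat, k_lat, v_lat = lat
    ap_to_lat = cross_attention(q_ap, k_lat, v_lat)  # AP queries read LAT values
    lat_to_ap = cross_attention(q_lat, k_ap, v_ap)   # LAT queries read AP values
    return [a + b for a, b in zip(ap_to_lat, lat_to_ap)]


ap = ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 2.0]])
lat = ([[0.0, 1.0]], [[0.0, 1.0]], [[3.0, 4.0]])
fused = bidirectional_cross_attention(ap, lat)
print(fused)  # [[3.0, 4.0, 1.0, 2.0]]
```

With a single token per view the softmax weight is exactly 1, so each direction simply returns the other view's value vector, which makes the data flow easy to check by hand.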

[0067] ⑤ Feature concatenation + CBS convolutional fusion + SE module for global channel squeeze and recalibration. The fused features from the two attention directions, $F_{ap \to lat}$ and $F_{lat \to ap}$, are concatenated; the feature dimensions are then uniformly adjusted using CBS (Conv + BN + SiLU), and the result passes through the SE module to obtain the first-level fusion output: $F_{fuse} = \mathrm{SE}(\mathrm{CBS}(\mathrm{Concat}(F_{ap \to lat}, F_{lat \to ap})))$. Second-level fusion: the fusion results are fed back to the anteroposterior and lateral branches respectively (view-specific enhancement). To simulate the habit of doctors of viewing each view individually after fusing information from the two views, this application introduces a second-level cross-enhancement module to perform view-specific enhancement on the anteroposterior and lateral features respectively.

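[0068] ① Anteroposterior path enhancement. The original anteroposterior features $F_{ap}$ are passed through another CBAM to obtain $\tilde{F}_{ap}$. QKV transformation and cross-attention are then performed again between $\tilde{F}_{ap}$ and the first-level fusion features $F_{fuse}$: with the query $Q_{ap}$ obtained from $\tilde{F}_{ap}$ and the key and value $K_{fuse}$ and $V_{fuse}$ obtained from $F_{fuse}$ through the QKV unit, the AP query attends to the fused features, $F'_{ap} = \mathrm{Softmax}\!\left(Q_{ap} K_{fuse}^{T} / \sqrt{d}\right) V_{fuse}$. The final input features of the anteroposterior image detection branch are then obtained through CBS + SE: $F^{out}_{ap} = \mathrm{SE}(\mathrm{CBS}(F'_{ap}))$.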

[0069] ② Lateral path enhancement. Similarly, the final input features for the lateral image detection branch, $F^{out}_{lat}$, can be obtained.

[0070] Step 3: Model Training. In this step, the output layer parameters of the model are set according to the number of categories in the dataset. The network weights are initialized from weights pre-trained on the COCO dataset (transfer learning) to accelerate training convergence. Training parameters are set, including the learning rate, batch size, number of epochs, optimizer type (SGD), and loss function (consisting of two parts: the loss for anteroposterior image detection and the loss for lateral image detection). The preprocessed training data (anteroposterior and lateral views together) is input into the two backbone branches of the dual-view fusion detection network for forward and backward propagation, iteratively updating the model parameters.

[0071] Specifically, the model was trained on an NVIDIA graphics card in a Python 3.8.5, Torch 1.7.1, and TensorFlow 2.6.0 environment. Approximately 80% of the total dataset was used for model training, and the remaining 20% was used for model validation. The training batch size was 32, and the network input image size was 640 × 640. X-ray films come in different sizes, and different input sizes produce different results; to obtain better prediction results, the input size can be changed to other values.
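
One way to realize the 80% / 20% split is sketched below. It assumes the split is made per knee joint, so that the anteroposterior and lateral views of the same knee always land in the same subset; the function name, the fixed seed, and the file names are hypothetical.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle (AP, LAT) pairs and split them into training and validation sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # seeded for a reproducible split
    cut = round(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Hypothetical file names: one AP and one LAT image per knee.
pairs = [(f"knee{i:03d}_ap.png", f"knee{i:03d}_lat.png") for i in range(10)]
train, val = split_pairs(pairs)
print(len(train), len(val))  # 8 2
```

Splitting at the level of pairs rather than individual images avoids leaking one view of a knee into training while its other view is used for validation.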

[0072] Step 4: Save the model In this step, the trained model is saved, including the model weight file and configuration file, as the basis for subsequent object detection applications.

[0073] The resulting trained dual-view fusion detection network combines the features of anteroposterior and lateral images of the knee joint to efficiently and accurately detect and classify target images. During the testing phase, both anteroposterior and lateral images of the knee joint can be input simultaneously for prediction, yielding prediction results for both images; alternatively, a single image (anteroposterior or lateral view) can be used for prediction, yielding a single image prediction result.

[0074] (V) Localization and classification of knee diseases. As shown in Figure 3, after the deep learning detection model is constructed, an X-ray of the knee area is input. After preprocessing such as data augmentation, the detection model can detect bone tumors on the knee X-ray. First, the X-ray is preprocessed into a 640 × 640 RGB image and input into the established model to obtain prediction results. If lesions are present, the detection model locates them on the image. The accuracy of localization is evaluated by the Intersection over Union (IOU) between the predicted bounding box and the gold standard bounding box. The three detection heads of the detection model output three feature maps of different sizes to predict on the input image; the larger the feature map, the smaller the targets it detects. Each image prediction result contains many candidate bounding boxes. Each bounding box consists of three parts: the box's category information, the box's location information, and the box's confidence score (0-1).
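
The IOU metric used to evaluate localization can be computed as in the following sketch, which assumes boxes in corner format `(x1, y1, x2, y2)`; the function name `iou` is an illustrative choice.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
gold = (1, 1, 3, 3)
score = iou(predicted, gold)  # overlap 1, union 7
```

An IOU of 1 means the predicted box coincides with the gold standard box, and 0 means the two boxes do not overlap.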

[0075] (vi) Non-maximum suppression post-processing. After the detection model makes an initial prediction on the input image, many candidate prediction boxes appear on the image. Most of these boxes overlap one another and do not correspond to distinct targets. Therefore, the initial detection results need to be post-processed using non-maximum suppression to reduce false positive prediction boxes. Non-maximum suppression first sets an IOU threshold. When many candidate prediction boxes appear in an image: (1) sort all boxes by confidence score and select the box with the highest score; (2) traverse the remaining boxes and delete any box whose IOU with the current highest-scoring box is greater than the set threshold; (3) select the highest-scoring box among the remaining unprocessed boxes and repeat the above process until all boxes have been processed.
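
The three steps above can be sketched as a greedy loop. The sketch assumes boxes in corner format `(x1, y1, x2, y2)` and an IOU threshold of 0.5, which is an illustrative value rather than one stated in this application; it also ignores class labels, whereas a practical detector would usually run the procedure per class.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    # Step (1): order candidates by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Step (2): drop boxes that overlap the current best box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step (3): the loop repeats with the best of what remains.
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box is removed because its IOU with the first is 81 / 119 ≈ 0.68, above the threshold, while the third box does not overlap either of the others and is kept.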

[0076] The embodiments of this application can also adapt clinical data through steps such as synchronous enhancement and alignment with regions of interest (ROI), so as to combine the structure with the application scenario and improve clinical usability.

[0077] Experimental design demonstrates that, under the same data and parameter budget, the model structure of this application embodiment significantly improves mAP and sensitivity compared to the single-view model, and maintains higher robustness in the case of missing views.

[0078] This application embodiment also provides an electronic device, including: a processor, and a memory coupled to the processor, the memory being used to store a computer program; the processor being used to execute the computer program stored in the memory, such that the electronic device performs the following process: preprocessing the obtained original anteroposterior and lateral views to obtain the anteroposterior view to be detected and the lateral view to be detected; and processing the anteroposterior view to be detected and the lateral view to be detected with a YOLO model to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes: an anteroposterior view processing branch that takes the anteroposterior view to be detected as input and a lateral view processing branch that takes the lateral view to be detected as input. The anteroposterior view processing branch and the lateral view processing branch have a shared part for extracting common features of the anteroposterior view to be detected and the lateral view to be detected, and an independent part for extracting view-related high semantic features of the anteroposterior view to be detected and the lateral view to be detected. The shared part is used to perform primary feature extraction on the anteroposterior view to be detected and the lateral view to be detected to obtain primary anteroposterior view features and primary lateral view features. The independent part is used to perform secondary feature extraction on the primary anteroposterior view features to obtain secondary anteroposterior view features and secondary feature extraction on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

[0079] Electronic devices can be computing devices such as desktop computers, laptops, handheld computers, and cloud servers. These electronic devices may include, but are not limited to, processors and memory.

[0080] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the electronic device, connecting various parts of the device via various interfaces and lines.

[0081] The memory can be used to store the computer program, and the processor implements various functions of the electronic device by running or executing the computer program stored in the memory and calling the data stored in the memory.

[0082] The memory may primarily include a program storage area and a data storage area. The program storage area may store the operating system, applications required for at least one function, etc.; the data storage area may store data created through use of the electronic device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.

[0083] This application embodiment also provides a storage medium, which is a computer-readable storage medium in which a computer program is stored. When the computer program is executed by a processor, it performs the following process: preprocessing the obtained original anteroposterior and lateral views to obtain the anteroposterior view to be detected and the lateral view to be detected; and processing the anteroposterior view to be detected and the lateral view to be detected with a YOLO model to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes: an anteroposterior view processing branch that takes the anteroposterior view to be detected as input and a lateral view processing branch that takes the lateral view to be detected as input. The anteroposterior view processing branch and the lateral view processing branch have a shared part for extracting common features of the anteroposterior view to be detected and the lateral view to be detected, and an independent part for extracting view-related high semantic features of the anteroposterior view to be detected and the lateral view to be detected. The shared part is used to perform primary feature extraction on the anteroposterior view to be detected and the lateral view to be detected to obtain primary anteroposterior view features and primary lateral view features. The independent part is used to perform secondary feature extraction on the primary anteroposterior view features to obtain secondary anteroposterior view features and secondary feature extraction on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

[0084] The computer program includes computer program code, which may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.

[0085] This application also provides a computer program product, including: a computer program or instructions that, when the computer program or instructions are run on a computer, cause the computer to perform any of the above possible implementation methods.

[0086] The above description is the preferred embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications are also considered to be within the scope of protection of this application.

Claims

1. An intelligent auxiliary diagnostic device for common knee diseases, characterized in that it comprises: a preprocessing module, used to preprocess the obtained original anteroposterior and lateral views to obtain the anteroposterior view to be detected and the lateral view to be detected; and a detection network, employing a YOLO model, used to process the anteroposterior view to be detected and the lateral view to be detected to obtain detection results. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes: an anteroposterior view processing branch that takes the anteroposterior view to be detected as input and a lateral view processing branch that takes the lateral view to be detected as input. The anteroposterior view processing branch and the lateral view processing branch have a shared part for extracting common features of the anteroposterior view to be detected and the lateral view to be detected, and an independent part for extracting view-related high semantic features of the anteroposterior view to be detected and the lateral view to be detected. The shared part is used to perform primary feature extraction on the anteroposterior view to be detected and the lateral view to be detected to obtain primary anteroposterior view features and primary lateral view features. The independent part is used to perform secondary feature extraction on the primary anteroposterior view features to obtain secondary anteroposterior view features and secondary feature extraction on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

2. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 1, characterized in that: the shared part comprises, in sequence: a Focus layer, a first CBS module consisting of a combination of a convolutional layer, a batch normalization layer, and a SiLU activation function layer, a first C3T_X layer consisting of CBS sub-modules, cross-layer residual connections, and splicing sub-layers, and a second CBS module, where X represents the number of residual layers. The independent part comprises, in sequence: a second C3T_X layer, a third CBS module, a third C3T_X layer, a fourth CBS module, and an SPP layer consisting of max pooling layers and CBS submodules. The outputs of the second C3T_X layer, the third C3T_X layer, and the SPP layer serve as the inputs to the neck network.

3. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 2, characterized in that, The neck network includes: a bidirectional cross-attention feature fusion (FFM) module, a feature pyramid network (FPN), and a path aggregation network (PAN). The FFM module includes: a first convolutional block attention (CBAM) module, a second CBAM module, a third CBAM module, a fourth CBAM module, a first query, key, and value (QKV) matrix, a second QKV matrix, a third QKV matrix, a fourth QKV matrix, a first splicing layer, a fifth CBS module, a sixth CBS module, a seventh CBS module, a first compression and excitation (SE) module, a second SE module, and a third SE module. Features from one side of the orthogonal slice processing branch serve as inputs to the first CBAM module and the second CBAM module. Features from one side of the lateral slice processing branch serve as inputs to the third CBAM module and the fourth CBAM module. The outputs of the first CBAM module and the first SE module serve as inputs to the third CBAM module. The first QKV matrix is input to the QKV matrix, and the outputs of the second and third CBAM modules are used as inputs to the first QKV matrix. The outputs of the fourth CBAM module and the first SE module are used as inputs to the fourth QKV matrix. The outputs of the first and second QKV matrices are used as inputs to the first stitching layer. The first stitching layer, the fifth CBS module, and the first SE module are connected sequentially. The third QKV matrix, the sixth CBS module, and the second SE module are connected sequentially. The fourth QKV matrix, the seventh CBS module, and the third SE module are connected sequentially. 
The feature pyramid network comprises: a first C3F_X layer combining CBS sub-modules and splicing sub-layers, an eighth CBS module, a second C3F_X layer, a ninth CBS module, a first upsampling layer, a second splicing layer, a third C3F_X layer, a tenth CBS module, a second upsampling layer, a third splicing layer, and a fourth C3F_X layer. The path aggregation network includes: an eleventh CBS module, a fourth splicing layer, a fifth C3F_X layer, a twelfth CBS module, a fifth splicing layer, and a sixth C3F_X layer. The inputs of the first C3F_X layer and the second C3F_X layer are the outputs of the SPP layer. The first C3F_X layer is connected to the eighth CBS module, and the second C3F_X layer is connected to the ninth CBS module. The outputs of the eighth and ninth CBS modules serve as the inputs of the first FFM module. The output of the first FFM module serves as the input of the first upsampling layer and the fifth stitching layer. The outputs of the two second C3T_X layers in the independent part serve as the inputs of the second FFM module. The outputs of the two third C3T_X layers in the independent part serve as the inputs of the third FFM module. The outputs of the first upsampling layer and the third FFM module serve as the inputs of the second stitching layer. The second stitching layer, the third C3F_X layer, and the tenth CBS module are connected sequentially. The output of the BS module serves as the input to the second upsampling layer and the fourth stitching layer. The output of the second FFM module and the second upsampling layer serves as the input to the third stitching layer. The third stitching layer is connected to the fourth C3F_X layer. The output of the fourth C3F_X layer serves as the input to the first detection head and the eleventh CBS module. The output of the tenth CBS module and the eleventh CBS module serves as the input to the fourth stitching layer. The fourth stitching layer is connected to the fifth C3F_X layer. 
The output of the fifth C3F_X layer serves as the input to the second detection head and the twelfth CBS module. The outputs of the twelfth CBS module and the first FFM module serve as the inputs to the fifth splicing layer. The fifth splicing layer is connected to the sixth C3F_X layer. The output of the sixth C3F_X layer serves as the input to the third detection head.
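
The FPN and PAN connections recited in claim 3 are easier to check when written down as a graph. The following Python sketch transcribes them into a dictionary that maps each node to the nodes feeding it. The short node names (`cbs10`, `cat4`, `c3t2_a`, and so on) are shorthand invented for this illustration, and the `_a`/`_b` suffixes on the C3T_X nodes are an assumption that the "two" layers are one per view branch. The internals of the FFM modules are treated as opaque.

```python
from graphlib import TopologicalSorter

# node -> list of nodes whose outputs it consumes (names are illustrative only)
NECK_INPUTS = {
    "c3f1": ["spp"], "c3f2": ["spp"],        # first/second C3F_X take SPP output
    "cbs8": ["c3f1"], "cbs9": ["c3f2"],
    "ffm1": ["cbs8", "cbs9"],                # first FFM module
    "ffm2": ["c3t2_a", "c3t2_b"],            # two second C3T_X layers
    "ffm3": ["c3t3_a", "c3t3_b"],            # two third C3T_X layers
    "up1": ["ffm1"],                         # first upsampling layer
    "cat2": ["up1", "ffm3"],                 # second splicing layer
    "c3f3": ["cat2"],
    "cbs10": ["c3f3"],
    "up2": ["cbs10"],                        # second upsampling layer
    "cat3": ["ffm2", "up2"],                 # third splicing layer
    "c3f4": ["cat3"],
    "head1": ["c3f4"],                       # first detection head
    "cbs11": ["c3f4"],
    "cat4": ["cbs10", "cbs11"],              # fourth splicing layer
    "c3f5": ["cat4"],
    "head2": ["c3f5"],                       # second detection head
    "cbs12": ["c3f5"],
    "cat5": ["cbs12", "ffm1"],               # fifth splicing layer
    "c3f6": ["cat5"],
    "head3": ["c3f6"],                       # third detection head
}

def ancestors(node, graph=NECK_INPUTS):
    """Return every node that feeds `node`, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph.get(cur, []))
    return seen

# static_order() raises graphlib.CycleError if the wiring contains a loop,
# so obtaining an order at all confirms the recited connections form a DAG.
order = list(TopologicalSorter(NECK_INPUTS).static_order())
print(order)
```

Under this reading, every detection head depends on all three FFM modules, which is one way to see that fused two-view features reach each output scale.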

4. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 3, characterized in that the C3T_X layer includes: a first branch consisting of a first CBS submodule and cross-layer residual connections, and a second branch consisting of a second CBS submodule; the first branch and the second branch output to a first splicing sublayer; the output of the first splicing sublayer serves as the input to a third CBS submodule; and the output of the third CBS submodule serves as the output of the C3T_X layer. The SPP layer includes: a fourth CBS submodule, several max-pooling layers of different sizes, a second splicing sublayer, and a fifth CBS submodule. The output of the fourth CBS submodule serves as the input to the several max-pooling layers and the second splicing sublayer, and the outputs of the several max-pooling layers also serve as inputs to the second splicing sublayer. The second splicing sublayer is connected to the fifth CBS submodule, and the output of the fifth CBS submodule serves as the output of the SPP layer. The C3F_X layer includes: a third branch consisting of a sixth CBS submodule and two CBS submodules, and a fourth branch consisting of a seventh CBS submodule. The third branch and the fourth branch output to a third splicing sublayer. The output of the third splicing sublayer serves as the input to an eighth CBS submodule, and the output of the eighth CBS submodule serves as the output of the C3F_X layer.
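
The SPP layer in claim 4 can be illustrated with a small pure-Python sketch. Several details are assumptions, not taken from the claim: the pooling kernel sizes (5, 9, 13), stride 1 with size-preserving padding (needed so the pooled maps can be spliced with the unpooled one), and the two CBS submodules being stand-in identity functions. A feature map is represented here as a list of channels, each a list of rows.

```python
def max_pool_same(channel, k):
    """Stride-1 max pooling over a k x k window; borders are clipped so H x W is preserved."""
    h, w = len(channel), len(channel[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [channel[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out

def spp(features, pool_sizes=(5, 9, 13),
        cbs_in=lambda f: f, cbs_out=lambda f: f):
    """Fourth CBS -> parallel max-pools -> splice with the unpooled map -> fifth CBS.

    cbs_in / cbs_out are hypothetical placeholders for the CBS submodules.
    """
    x = cbs_in(features)
    branches = [x] + [[max_pool_same(ch, k) for ch in x] for k in pool_sizes]
    spliced = [ch for branch in branches for ch in branch]  # channel-wise concat
    return cbs_out(spliced)

feature = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]          # one channel, 3 x 3
out = spp(feature, pool_sizes=(3,))
print(len(out), out[1])
```

With C input channels and m pooling sizes, the spliced map has (m + 1) x C channels before the final CBS submodule, which is why that submodule is typically used to bring the channel count back down.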

5. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 1, characterized in that the detection network employs a bounding box location loss, a classification loss, and a confidence loss.

6. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 5, characterized in that the bounding box location loss is defined by a formula over the following quantities: the predicted box and the gold standard box; the widths and heights of the predicted box and the gold standard box, respectively; the center point coordinates of the predicted box and the gold standard box, respectively; the Euclidean distance function; and the length of the diagonal of the smallest rectangle enclosing the predicted box and the gold standard box.
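
The formula itself is not legible in this text, but the quantities listed in claim 6 (the two boxes, their widths and heights, the Euclidean distance between their centers, and the diagonal of the smallest enclosing rectangle) are the ingredients of the Complete-IoU (CIoU) loss commonly used in YOLO-family detectors. Assuming that is the intended loss, a minimal Python sketch follows. The function name `ciou_loss`, the `(cx, cy, w, h)` box layout, and the `eps` stabilizer are choices made for this illustration and are not taken from the patent.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h). Assumed form, see lead-in."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Corner coordinates of both boxes
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Intersection over union
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared Euclidean distance between the two center points
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # Squared diagonal of the smallest rectangle enclosing both boxes
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes: close to 0
print(ciou_loss((0, 0, 2, 2), (10, 0, 2, 2)))  # disjoint boxes: greater than 1
```

Unlike plain IoU, the center-distance term keeps the loss informative when the boxes do not overlap at all, which is the usual motivation for this family of losses.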

7. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 5, characterized in that the classification loss is defined by a formula over the following quantities: the category of the predicted box; the model's predicted value; the sigmoid function; the number of predicted boxes, n; and a weighting coefficient.
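
The quantities named in claim 7 (a category label, a raw model prediction passed through a sigmoid, an average over n predicted boxes, and a weighting coefficient) suggest a weighted binary cross-entropy on logits. The sketch below assumes that form. In particular, applying a single scalar `weight` uniformly to every term is an assumption, since the claim text does not show where the coefficient enters. The function name and argument layout are illustrative.

```python
import math

def weighted_bce_with_logits(logits, labels, weight=1.0):
    """Mean over n boxes of -w * [y*log(s(x)) + (1-y)*log(1-s(x))], s = sigmoid.

    Uses the algebraically equivalent stable form
    max(x, 0) - x*y + log(1 + exp(-|x|)) to avoid log(0) for large |x|.
    """
    n = len(logits)
    total = 0.0
    for x, y in zip(logits, labels):
        total += weight * (max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return total / n

print(weighted_bce_with_logits([0.0], [1]))          # log(2) for an uninformative logit
print(weighted_bce_with_logits([10.0, -10.0], [1, 0]))  # small: both predictions are correct
```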

8. The intelligent auxiliary diagnostic device for common knee diseases as described in claim 5, characterized in that the confidence loss is defined by a formula over the following quantities: the probability, predicted by the model, that the sample is positive; the true label of the sample; the predicted probability corresponding to the true category; and the category balance factor and the focusing parameter.
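
A predicted positive probability, a true label, the probability assigned to the true category, a category balance factor, and a focusing parameter are together the terms of the focal loss. Assuming claim 8 refers to that loss, a minimal Python sketch is given below. The default values `alpha=0.25` and `gamma=2.0` are commonly used settings and are not stated in the patent. Using `alpha` for positives and `1 - alpha` for negatives is likewise an assumed convention.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class; y: true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true category
    alpha_t = alpha if y == 1 else 1.0 - alpha  # category balance factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

easy = focal_loss(0.9, 1)   # well-classified positive
hard = focal_loss(0.6, 1)   # poorly classified positive
print(easy, hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy. Larger `gamma` values shrink the contribution of easy samples more aggressively.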

9. An electronic device, characterized in that the electronic device includes: a processor, and a memory coupled to the processor. The memory is used to store a computer program; the processor is configured to execute the computer program stored in the memory, causing the electronic device to perform the following process: preprocessing the obtained original anteroposterior and lateral radiographs to obtain the anterior view to be detected and the lateral view to be detected; and processing the anterior view to be detected and the lateral view to be detected with a YOLO model to obtain a detection result. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes: an anterior view processing branch that takes the anterior view to be detected as input and a lateral view processing branch that takes the lateral view to be detected as input. The anterior view processing branch and the lateral view processing branch have a shared part for extracting common features of the anterior view to be detected and the lateral view to be detected, and an independent part for extracting view-related high semantic features of the anterior view to be detected and the lateral view to be detected. The shared part is used to perform primary feature extraction on the anterior view to be detected and the lateral view to be detected to obtain primary anterior view features and primary lateral view features. The independent part is used to perform secondary feature extraction on the primary anterior view features to obtain secondary anterior view features and secondary feature extraction on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a computer program or instructions that, when executed on a computer, cause the computer to perform the following process: preprocessing the obtained original anteroposterior and lateral radiographs to obtain the anterior view to be detected and the lateral view to be detected; and processing the anterior view to be detected and the lateral view to be detected with a YOLO model to obtain a detection result. The YOLO model comprises, in sequence: a backbone network for feature extraction, a neck network for fusing the extracted features, and a detection head for processing the fused features to obtain the detection result. The backbone network includes: an anterior view processing branch that takes the anterior view to be detected as input and a lateral view processing branch that takes the lateral view to be detected as input. The anterior view processing branch and the lateral view processing branch have a shared part for extracting common features of the anterior view to be detected and the lateral view to be detected, and an independent part for extracting view-related high semantic features of the anterior view to be detected and the lateral view to be detected. The shared part is used to perform primary feature extraction on the anterior view to be detected and the lateral view to be detected to obtain primary anterior view features and primary lateral view features. The independent part is used to perform secondary feature extraction on the primary anterior view features to obtain secondary anterior view features and secondary feature extraction on the primary lateral view features to obtain secondary lateral view features. The output of the independent part is used as the input of the neck network.
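
The process recited in claims 9 and 10 can be summarized as a data flow in which one shared stage is applied to both views and two separate stages follow. The Python sketch below is only a schematic of that flow. The class name `DualViewDetector` and all field names are hypothetical, and each stage is an arbitrary callable and not a real network.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DualViewDetector:
    preprocess: Callable[[Any], Any]
    shared: Callable[[Any], Any]           # shared part: same weights for both views
    independent_ap: Callable[[Any], Any]   # independent part, anterior view branch
    independent_lat: Callable[[Any], Any]  # independent part, lateral view branch
    neck: Callable[[Any, Any], Any]        # fuses the two secondary feature sets
    head: Callable[[Any], Any]             # produces the detection result

    def __call__(self, raw_ap, raw_lat):
        ap, lat = self.preprocess(raw_ap), self.preprocess(raw_lat)
        # Primary feature extraction: the same callable handles both views
        primary_ap, primary_lat = self.shared(ap), self.shared(lat)
        # Secondary feature extraction: one callable per view
        secondary_ap = self.independent_ap(primary_ap)
        secondary_lat = self.independent_lat(primary_lat)
        # The independent part's outputs are the neck's inputs
        return self.head(self.neck(secondary_ap, secondary_lat))

# Toy stand-ins on plain numbers, just to exercise the flow
model = DualViewDetector(
    preprocess=lambda x: x / 10,
    shared=lambda x: x + 1,
    independent_ap=lambda x: x * 2,
    independent_lat=lambda x: x * 3,
    neck=lambda a, b: a + b,
    head=lambda f: {"score": f},
)
result = model(10, 20)
print(result)
```

Passing the same `shared` callable for both views is what distinguishes this layout from two fully separate backbones (more parameters) or one fully shared backbone (no view-specific features).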