An occluded object detection method and system based on multi-view fusion

By employing a multi-view fusion method for occluded target detection, this approach achieves accurate alignment and efficient fusion of cross-view features, resolving the issue of blurred recognition in occluded areas and improving detection accuracy and real-time performance. It is applicable to scenarios such as intelligent monitoring and autonomous driving.

CN122265631A · Pending · Publication Date: 2026-06-23 · NORTH CHINA UNIV OF WATER RESOURCES & ELECTRIC POWER

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTH CHINA UNIV OF WATER RESOURCES & ELECTRIC POWER
Filing Date
2026-03-26
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies for occluded target detection suffer from problems such as inaccurate cross-view feature alignment, low fusion efficiency, blurred identification of occluded areas, and difficulty in balancing detection accuracy and real-time performance. In particular, they are unable to meet the needs of autonomous driving and real-time monitoring in complex scenarios.

Method used

A multi-view fusion detection method is adopted, which synchronously acquires images through distributed image acquisition devices and calibrates installation errors in real time. It combines an improved ResNet50 network to extract features, uses Kalman filtering to dynamically correct extrinsic parameters, adopts an occlusion-aware attention-weighted fusion strategy, and combines an improved YOLOv8 network for target detection, achieving accurate alignment and efficient fusion of cross-view features.

Benefits of technology

It improves cross-view alignment accuracy, enhances the accuracy of occluded area recognition, meets the detection performance requirements of autonomous driving and real-time monitoring, with a detection accuracy of ≥95%, a false negative rate of ≤5%, and an output frame rate of ≥20fps, while reducing deployment costs.

✦ Generated by Eureka AI based on patent content.

  • Figure CN122265631A_ABST

Abstract

The application discloses an occluded target detection method and system based on multi-view fusion. It belongs to the technical field of computer vision and is applicable to scenarios such as intelligent monitoring and autonomous driving. To address the inaccurate cross-view alignment, inefficient fusion, ambiguous occlusion recognition, and poor real-time performance of the prior art, the application synchronously acquires images and calibrates parameters using at least two devices. After adaptive filtering and distortion-correction preprocessing, enhanced features are extracted with an improved ResNet50 containing CBAM. Cross-view features are dynamically aligned based on Kalman filtering, and occluded areas are identified by combining three multi-dimensional indicators: feature missing rate, similarity, and depth difference. Features are optimized with a hierarchical fusion strategy and fed into an improved YOLOv8 containing an occlusion branch for detection. The alignment error is no more than 2 pixels, the occlusion recognition accuracy is no less than 92%, the detection accuracy is no less than 95%, and the frame rate is no less than 20fps, which effectively addresses the bottleneck of detection in occluded scenes and suits latency-sensitive applications.

Description

Technical Field

[0001] This invention relates to the field of computer vision and target detection technology, specifically to an occluded target detection method and system based on multi-view fusion, which is applicable to core application areas that require accurate identification of occluded targets, such as intelligent monitoring, autonomous driving, robot navigation, and intelligent transportation.

Background Technology

[0002] Object detection is one of the core tasks of computer vision, and its performance directly determines the reliability of downstream applications. In real-world scenarios, object occlusion (such as pedestrians occluding each other, vehicles occluding pedestrians, and objects occluding key targets) is a major bottleneck leading to a decrease in detection accuracy. Single-view detection methods rely solely on information from a single image, resulting in a severe loss of target features in occluded areas, leading to a false negative rate as high as 30%-50%, which cannot meet the needs of practical applications. For example, in densely populated areas of shopping malls, single-view cameras often misidentify multiple people as a single target due to overlapping pedestrians, or completely miss occluded children. In warehouse robot operation scenarios, when shelves obscure goods, single-view detection is prone to misjudging the type or quantity of goods, leading to sorting errors.

Existing multi-view detection methods have three major drawbacks:

Inaccurate cross-view feature alignment: Relying solely on fixed extrinsic parameter mapping without considering spatial misalignment caused by equipment installation errors and dynamic scene changes leads to interference during feature fusion. Especially in severe conditions such as heavy rain and backlighting, camera lens parameters are prone to temporary shifts, and fixed extrinsic parameters cannot be adjusted in real time, causing cross-view feature matching errors to expand to 8-12 pixels, directly affecting detection accuracy.

Inefficient fusion strategy: Existing methods use simple concatenation or weighted averaging, without distinguishing the differences between shallow texture features and deep semantic features, resulting in low utilization of effective information. For example, in autonomous driving scenarios, shallow texture features (such as door lines) from the side view of the vehicle and deep semantic features (such as license plate information) from the front view have different priorities. Simple fusion will mask key semantic information and lead to misjudgment of the target category.

Ambiguous identification of occluded areas: Relying solely on single-view depth information or threshold judgment cannot accurately distinguish occlusion from the texture of the target itself, resulting in a high false detection rate. For example, at intelligent traffic intersections, the texture of vehicles in a shadow area is similar to the features of the occluded area of adjacent vehicles. Existing methods often misjudge shadows as occlusions or miss small non-motorized vehicles that are actually occluded.

Existing technologies also lack real-time performance, with frame rates typically below 15fps in complex scenarios, making them unsuitable for latency-sensitive applications such as autonomous driving and real-time monitoring. For instance, autonomous vehicles need to complete target detection and decision-making within 100ms, but existing multi-view methods, due to their cumbersome feature processing, take over 200ms for a single detection, which can lead to delayed emergency avoidance. In real-time urban road monitoring, low frame rates can cause occluded target trajectories to break, making continuous tracking impossible.

Therefore, there is an urgent need for a detection scheme that can achieve accurate cross-view feature alignment, efficient fusion, and accurate occlusion recognition to overcome the performance bottlenecks of existing technologies.

Summary of the Invention

[0003] This invention aims to solve the technical problems existing in the detection of occluded targets, such as inaccurate cross-view feature alignment, low fusion efficiency, blurred identification of occluded areas, and difficulty in balancing detection accuracy and real-time performance. It provides a multi-view fusion occluded target detection scheme that balances accuracy and speed.

[0004] To achieve the above objectives, the present invention provides the following technical solution: An occluded target detection method based on multi-view fusion includes the following steps:

(1) Multi-view image acquisition: At least two distributed image acquisition devices acquire images synchronously through a time synchronization module, with an acquisition frequency of not less than 25fps. The intrinsic and extrinsic parameters of the devices are acquired at the same time, and the installation error is calibrated in real time.

(2) Image preprocessing: The acquired images are subjected, in sequence, to adaptive median filtering for noise reduction with a filtering window of 3×3 to 7×7, perspective transformation distortion correction, and scale normalization.

(3) In-view feature extraction and enhancement: Based on the improved ResNet50 network with an added CBAM module, shallow texture features (C1 to C2 layers) and deep semantic features (C3 to C5 layers) are extracted. Features are optimized by adaptive weighting and Sobel-operator edge enhancement, and feature dimensions are unified.

(4) Cross-view feature alignment: Kalman filtering is used to dynamically correct the extrinsic parameters. The improved ORB algorithm is used to match feature points and map them to the reference view coordinate system. After error verification (pixel deviation not exceeding 2) and interpolation completion, an aligned feature set is generated.

(5) Occlusion region identification: Pixel-level marking and occlusion degree evaluation of the occluded region are performed based on three multi-dimensional indicators, namely feature missing rate, cross-view feature similarity, and optional depth difference. A feature missing rate of not less than 40% is judged as potential occlusion; a cross-view feature similarity of less than 0.6 is judged as occlusion; and a depth difference of less than 0.5m is used for auxiliary verification.

(6) Multi-feature hierarchical fusion: Shallow features are fused using occlusion-aware attention weighting, and deep features are optimized by dimensionality reduction after channel concatenation and ECA-Net filtering.

(7) Target detection and optimization: The fused features are input into the improved YOLOv8 network with a dedicated occlusion branch. Through improved NMS and bounding box correction, the target category, bounding box, confidence score (not less than 0.5), and occlusion status are output, and the detection frame rate is not less than 20fps.
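The decision rule in step (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, the 0.3 response threshold is the example value given in the detailed description, and because the text does not state how the three indicators are combined into a final verdict, the sketch reports each indicator separately.

```python
import numpy as np


def feature_missing_rate(response, response_threshold=0.3):
    """Fraction of positions in a target region whose feature response
    is below the threshold (0.3 is the example value from the description)."""
    response = np.asarray(response, dtype=float)
    return float(np.mean(response < response_threshold))


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two aligned feature vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def judge_occlusion(missing_rate, similarity, depth_diff_m=None):
    """Apply the three thresholds of step (5) independently.

    Assumption: the indicators are returned side by side; how they are
    merged into one decision is not specified in the text.
    """
    return {
        "potential_occlusion": bool(missing_rate >= 0.40),
        "occluded": bool(similarity < 0.6),
        # Depth is optional; None means no depth sensor was available.
        "depth_confirms": None if depth_diff_m is None else bool(depth_diff_m < 0.5),
    }
```

With a missing rate of 45%, a similarity of 0.4, and a depth difference of 0.3m, `judge_occlusion(0.45, 0.4, 0.3)` flags all three indicators.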

[0005] An occluded target detection system based on multi-view fusion, used to implement the method of claim 1, comprises a multi-view image acquisition module, an image preprocessing module, a feature extraction and enhancement module, a cross-view feature alignment module, an occlusion region recognition module, a multi-feature fusion module, a target detection and control module, and a result output and storage module. The multi-view image acquisition module includes at least two high-definition cameras, a GPS synchronizer, and an extrinsic parameter calibrator. The image preprocessing module is hardware-accelerated using an FPGA. The feature extraction and enhancement module uses an improved ResNet50 deployed on a GPU. The cross-view feature alignment module employs CPU and GPU collaborative computing. The multi-feature fusion module is accelerated using a GPU. The target detection and control module deploys an improved YOLOv8. The result output and storage module supports JSON and XML format output with a storage rate of no less than 100MB / s.

Furthermore, the time synchronization module in step (1) is a GPS synchronizer or hardware trigger module with a synchronization accuracy of no more than 1ms, and the extrinsic parameter calibration is performed by an extrinsic parameter calibrator with a calibration error of no more than 0.5°.

Furthermore, the feature point mapping in step (4) is achieved through a perspective projection matrix. After error verification, outliers are removed, and the missing areas are filled using bilinear interpolation.

Furthermore, the degree of occlusion in step (5) is divided into mild, moderate, and severe. Mild occlusion corresponds to a feature loss rate of 40% to 59%, moderate occlusion corresponds to a feature loss rate of 60% to 79%, and severe occlusion corresponds to a feature loss rate of no less than 80%. The occluded area is marked by a mask value.

Furthermore, the improved NMS in step (7) dynamically adjusts the threshold according to the occlusion confidence: the higher the occlusion confidence, the more the threshold is lowered (by 0.1 to 0.2) to avoid missed detections.

Furthermore, the image preprocessing module uses a Xilinx Zynq UltraScale+ series FPGA, while the feature extraction and fusion module uses an NVIDIA A100 series GPU or an RTX 4090 series GPU.

Furthermore, the result output and storage module outputs the detection results through an Ethernet interface, while storing the original image and detection log. The log includes device parameters, occlusion assessment data, and detection confidence information.
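The occlusion grading and the confidence-dependent NMS threshold described above can be illustrated with a short sketch. Several points are assumptions rather than statements from the text: the function names are hypothetical, the grade boundaries are treated as half-open intervals (so a rate between 59% and 60% counts as mild), the reduction is interpolated linearly between 0.1 and 0.2 over an occlusion confidence in [0, 1], and the text does not say which NMS threshold is meant, so it is passed in as an opaque base value.

```python
def occlusion_degree(feature_loss_rate):
    """Map a feature loss rate (0.0 to 1.0) to an occlusion grade.

    Assumption: boundaries are half-open, i.e. [0.40, 0.60) is mild,
    [0.60, 0.80) is moderate, and 0.80 or more is severe.
    """
    if feature_loss_rate >= 0.80:
        return "severe"
    if feature_loss_rate >= 0.60:
        return "moderate"
    if feature_loss_rate >= 0.40:
        return "mild"
    return "none"


def lowered_nms_threshold(base_threshold, occlusion_confidence):
    """Lower the threshold by 0.1 to 0.2 for a box flagged as occluded.

    Assumption: the reduction grows linearly with occlusion confidence,
    from 0.1 at confidence 0.0 up to 0.2 at confidence 1.0.
    """
    c = min(max(occlusion_confidence, 0.0), 1.0)  # clamp to [0, 1]
    return base_threshold - (0.1 + 0.1 * c)
```

For a hypothetical base threshold of 0.5, a fully confident occlusion prediction would yield a threshold of about 0.3 under this mapping.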

[0006] Compared with the prior art, the present invention has the following significant advantages:

1. Improved cross-view alignment accuracy: Through dynamic extrinsic calibration and improved ORB feature matching, the alignment error is reduced from 5-8 pixels in existing technologies to ≤2 pixels, and the feature alignment accuracy is improved by ≥60%.

2. Improved accuracy of occlusion recognition: Based on multi-dimensional occlusion judgment indicators, the accuracy of occlusion area recognition is ≥92%, which is more than 35% higher than the existing single-view depth judgment method, and the false detection rate is effectively reduced (false detection rate ≤3%).

3. Detection performance balances accuracy and real-time performance: The layered fusion strategy improves feature utilization by 40%. Combined with improved YOLOv8 and hardware acceleration, the detection accuracy is ≥95% and the false negative rate is ≤5% in complex occlusion scenarios (such as densely populated intersections), with an output frame rate of ≥20fps, meeting the latency requirements of autonomous driving and real-time monitoring.

4. Strong scene adaptability: The scheme supports flexible expansion from 2 to 8 viewpoints, adapts to different indoor and outdoor lighting and occlusion intensity scenes, does not require a large amount of scene-specific training data, and reduces deployment costs by 25%.

Attached Figure Description

[0007] Figure 1 is a flowchart illustrating the implementation of an occlusion target detection method based on multi-view fusion according to the present invention. Figure 2 is a system framework diagram of an occlusion target detection system based on multi-view fusion according to the present invention.

Detailed Implementation

[0008] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0009] This invention proposes a method and system for detecting occluded targets based on multi-view fusion, characterized as follows.

(I) Occlusion Target Detection Method Based on Multi-view Fusion

The method includes the following steps:

S1: Multi-view image acquisition

N (N≥2) distributed image acquisition devices acquire multi-view images synchronously through a time synchronization module (such as GPS synchronization or hardware triggering), with an acquisition frequency of ≥25fps. The extrinsic parameters (position, attitude) and intrinsic parameters (focal length, distortion coefficient) of each acquisition device are acquired at the same time, and the device installation error is calibrated in real time by the extrinsic parameter calibration unit.

S2: Image preprocessing

  • Denoising: An adaptive median filtering algorithm dynamically adjusts the filtering window (3×3~7×7) according to the image noise intensity to remove Gaussian noise and salt-and-pepper noise.
  • Distortion correction: Based on the intrinsic parameters of the acquisition device, lens distortion is corrected using the perspective transformation formula to obtain a distortion-free image.
  • Scale normalization: All viewpoint images are adjusted to a preset resolution (such as 640×480 or 1280×960) to ensure consistency in feature extraction.

S3: In-view feature extraction and enhancement

  • Feature extraction: An improved ResNet50 deep convolutional neural network is used, with CBAM (convolutional block attention module) added in stages 3 to 5 of the network, to extract shallow texture features (C1 to C2 layers) and deep semantic features (C3 to C5 layers) of the images from each viewpoint.
  • Feature enhancement: An adaptive feature weighting unit dynamically allocates channel weights based on feature response values to enhance the features of the target region. At the same time, an edge enhancement unit (based on the Sobel operator) highlights the target contour information and generates an enhanced in-view feature map. A feature dimension unification unit then adjusts shallow texture features and deep semantic features to the same spatial resolution (e.g., 160×160×256) to provide a unified input format for cross-view alignment.

S4: Cross-view feature alignment

  • Dynamic extrinsic parameter calibration: Based on the device extrinsic parameters obtained in real time in step S1, combined with dynamic changes in the scene (such as sudden changes in illumination and micro-displacement of the device), the spatial mapping relationship of each viewpoint is dynamically updated through the extrinsic parameter correction model (based on Kalman filtering) to resolve the alignment deviation caused by fixed extrinsic parameters.
  • Feature point matching and coordinate mapping: For the enhanced feature maps of each viewpoint, the improved ORB (a feature point detection algorithm optimized for real-time performance) is used to extract key feature points (such as target edge corners and abrupt texture change points), and the spatial coordinate correspondence of feature points across viewpoints is calculated. Any one viewpoint is selected as the reference, and the feature points of the non-reference viewpoints are mapped to the reference viewpoint coordinate system through the perspective projection matrix to achieve feature space alignment.
  • Alignment error verification: An alignment error threshold is set (e.g., pixel deviation ≤ 2), consistency verification is performed on the mapped feature points, and abnormal points with excessive deviation are removed. For feature points that pass the verification, a bilinear interpolation algorithm is used to fill in the missing areas in the feature map and generate an aligned cross-view feature set.

S5: Occlusion area recognition

  • Construction of multi-dimensional occlusion judgment indicators:
    ① Feature missing rate: For the feature map of each view, calculate the percentage of pixels in the target region whose feature response value is lower than a preset threshold (e.g., 0.3). If the percentage is ≥40%, the region is determined to be a potential occlusion region.
    ② Cross-view consistency: Compare the feature consistency of the same spatial location in different viewpoints after alignment (such as the cosine similarity of feature vectors). If the similarity is <0.6 and the feature missing rate is ≥30%, the region is determined to be an occluded area.
    ③ Depth-assisted verification (optional): If the acquisition device contains a depth sensor, the occlusion area is further confirmed using the depth difference between the target and the occluder in the depth image (depth difference < preset distance threshold, such as 0.5m).
  • Occlusion region marking: The occlusion region marking unit marks the determined occlusion region at the pixel level (e.g., assigns a specific mask value) and records the spatial range and degree of occlusion (slight / moderate / severe) of the occlusion region, providing a basis for subsequent fusion strategy adjustment.

S6: Multi-feature hierarchical fusion

  • Layered fusion strategy design:
    ① Shallow texture feature fusion: For the aligned shallow features (C1~C2 layers), occlusion-aware attention-weighted fusion is adopted. For non-occluded areas, weights are assigned according to the feature clarity of each view, calculated based on edge strength (the higher the clarity, the greater the weight). For occluded areas, only the features of the non-occluded view are retained and given high weights (weight coefficient ≥ 0.8) to avoid interference from occlusion features.
    ② Deep semantic feature fusion: For deep features (C3~C5 layers), "feature concatenation + channel attention filtering" fusion is adopted. First, the deep features from each view are concatenated in the channel dimension. Then the channel attention module (ECA-Net) filters the channel features that contribute most to target detection (retaining the top 80% of channels by contribution) to reduce redundant information.
  • Feature fusion optimization: The dimensionality of the fused feature map is reduced by a 1×1 convolutional layer (e.g., the number of channels is reduced from 1024 to 512), and Batch Normalization is used to accelerate model convergence and ensure real-time performance.

S7: Target detection and result optimization

  • Detection based on fusion features: The optimized fusion features are input into the improved YOLOv8 detection network, which is optimized for occlusion scenarios: a dedicated prediction branch for occluded targets is added to the detection head, and it outputs the occluded target's category, bounding box, and occlusion confidence score.
  • Post-processing optimization:
    ① Improved Non-Maximum Suppression (NMS): For the predicted bounding boxes of occluded targets, the occlusion confidence is introduced to correct the NMS threshold (the higher the occlusion confidence, the lower the threshold, to avoid missed detections).
    ② Bounding box correction: Based on the feature space coordinates after cross-view alignment, the position of the bounding box of the occluded target is corrected (e.g., the bounding box details are supplemented by the target contour information from the non-occluded view).
  • Detection results output: Output the target category (e.g., pedestrian, vehicle, obstacle), accurate bounding box coordinates, confidence score (≥0.5), and occlusion status (no occlusion / slight occlusion / moderate occlusion / severe occlusion), with an output frame rate of ≥20fps.

(II) Occlusion Target Detection System Based on Multi-view Fusion

This system is used to implement the above detection method and includes the following functional modules and hardware support:

Multi-view image acquisition module

  • Hardware components: N (N≥2) high-definition industrial cameras (resolution ≥1920×1080, frame rate ≥30fps), a GPS time synchronizer (synchronization accuracy ≤1ms), and an extrinsic parameter calibrator (supports dynamic calibration, calibration error ≤0.5°).
  • Function: Enables simultaneous acquisition of multi-view images and real-time output of image data and device intrinsic / extrinsic parameters.

Image preprocessing module

  • Hardware support: FPGA chips (such as Xilinx Zynq UltraScale+) are used to accelerate parallel computing for denoising and distortion correction.
  • Function: Performs the denoising, distortion correction, and scale normalization of step S2, and outputs the preprocessed image.

Feature extraction and enhancement module

  • Hardware support: GPU (such as NVIDIA A100) or AI accelerator card, deploying an improved ResNet50 network.
  • Function: Performs the feature extraction and enhancement of step S3, and outputs an enhanced feature map with uniform dimensions.

Cross-view feature alignment module

  • Hardware support: CPU (such as Intel Xeon Gold) and GPU co-computing.
  • Function: Performs the dynamic extrinsic parameter calibration, feature point matching, and coordinate mapping of step S4, and outputs an aligned feature set.

Occlusion area recognition module

  • Function: Based on the multi-dimensional indicators in step S5, performs pixel-level marking of occlusion areas and evaluation of occlusion degree, and outputs the occlusion mask and evaluation results.

Multi-feature fusion module

  • Hardware support: GPU-accelerated fusion computing.
  • Function: Performs the hierarchical fusion and feature optimization of step S6, and outputs the dimensionality-reduced fused features.

Target detection and control module

  • Hardware support: GPU + CPU collaboration to deploy an improved YOLOv8 network.
  • Function: Performs the target detection and post-processing optimization of step S7 and outputs detection results, while feeding synchronization control signals back to the acquisition module to ensure system timing consistency.

Result output and storage module

  • Function: Outputs detection results via an Ethernet interface (supports JSON / XML format) and stores raw images and detection logs (storage rate ≥100MB / s) for subsequent tracing and model iteration optimization.
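The Kalman-filter-based extrinsic correction in step S4 is described only at a high level. The sketch below shows one simple way such a corrector could look, under assumptions that do not come from the text: the extrinsics are held as a plain parameter vector (for example three translations and three rotation angles), each parameter follows a random-walk model with independent noise, and the noise values `q` and `r` as well as the class name are hypothetical.

```python
import numpy as np


class ExtrinsicKalman:
    """Per-parameter Kalman filter that smooths noisy extrinsic readings.

    Assumptions (not from the source): random-walk state model,
    independent parameters, scalar process noise q and measurement noise r.
    """

    def __init__(self, initial_extrinsics, p0=1.0, q=1e-4, r=1e-2):
        self.x = np.asarray(initial_extrinsics, dtype=float).copy()  # state estimate
        self.P = np.full_like(self.x, p0)  # per-parameter estimate variance
        self.q = q
        self.r = r

    def update(self, measured_extrinsics):
        z = np.asarray(measured_extrinsics, dtype=float)
        # Predict: under a random walk the state is unchanged, variance grows.
        self.P = self.P + self.q
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.P / (self.P + self.r)
        self.x = self.x + gain * (z - self.x)
        self.P = (1.0 - gain) * self.P
        return self.x.copy()
```

Each call to `update` takes the latest calibrator reading and returns the smoothed extrinsics that would then be used to rebuild the perspective projection matrix. A larger `q` makes the filter follow real camera displacement faster, while a larger `r` suppresses calibrator jitter more strongly.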

[0010] To make the technical solution of this invention easier to understand, the following detailed description is provided in conjunction with a specific application scenario (intelligent traffic monitoring at urban intersections):

1. System Deployment

  • Data acquisition equipment: Four high-definition cameras (2560×1440 resolution, 30fps) are deployed at the four corners of the intersection, with a camera height of 3.5m and a field of view covering the entire intersection area. They are equipped with a GPS synchronizer (synchronization accuracy 0.5ms) and an extrinsic parameter calibrator (initial calibration error 0.3°).
  • Hardware platform: The architecture is "FPGA (Xilinx Zynq) + GPU (NVIDIA RTX 4090) + CPU (Intel i9-13900K)". The FPGA is responsible for preprocessing, the GPU is responsible for feature extraction, alignment, fusion, and detection, and the CPU is responsible for system control and data interaction.

2. Method Execution Steps

  • S1: The cameras synchronously acquire images of the intersection (including pedestrians, vehicles, and non-motorized vehicles) and at the same time obtain intrinsic parameters (focal length 5mm, distortion coefficient 0.01) and extrinsic parameters (updated in real time to correct for slight camera displacement caused by wind).
  • S2: The FPGA performs adaptive median filtering (window 3×3~5×5) and perspective transformation distortion correction, and normalizes the image to 1280×720 resolution.
  • S3: The GPU runs an improved ResNet50 (with added CBAM module) to extract shallow features (C2 layer: 320×320×128) and deep features (C5 layer: 40×40×1024), and outputs an enhanced feature map through adaptive weighting and Sobel edge enhancement.
  • S4: The extrinsic parameters are corrected based on Kalman filtering, and the improved ORB extracts feature points (500~800 per image), which are then mapped to the reference viewpoint (the camera at the northeast corner of the intersection), with the alignment error controlled within 1~2 pixels.
  • S5: The feature missing rate (e.g., 45% for the area occluded by a vehicle) and cross-view similarity (0.4 for the occluded area) are calculated and combined with the depth image (depth difference 0.3m) to mark the occluded area (e.g., the area where the vehicle occludes the pedestrian).
  • S6: Shallow features are weighted by sharpness (0.7 for unoccluded viewpoints and 0.3 for occluded viewpoints), and deep features are concatenated and then filtered through ECA-Net (80% of channels are retained). A 1×1 convolution reduces the dimensionality to 512 channels.
  • S7: The improved YOLOv8 outputs the target category (pedestrian, car, electric vehicle), bounding box (error ≤ 3 pixels), and confidence (≥ 0.6); the improved NMS corrects occluded target boxes; and the final detection results are output at 22fps.

3. Implementation Effect Verification

The test was conducted at this intersection for 72 consecutive hours (including morning rush hour, rainy days, and nighttime scenarios), and the results showed:

  • The pedestrian detection accuracy is 96.2% (compared to 81.5% for existing single-view methods), and the false negative rate is 4.1% (compared to 18.3% for existing methods).
  • The vehicle detection accuracy is 97.5% (90.2% with existing methods), and the false detection rate is 2.3% (7.8% with existing methods).
  • Even in occluded scenarios (such as three people occluding each other or vehicles occluding pedestrians), the detection accuracy remains above 92%, meeting the application requirements of intelligent traffic signal control and violation capture.
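The shallow-feature weighting used in step S6 of this scenario (0.7 for unoccluded viewpoints, 0.3 for occluded viewpoints) can be sketched as a per-pixel weighted average. This is only an illustration: the function name and array layout are hypothetical, the per-pixel normalization of weights is an assumption, and the sharpness-based weighting between unoccluded views described in the method is omitted for brevity.

```python
import numpy as np


def fuse_shallow(features, occluded, w_clear=0.7, w_occluded=0.3):
    """Fuse aligned shallow feature maps from several views.

    features: array of shape (V, H, W, C), one aligned map per view.
    occluded: boolean array of shape (V, H, W), True where a view is occluded.
    Assumption: weights are normalized per pixel so they sum to 1 across views.
    """
    features = np.asarray(features, dtype=float)
    occluded = np.asarray(occluded, dtype=bool)
    # Pick the raw weight of every view at every pixel.
    weights = np.where(occluded, w_occluded, w_clear)
    # Normalize across the view axis.
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Weighted sum over views, broadcasting weights over channels.
    return (weights[..., None] * features).sum(axis=0)
```

With two views and one of them occluded at a pixel, the normalized weights at that pixel are exactly 0.7 and 0.3; where both views are unoccluded, they contribute equally under this simplification.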

[0011] The present invention and its embodiments have been described above. This description is not restrictive, and the accompanying drawings are only one embodiment of the present invention; the actual structure is not limited thereto. In conclusion, if those skilled in the art are inspired by this description and design similar structures and embodiments without departing from the spirit of the invention, such designs should fall within the protection scope of the present invention.

Claims

1. An occluded target detection method based on multi-view fusion, characterized in that it includes the following steps:

(1) Multi-view image acquisition: At least two distributed image acquisition devices are used to achieve synchronous image acquisition through a time synchronization module with an acquisition frequency of not less than 25fps. The device internal and external parameters are acquired synchronously and the installation error is calibrated in real time.

(2) Image preprocessing: The acquired images are sequentially subjected to adaptive median filtering for noise reduction with a filtering window of 3×3 to 7×7, perspective transformation distortion correction, and scale normalization.

(3) In-view feature extraction and enhancement: Based on the improved ResNet50 network with added CBAM module, shallow texture features (C1 to C2 layers) and deep semantic features (C3 to C5 layers) are extracted. Features are optimized by adaptive weighting and edge enhancement based on Sobel operator, and feature dimensions are unified.

(4) Cross-view feature alignment: Kalman filtering is used to dynamically correct extrinsic parameters. The improved ORB algorithm is used to match feature points and map them to the reference view coordinate system. After error verification with pixel deviation not exceeding 2 and interpolation completion, an aligned feature set is generated.

(5) Occlusion region identification: Based on three multi-dimensional indicators, namely feature missing rate, cross-view feature similarity and optional depth difference, pixel-level marking and occlusion degree evaluation of occlusion region are realized. When the feature missing rate is not less than 40%, it is judged as potential occlusion; when the cross-view feature similarity is less than 0.6, it is judged as occlusion; and when the depth difference is less than 0.5m, it is used for auxiliary verification.

(6) Multi-feature hierarchical fusion: shallow features are fused using occlusion perception attention weighting, and deep features are optimized by dimensionality reduction after channel splicing and ECA-Net filtering;

(7) Target detection and optimization: The fused features are input into the improved YOLOv8 network with a dedicated occlusion branch. By improving NMS and bounding box correction, the target category, bounding box, confidence score of not less than 0.5 and occlusion status are output, and the detection frame rate is not less than 20fps.

2. An occlusion target detection system based on multi-view fusion, characterized in that it is used to implement the method of claim 1 and comprises: a multi-view image acquisition module, an image preprocessing module, a feature extraction and enhancement module, a cross-view feature alignment module, an occlusion region recognition module, a multi-feature fusion module, a target detection and control module, and a result output and storage module. The multi-view image acquisition module includes at least two high-definition cameras, a GPS synchronizer, and an extrinsic parameter calibrator. The image preprocessing module is hardware-accelerated using FPGA. The feature extraction and enhancement module utilizes an improved ResNet50 deployed on a GPU. The cross-view feature alignment module employs CPU and GPU collaborative computing. The multi-feature fusion module is accelerated using a GPU. The target detection and control module utilizes an improved YOLOv8 deployment. The result output and storage module supports JSON and XML format output with a storage rate of no less than 100MB / s.

3. The occlusion target detection method based on multi-view fusion according to claim 1, characterized in that: The time synchronization module in step (1) is a GPS synchronizer or hardware trigger module with a synchronization accuracy of no more than 1ms. The external parameter calibration is achieved by an external parameter calibrator and the calibration error does not exceed 0.5°.

4. The occlusion target detection method based on multi-view fusion according to claim 1, characterized in that: The feature point mapping in step (4) is achieved through the perspective projection matrix. After error verification, outliers are removed, and the missing areas are filled by bilinear interpolation.

5. The occlusion target detection method based on multi-view fusion according to claim 1, characterized in that: The degree of occlusion in step (5) is divided into mild, moderate and severe. Mild occlusion corresponds to a feature loss rate of 40% to 59%, moderate occlusion corresponds to a feature loss rate of 60% to 79%, and severe occlusion corresponds to a feature loss rate of no less than 80%. The occluded area is marked by a mask value.

6. The occlusion target detection method based on multi-view fusion according to claim 1, characterized in that: The improved NMS in step (7) dynamically adjusts the threshold by occlusion confidence. The higher the occlusion confidence, the lower the threshold by 0.1 to 0.2 to avoid missed detections.

7. The occlusion target detection system based on multi-view fusion according to claim 2, characterized in that: The image preprocessing module uses a Xilinx Zynq UltraScale+ series FPGA, while the feature extraction and fusion module uses an NVIDIA A100 series GPU or an RTX 4090 series GPU.

8. The occlusion target detection system based on multi-view fusion according to claim 2, characterized in that: The results output and storage module outputs the detection results through an Ethernet interface, while storing the original image and detection log. The log includes device parameters, occlusion assessment data, and detection confidence information.