Urban rail train autonomous obstacle detection method and system based on long-short focus image fusion
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING JIAOTONG UNIV
- Filing Date
- 2024-12-23
- Publication Date
- 2026-06-23
[Figure: abstract drawing, CN119888679B_ABST]
Description
Technical Field
[0001] This invention relates to the field of obstacle detection and environmental perception technology for urban rail transit trains, and specifically to an autonomous obstacle detection method and system for urban rail transit trains based on a long-focus and short-focus image fusion model.
Background Technology
[0002] Urban rail transit is a high-capacity public transportation infrastructure and the backbone transportation network of large cities. Its safety and operational efficiency are related to the life, health, safety, and travel convenience of passengers. Active obstacle detection is an important research and development direction for realizing the automated operation of urban rail transit trains. It is of great significance for effectively detecting abnormal obstacles encroaching on the train's limits, improving the obstacle detection performance of driverless trains, and enhancing the onboard autonomous perception capabilities.
[0003] Considering the heavy weight, high inertia, and long braking distance of subway train cars, there are high requirements for obstacle detection range. Urban rail transit typically needs to detect abnormal intrusions affecting train movement within a range of at least 200 meters, while also effectively detecting larger obstacles within 100 meters. Since obstacle detection methods may be directly related to emergency braking requests, the obstacle detection algorithm must have high accuracy and a low false alarm rate.
[0004] Traditional obstacle detection methods primarily rely on manual labor or ground-based equipment. The former is inefficient and labor-intensive, while the latter has a limited detection range and high deployment costs for full-line monitoring coverage. Active obstacle detection for trains is a crucial research direction for achieving automated operation of urban rail transit trains and a key technology for improving train operation safety. Onboard active obstacle detection installs environmental perception sensors, such as visible light cameras, and computing resources on the vehicle to acquire and process images, enabling obstacle detection ahead of the train.
[0005] Single-camera systems face several challenges in capturing small obstacles ahead of a train. A single camera can either resolve distant details or cover a wide field of view, but not both at once. Long-distance imaging requires a telephoto lens, which sacrifices field of view, while wide-angle imaging requires a short focal length lens, which loses distant detail. Long-focus and short-focus image fusion combines the sharp local detail captured by a telephoto lens with the overall scene captured by a short focal length lens, so that both the overall scene and the local details are preserved.
[0006] In autonomous obstacle detection for urban rail trains, the heavy weight, high inertia, and long braking distance of metro trains place high demands on detection range, on the detection of small targets at long distance, and on field of view. Urban rail transit typically needs to detect abnormal intrusions as small as 40 cm × 40 cm that affect train operation at a distance of at least 200 meters. Because obstacle detection may be directly linked to emergency braking requests in safety-critical train operation, a shorter detection range means that distant obstacles cannot be detected in time, which significantly reduces the system's early warning capability and the available reaction time. Long-distance, small-target, wide-field-of-view obstacle detection is therefore of great significance.
[0007] Active obstacle detection for trains is a crucial research direction for achieving automated operation of urban rail transit trains. It is of great significance for effectively detecting abnormal obstacles encroaching on train boundaries, improving the obstacle detection capabilities of driverless trains, and enhancing onboard autonomous perception capabilities. When acquiring images of distant, small obstacles ahead of the train, a single camera either focuses on distant small targets or widens its field of view to cover nearby targets, but cannot handle both a wide field of view and the detection of distant, small obstacles at the same time. Long-distance imaging requires a telephoto lens, which sacrifices field of view, while wide-angle imaging requires a short focal length lens, which loses distant detail.
Summary of the Invention
[0008] The purpose of this invention is to provide an autonomous obstacle detection method and system for urban rail trains based on a long-focus and short-focus image fusion model, so as to solve at least one of the technical problems existing in the background art.
[0009] To achieve the above objectives, the present invention adopts the following technical solution:
[0010] In a first aspect, the present invention provides an autonomous obstacle detection method for urban rail trains, comprising:
[0011] Acquire telephoto and short-focus images captured in the urban rail train scene;
[0012] A pre-trained long-range small target detection model is used to process acquired telephoto and short-focus images to obtain long-range small target detection results. The long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit. The preprocessing unit performs bilinear interpolation to enlarge the short-focus image. The registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method. The supplementation unit fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information where some features are missing at the edges of the telephoto image. It also fuses the features of the telephoto and short-focus images, using the complete small target feature information from the short-focus image to supplement the small target feature information where some features are missing at the edges of the telephoto image. The classification unit processes the feature-fused image to obtain the long-range small target detection results.
[0013] The short focal length image is processed using a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image;
[0014] The target detection results of short focal length wide field of view images are fused with the target detection results of distant small targets at the decision level to obtain the final target detection result.
[0015] As a further limitation of the first aspect, fusing the target detection results of the short focal length wide field of view image with the long-range small target detection results at the decision level to obtain the final target detection result includes: applying non-maximum suppression (NMS) to the long-range small target detection results to eliminate redundant and duplicate detections; using NMS to remove low-confidence detection boxes in the short focal length image by comparing the confidence of the detection boxes in overlapping areas, retaining the most representative target boxes; and, where the same target is detected in both the long and short focal length images, using NMS to filter the target boxes by confidence so that the detection box with the higher confidence is retained.
[0016] As a further limitation of the first aspect, feature fusion is performed between telephoto and short-focus images. The complete feature information of small targets in the short-focus image is used to supplement the feature information of small targets that are partially missing at the edge of the telephoto image. This includes: introducing a self-attention mechanism to calculate the correlation of different positions in the feature sequence, dynamically allocating weights, focusing on small target regions with missing features in the telephoto image, and extracting relevant information from the short-focus image to supplement them; and introducing a positional encoding mechanism to embed spatial position information into the feature sequence, determining the source and spatial position of each feature, and realizing feature alignment and integration between modalities.
[0017] As a further limitation of the first aspect, convolutional feature extraction is performed on telephoto and short-focus images respectively to obtain their feature maps; the features of the feature maps of the telephoto and short-focus images are concatenated and positionally encoded; the concatenated and positionally encoded features are divided into multiple non-overlapping local windows using a sliding window, and the local window attention is calculated, the self-attention weight matrix is calculated, and the value matrix is weighted using the self-attention weights to obtain the updated feature representation; the standard multi-head self-attention of the shifted window is calculated again, the result is recalibrated to the same feature image as the input, and it is added as supplementary information to the original modality branch; multi-scale feature fusion is achieved using a weighted bidirectional feature pyramid network, and finally, multi-size feature maps are obtained.
[0018] As a further limitation of the first aspect, in the classification unit, all detected target boxes are sorted according to their confidence levels, with the bounding box with the highest confidence level at the front; starting from the bounding box with the highest confidence level, the remaining bounding boxes are traversed in turn; for the currently traversed bounding box, if its overlap with the previously retained bounding boxes is greater than a certain threshold, then it is suppressed and not retained; the next target box is traversed, and the above process is repeated until all target boxes have been processed; the target boxes that remain at the end are the target boxes of distant small targets captured by the telephoto image, which incorporates the short-focus image to supplement the missing edge feature information of the telephoto image.
[0019] As a further limitation of the first aspect, padding the edge regions of the telephoto image with several zero-valued pixels to complete the small target information where some features are missing at the edge of the telephoto image includes: padding each edge of the telephoto image with $p$ pixels of value 0; assuming the original image is $I$ with size $H \times W$, the padded image is $I'$ with size $(H + 2p) \times (W + 2p)$, where $p$ is determined based on the size of the small targets to be identified.
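As a non-limiting illustration, the zero padding of the telephoto image can be sketched as follows. The function name `pad_telephoto`, the parameter name `p`, and the choice to pad all four edges by the same amount are assumptions made for this sketch and are not taken from the specification.

```python
import numpy as np

def pad_telephoto(image: np.ndarray, p: int) -> np.ndarray:
    """Zero-pad p pixels on every edge of an H x W or H x W x C image.

    Assumption: the same margin p is used on all four edges.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad_width = [(p, p), (p, p)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_width, mode="constant", constant_values=0)

tele = np.ones((4, 6, 3), dtype=np.uint8)   # stand-in for a telephoto frame
padded = pad_telephoto(tele, p=2)
print(padded.shape)  # (8, 10, 3)
```

In practice `p` would be chosen from the expected pixel size of the smallest target of interest, as the paragraph above states.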
[0020] In a second aspect, the present invention provides an autonomous obstacle detection system for urban rail trains, comprising:
[0021] The acquisition module is used to acquire telephoto and short-focus images captured in urban rail train scenes;
[0022] The first detection module processes the acquired telephoto and short-focus images using a pre-trained long-range small target detection model to obtain long-range small target detection results. The long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit. The preprocessing unit performs bilinear interpolation to enlarge the short-focus image. The registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method. The supplementation unit fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information where some features are missing at the edges of the telephoto image. It also fuses the features of the telephoto and short-focus images, using the complete small target feature information from the short-focus image to supplement the small target feature information where some features are missing at the edges of the telephoto image. The classification unit processes the feature-fused image to obtain the long-range small target detection results.
[0023] The second detection module is used to process the short focal length image using a short focal length wide field of view target detection model to obtain the target detection result of the short focal length wide field of view image.
[0024] The fusion module is used to perform decision-level fusion of target detection results from short focal length wide field of view images with long-distance small target detection results to obtain the final target detection result.
[0025] In a third aspect, the present invention provides a non-transitory computer-readable storage medium for storing computer instructions, which, when executed by a processor, implement the autonomous obstacle detection method for urban rail trains as described in the first aspect.
[0026] In a fourth aspect, the present invention provides a computer device including a memory and a processor, wherein the processor and the memory communicate with each other, the memory stores program instructions executable by the processor, and the processor invokes the program instructions to execute the autonomous obstacle detection method for urban rail trains as described in the first aspect.
[0027] In a fifth aspect, the present invention provides an electronic device comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, the computer program is stored in the memory, and when the electronic device is running, the processor executes the computer program stored in the memory to cause the electronic device to execute instructions for implementing the autonomous obstacle detection method for urban rail trains as described in the first aspect.
[0028] The beneficial effects of this invention are: by making full use of the complementary information of the images after long and short focal length fusion, it improves the detection capability of small targets at long distances at the boundaries of long and short focal length images, realizes the detection of obstacles at long distances with a large field of view, and provides technical support for enhancing the autonomous perception capability of urban rail vehicles.
[0029] The advantages of additional aspects of the invention will be set forth more clearly in the following description or will be learned by practice of the invention.
Brief Description of the Drawings
[0030] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0031] Figure 1 is a flowchart of the autonomous obstacle detection method for urban rail trains based on a long-focus and short-focus image fusion model, as described in an embodiment of the present invention.
[0032] Figure 2 shows the detection results of a short-focus camera in a subway tunnel according to an embodiment of the present invention.
[0033] Figure 3 shows the detection results of long-short focal length fusion in the same field of view in a subway tunnel according to an embodiment of the present invention.
[0034] Figure 4 shows the detection results for small targets in a subway tunnel using long-short focal length fusion with a wide field of view, as described in an embodiment of the present invention.
Detailed Description of the Embodiments
[0035] Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0036] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0037] It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined herein.
[0038] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0039] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
[0040] To facilitate understanding of the present invention, the present invention will be further explained and described below with reference to the accompanying drawings and specific embodiments. However, the specific embodiments do not constitute a limitation on the embodiments of the present invention.
[0041] Those skilled in the art should understand that the accompanying drawings are merely schematic diagrams of embodiments, and the components in the drawings are not necessarily essential for implementing the present invention.
[0042] This invention proposes a long-focus and short-focus image fusion network model based on the Swin Transformer model (SwinTransformer Fuse with Different Focal Length Cameras, STFDFL). It makes full use of the complementary information in the long-focus and short-focus images and improves the detection of distant small targets at the boundaries of those images. This enables the detection of distant small obstacles over a large field of view and provides technical support for enhancing the autonomous perception capabilities of urban rail vehicles. The invention combines images from a long-focus camera and a short-focus camera. Within the field of view shared by both cameras, zero-pixel padding of the long-focus image avoids the incompleteness of distant small targets, and bilinear interpolation of the short-focus image over the same field of view provides complementary information. The proposed dual-path fusion network STFDFL fuses the information within the shared field of view, improving the detection accuracy of distant small targets, especially those that are incomplete at the image edge. Furthermore, whereas the computational complexity of the traditional Transformer model is proportional to the square of the image size, the computational complexity of the Swin Transformer model is linear in the image size, which improves the computational efficiency of the algorithm. Finally, by adding the large field of view of the short-focus camera, onboard autonomous detection of distant small obstacles over a wide field of view is achieved, meeting the requirements of detecting abnormal obstacles at a distance of not less than 200 meters and small targets of 40 cm × 40 cm, and supporting improvements in the efficiency and accuracy of autonomous obstacle detection for urban rail trains.
This invention uses a long-focus camera for the detection and recognition of small targets at long distance, and a short-focus camera for the detection and recognition of targets at close range over a wide field of view. Within the field of view shared by the two cameras, the detection accuracy for distant small targets that are incomplete at the image boundary is improved by padding the image pixels and fusing with the Swin Transformer multi-sensor attention mechanism. The method includes:
(1) Long-distance small target detection and recognition. The two cameras view the same area from different perspectives because of their different installation positions. To make more effective use of their image information, bilinear interpolation is used to enlarge the shared area in the short-focus image to the same size as the long-focus image. To address the detection of distant small targets that are incomplete at the image boundary, the long-focus image is padded so that it can contain complete targets. Using the Swin Transformer based long-short focus dual-path fusion network, the complete small target feature information of the short-focus image supplements the small target feature information that is partially missing at the edge of the long-focus image. This fuses the long-focus and short-focus images of the shared area and enables the detection of long-distance small targets, especially those that are incomplete at the image boundary.
(2) Short-distance wide field of view target detection and recognition. A common deep learning algorithm, such as SSD or YOLOvx, performs target recognition on the short-focus camera images.
(3) Long and short focal length detection fusion. The long-distance small target detection of the long-focus images is combined with the wide field of view of the short-focus camera to achieve accurate detection of long-distance small targets over a large field of view, meeting the requirements of detecting abnormal intrusion obstacles at no less than 200 meters and small targets of 40 cm × 40 cm, and providing a reference for improving the efficiency and accuracy of autonomous obstacle detection for urban rail trains.
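As a non-limiting illustration of the complexity comparison in paragraph [0042], the commonly cited estimates for global multi-head self-attention (MSA) and window-based self-attention (W-MSA) are given below. The symbols are not taken from this specification and are assumed here: the feature map has $h \times w$ tokens with $C$ channels, and each local window holds $M \times M$ tokens.

```latex
\begin{align*}
\Omega(\mathrm{MSA})          &= 4hwC^{2} + 2(hw)^{2}C \\
\Omega(\text{W-MSA})          &= 4hwC^{2} + 2M^{2}hwC
\end{align*}
```

For a fixed window size $M$, the second term of W-MSA grows linearly with the token count $hw$, whereas the second term of MSA grows quadratically.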
[0043] Embodiment 1
[0044] In this embodiment 1, an autonomous obstacle detection system for urban rail trains is provided, comprising: an acquisition module for acquiring telephoto and short-focus images collected in an urban rail train scene; a first detection module for processing the acquired telephoto and short-focus images using a pre-trained long-range small target detection model to obtain long-range small target detection results; wherein, the long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit; the preprocessing unit is used to perform bilinear interpolation magnification preprocessing on the short-focus image; the registration unit is used to obtain the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method; the supplementation unit is used to fill the edge regions of the telephoto image with several pixels of 0 to complete the small target information with partially missing features at the edge of the telephoto image, and to perform feature fusion between the telephoto and short-focus images, supplementing the small target feature information with partially missing features at the edge of the telephoto image with the complete small target feature information of the short-focus image; the classification unit is used to process the feature-fused image to obtain the long-range small target detection results. The second detection module processes the short-focal-length image using a short-focal-length wide-field-of-view target detection model to obtain the target detection result of the short-focal-length wide-field-of-view image. The fusion module performs decision-level fusion of the target detection result of the short-focal-length wide-field-of-view image with the detection result of distant small targets to obtain the final target detection result.
[0045] In this embodiment, the above-described system is used to implement an autonomous obstacle detection method for urban rail trains, including: acquiring telephoto and short-focus images of the urban rail train scene using an acquisition module; processing the acquired telephoto and short-focus images using a pre-trained long-range small target detection model based on a first detection module to obtain long-range small target detection results; wherein, the long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit; the preprocessing unit is used to perform bilinear interpolation magnification preprocessing on the short-focus image; the registration unit is used to obtain the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method; the supplementation unit is used to fill the edge regions of the telephoto image with several pixels of 0 to complete the small target information with partially missing features at the edge of the telephoto image, and to perform feature fusion between the telephoto and short-focus images, supplementing the small target feature information with partially missing features at the edge of the telephoto image with the complete small target feature information of the short-focus image; the classification unit is used to process the image after feature fusion to obtain long-range small target detection results. Based on the second detection module, the short-focal-length image is processed using a short-focal-length wide-field-of-view target detection model to obtain the target detection result of the short-focal-length wide-field-of-view image. A fusion module is then used to perform decision-level fusion of the target detection result of the short-focal-length wide-field-of-view image with the detection result of distant small targets to obtain the final target detection result.
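As a non-limiting illustration of the preprocessing unit, bilinear interpolation magnification of a grayscale image can be sketched as follows. The function name `bilinear_resize` and the corner-aligned sampling convention are assumptions of this sketch; the specification does not state which sampling convention is used.

```python
import numpy as np

def bilinear_resize(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2-D grayscale image by bilinear interpolation.

    Assumption: corner-aligned sampling, i.e. the corner pixels of the
    input and output images coincide.
    """
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)      # sample rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)      # sample columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)         # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                   # vertical weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]                   # horizontal weights, shape (1, out_w)
    img = img.astype(np.float64)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Hypothetical usage: enlarge a cropped short-focus region to a telephoto-sized grid.
short_region = np.array([[0.0, 10.0], [20.0, 30.0]])
print(bilinear_resize(short_region, 3, 3))
```

A production system would more likely call an optimized library routine; the explicit version is shown only to make the weighting of the four neighbouring pixels visible.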
[0046] In this embodiment, fusing the target detection results of the short focal length wide field of view image with the long-distance small target detection results at the decision level to obtain the final target detection result includes: applying non-maximum suppression (NMS) to the long-distance small target detection results to eliminate redundant and duplicate detections; using NMS to remove low-confidence detection boxes in the short focal length image by comparing the confidence of the detection boxes in overlapping areas, retaining the most representative target boxes; and, where the same target is detected in both the long and short focal length images, using NMS to filter the target boxes by confidence so that the detection box with the higher confidence is retained.
[0047] In this embodiment, feature fusion is performed between telephoto and short-focus images. The complete feature information of small targets in the short-focus image is used to supplement the feature information of small targets that are partially missing at the edges of the telephoto image. This includes: using a self-attention mechanism to achieve long-distance dependency modeling and multimodal feature interaction, better handling global information fusion for long-distance dependencies; the self-attention mechanism calculates the correlation between different positions in the feature sequence, dynamically assigns weights, focuses on small target regions with missing features in the telephoto image, and extracts relevant information from the short-focus image to supplement them. Furthermore, by introducing a positional encoding mechanism to embed spatial location information into the feature sequence, the model can accurately perceive the source and spatial location of each feature, thereby achieving efficient feature alignment and integration between modalities.
[0048] In this embodiment, the feature fusion process of the telephoto image and the short-focus image specifically includes:
[0049] The first step is to perform convolutional feature extraction on the telephoto and short-focus images respectively to obtain their feature maps $F_t$ and $F_s$. The size of each feature map is $H \times W \times C$, where $H$ and $W$ are the height and width of the image and $C$ is the number of channels.
[0050] The second step concatenates the features of the telephoto and short-focus feature maps and applies positional encoding. Specifically, to combine the features of images with different focal lengths, the feature sequences of the two images, $X_t$ and $X_s$, are spliced into a new sequence $X$: $X = \mathrm{concat}(X_t, X_s)$, where concat is the concatenation operation, $X_t, X_s \in \mathbb{R}^{HW \times C}$, $HW$ is the total number of pixels in the image, and $C$ is the number of feature channels for each pixel. A positional encoding $P$, of the same size as the spliced sequence $X$, is added to embed the location of each feature. The positional encoding ensures that the Swin Transformer can perceive the spatial information of each location. The sequence after adding the positional encoding is $Z = X + P$.
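As a non-limiting illustration of the second step, the flattening, concatenation, and positional encoding can be sketched as follows. Two points are assumptions of this sketch rather than statements of the specification: the two sequences are concatenated along the token axis, and the positional encoding is simply passed in as an array (in a trained model it would typically be a learned parameter). The function name `build_fused_sequence` is hypothetical.

```python
import numpy as np

def build_fused_sequence(f_tele: np.ndarray, f_short: np.ndarray,
                         pos: np.ndarray) -> np.ndarray:
    """Flatten two H x W x C feature maps, concatenate them, add positions.

    Assumption: concatenation is along the token axis, giving 2*H*W tokens
    of C channels each.
    """
    if f_tele.shape != f_short.shape:
        raise ValueError("feature maps must have the same shape")
    h, w, c = f_tele.shape
    x_tele = f_tele.reshape(h * w, c)       # one token per pixel
    x_short = f_short.reshape(h * w, c)
    x = np.concatenate([x_tele, x_short], axis=0)
    if pos.shape != x.shape:
        raise ValueError("positional encoding must match the sequence shape")
    return x + pos

rng = np.random.default_rng(0)
f_t = rng.standard_normal((4, 5, 8))        # stand-in telephoto features
f_s = rng.standard_normal((4, 5, 8))        # stand-in short-focus features
p_enc = 0.02 * rng.standard_normal((2 * 4 * 5, 8))  # stand-in for learned positions
print(build_fused_sequence(f_t, f_s, p_enc).shape)  # (40, 8)
```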
[0051] The third step divides the concatenated and positionally encoded features into multiple non-overlapping local windows using a sliding window, computes the local window attention and the self-attention weight matrix, and uses the self-attention weights to weight the value matrix, yielding the updated feature representation. Specifically, an $N \times N$ sliding window divides the input into multiple non-overlapping local windows, where $N$ is the size of the local sliding window. For each local window with feature sequence $Z_w$, the local window attention is calculated from $Q = Z_w W_Q$, $K = Z_w W_K$, $V = Z_w W_V$, where $W_Q$, $W_K$, and $W_V$ are weight matrices obtained through model training and $d$ is the dimension of the query, key, and value. Next, the self-attention weight matrix is calculated as $A = \mathrm{softmax}(QK^{\top} / \sqrt{d})$, which represents the correlation between every two positions in the input feature sequence of the local window. In this process, the weights are normalized so that the attention distribution at each position reflects its attention to the other positions. The self-attention weights $A$ are then used to weight the value matrix $V$, giving the updated feature representation $\hat{Z}_w = AV$. This process assigns different weights to each position, thereby enhancing the features of small target regions in the telephoto image while supplementing them with detailed information from the short-focus image.
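As a non-limiting illustration of the third step, window partitioning and single-head window attention can be sketched as follows. Several points are assumptions of this sketch: scaled dot-product attention with a softmax is used, a single head is shown, the projection matrices are random stand-ins for trained weights, and the input is laid out as one H x W x C grid (the specification does not state how the concatenated two-image sequence is arranged before windowing).

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x: np.ndarray, n: int) -> np.ndarray:
    """Split an H x W x C map into non-overlapping n x n windows.

    Returns an array of shape (num_windows, n*n, C).
    """
    h, w, c = x.shape
    if h % n or w % n:
        raise ValueError("H and W must be divisible by the window size")
    x = x.reshape(h // n, n, w // n, n, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, n * n, c)

def window_attention(windows, w_q, w_k, w_v):
    """Single-head self-attention computed independently inside each window."""
    q, k, v = windows @ w_q, windows @ w_k, windows @ w_v
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # rows sum to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 6)) for _ in range(3))
out, attn = window_attention(window_partition(feat, 2), w_q, w_k, w_v)
print(out.shape, attn.shape)  # (4, 4, 6) (4, 4, 4)
```

Because attention is confined to each window, the cost per window is fixed, which is the source of the linear scaling with image size mentioned earlier in the description.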
[0052] The fourth step computes standard multi-head self-attention again on the shifted windows, reshapes the result to the same feature map size as the input, and adds it to the original modality branch as supplementary information. Multi-scale feature fusion is then performed with a weighted bidirectional feature pyramid network to obtain the final multi-size feature maps.
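As a non-limiting illustration of the weighted fusion in a bidirectional feature pyramid network, one fusion node can be sketched as follows. The specification does not state which weighting scheme is used; this sketch assumes the "fast normalized fusion" rule known from the BiFPN design, in which non-negative weights are normalized by their sum. The function name and the value of `eps` are assumptions, and the input feature maps are assumed to have already been resized to a common shape.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps: float = 1e-4) -> np.ndarray:
    """Fuse same-shaped feature maps as sum_i (w_i / (eps + sum_j w_j)) * F_i.

    Negative weights are clipped to zero (ReLU) so each input contributes
    a non-negative share.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))

coarse = np.ones((2, 2))          # stand-in for an upsampled coarse-level map
fine = 3.0 * np.ones((2, 2))      # stand-in for a fine-level map
print(fast_normalized_fusion([coarse, fine], [1.0, 1.0]))
```

In a trained network the weights would be learned per fusion node, and each fused map would typically pass through a convolution before being used at the next level.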
[0053] In this embodiment, the classification unit performs the following: all detected target boxes are sorted by confidence, with the highest-confidence bounding box first; starting from the highest-confidence bounding box, the remaining bounding boxes are traversed in sequence; if the overlap of the currently traversed bounding box with a previously retained bounding box (i.e., the calculated IOU value) is greater than a certain threshold, the box is suppressed and not retained; traversal continues with the next target box, repeating the above process until all target boxes have been processed. The target boxes remaining at the end are those of the distant small targets captured by the telephoto image, which incorporates the short-focus image to supplement the edge feature information missing from the telephoto image.
[0054] The registration unit is used to obtain the registration regions of the telephoto and short-focus images based on the normalized cross-correlation matching method, specifically including:
[0055] The normalized cross-correlation (NCC) matching algorithm is used to obtain the registration regions of the long and short focal length images. The images are first preprocessed, for example by denoising or grayscale conversion. NCC then calculates the mean pixel value of each preprocessed image, and the correlation between the two images is calculated using a sliding window. To eliminate the influence of absolute intensity values on the cross-correlation value, the NCC algorithm normalizes the cross-correlation. The cross-correlation values of the two images in different regions are compared with a preset threshold; values exceeding the threshold are selected, and the region with the largest cross-correlation value is found by maximum comparison. This region is taken as the optimal matching position, that is, the short focal length image registration region to be extracted.
[0056] In this embodiment, the edge regions of the telephoto image are padded with pixels of value 0. Assuming the original image $I$ has size $H \times W$, the padded image $I'$ has size $(H + 2p) \times (W + 2p)$, where the padding width $p$ is determined by the size of the small target to be identified.
[0057] Embodiment 2
[0058] As shown in Figure 1, Embodiment 2 proposes an autonomous obstacle detection method for urban rail trains based on a dual-path fusion network (STFDFL). For long-range small target detection, the short-focus image is first enlarged by bilinear interpolation according to the focal length ratio of the long-focus and short-focus cameras and then registered with the long-focus image; the matching region in the short-focus image is then cropped out as one input image. Although a long-focus camera can form high-resolution images of long-range small targets, its field of view is narrow, so significant feature loss occurs when a small target lies at the edge of the long-focus image. This embodiment therefore pads the edges of the long-focus image with pixels. Padding maintains the integrity of the small target in the long-focus image and avoids complete loss of image areas: even if some areas of the long-range small target have lost information, padding ensures that the small target image as a whole remains a complete matrix. Finally, the Swin Transformer module fuses the image features of the long and short focal length matching regions, and the complete small target feature information from the short-focus image supplements the small target feature information that is partially missing at the edge of the long-focus image. This dual-path image information fusion enables the detection of abnormal intrusion obstacles at least 200 meters in front of the train and of small targets as small as 40cm*40cm. For short-range wide-field-of-view target detection, deep learning algorithms such as SSD or YOLOvx are employed (the BACKBONE+NECK+HEAD structure in Figure 1). Ultimately, decision-level fusion combines the long-range detection of the telephoto image with the wide field of view of the short-focus camera, achieving accurate detection of long-range small targets over a wide field of view, meeting the detection requirement for 40cm*40cm small targets at a distance of 200 meters, and significantly improving detection efficiency and accuracy.
[0059] As shown in Figure 1, the autonomous obstacle detection method for urban rail trains based on a dual-path fusion network (STFDFL) in this embodiment includes the following steps. First, for the acquired long-focus and short-focus images, the Swin Transformer model (the long-range small target detection model) produces a long-range small target detection result that combines the long-focus and short-focus images. Second, for the acquired short-focus image, a deep learning algorithm such as SSD or YOLOvx (the short-focus wide-field-of-view target detection model) produces a short-range wide-field-of-view target detection result. Finally, a fusion module performs decision-level fusion of the long-range small target detection result and the short-range wide-field-of-view target detection result to achieve accurate detection of long-range small targets over a wide field of view.
[0060] Specifically, the Swin Transformer model includes a preprocessing unit, which enlarges the short-focus image (the matching region) by bilinear interpolation. Because lenses of different focal lengths differ significantly in imaging capability, the short-focus image is enlarged by bilinear interpolation according to the focal length ratio, as shown in formula (4-1):
[0061] $k = \dfrac{f_l}{f_s}$ (4-1)
[0062] where $f_l$ is the focal length of the telephoto lens and $f_s$ is the focal length of the short-focus lens. The enlarged target size of the short-focus image is obtained from the focal length ratio $k$, and the short-focus image is then enlarged by bilinear interpolation, as shown in formula (4-2):
[0063] $f(x, y) = \dfrac{(x_2 - x)(y_2 - y)\,f(x_1, y_1) + (x - x_1)(y_2 - y)\,f(x_2, y_1) + (x_2 - x)(y - y_1)\,f(x_1, y_2) + (x - x_1)(y - y_1)\,f(x_2, y_2)}{(x_2 - x_1)(y_2 - y_1)}$ (4-2)
[0064] The target point $(x, y)$ is the point to be interpolated. It lies at non-integer coordinates of the original image, and its four surrounding pixels are: the top-left pixel $(x_1, y_1)$ with value $f(x_1, y_1)$; the top-right pixel $(x_2, y_1)$ with value $f(x_2, y_1)$; the bottom-left pixel $(x_1, y_2)$ with value $f(x_1, y_2)$; and the bottom-right pixel $(x_2, y_2)$ with value $f(x_2, y_2)$. These four pixels form a rectangular area; the target point lies inside the rectangle, and its pixel value is calculated by interpolation. As formula (4-2) shows, pixels closer to the target point have a greater influence on the interpolated value, so that the features of the image are not distorted after interpolation and magnification.
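By way of non-limiting illustration, the enlargement of formulas (4-1) and (4-2) can be sketched as follows. This is a minimal sketch, not the implementation of this embodiment: the function name `bilinear_resize`, the single-channel input, and the corner-aligned mapping from output pixels to source coordinates are assumptions made for the example, and `scale` stands for the focal length ratio $k$.

```python
import numpy as np

def bilinear_resize(img: np.ndarray, scale: float) -> np.ndarray:
    """Enlarge a 2-D grayscale image by `scale` using bilinear interpolation.

    `scale` plays the role of the focal length ratio k (hypothetical helper).
    """
    h, w = img.shape
    out_h, out_w = int(round(h * scale)), int(round(w * scale))
    # Map every output pixel back to (generally non-integer) source coordinates.
    # Assumption: corner-aligned sampling, so image corners map to corners.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)   # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]          # vertical distance to the upper row
    wx = (xs - x0)[None, :]          # horizontal distance to the left column
    img = img.astype(np.float64)
    # Interpolate along x on the upper and lower rows, then along y.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

In this sketch the weights of the four neighbours sum to 1, so a pixel that coincides with a source pixel keeps its value, and intermediate pixels are dominated by the nearest neighbours, as stated for formula (4-2).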
[0065] The Swin Transformer model also includes a registration unit, which obtains the registration regions of the long and short focal length images using the normalized cross-correlation (NCC) matching algorithm. The images are first preprocessed, for example by denoising or grayscale conversion; NCC then calculates the mean pixel value of each preprocessed image, as shown in formula (4-3):
[0066] $\bar{I} = \dfrac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} I(x, y)$ (4-3)
[0067] where $\bar{I}$ is the mean pixel value of the image, $I(x, y)$ is the pixel value of the image at $(x, y)$, $W$ is the width of the image, and $H$ is the height of the image. After the mean of each of the two images has been calculated, the correlation between the images is calculated using a sliding window, as shown in formula (4-4):
[0068] $C(u, v) = \sum_{x=1}^{w}\sum_{y=1}^{h}\left[T(x, y) - \bar{T}\right]\left[I(x + u, y + v) - \bar{I}\right]$ (4-4)
[0069] where $C(u, v)$ is the cross-correlation value; $T$ is the template image, with mean $\bar{T}$; $w$ is the width of the template image; $h$ is the height of the template image; $u$ is the position offset of the template image in the background image in the $x$ direction, with range $[0, W - w]$; and $v$ is the position offset of the template image in the background image in the $y$ direction, with range $[0, H - h]$.
[0070] To eliminate the influence of absolute intensity values on the cross-correlation value, the NCC algorithm normalizes the cross-correlation, as shown in formula (4-5):
[0071] $\mathrm{NCC}(u, v) = \dfrac{C(u, v)}{\sqrt{\sum_{x=1}^{w}\sum_{y=1}^{h}\left[T(x, y) - \bar{T}\right]^{2}\,\sum_{x=1}^{w}\sum_{y=1}^{h}\left[I(x + u, y + v) - \bar{I}\right]^{2}}}$ (4-5)
[0072] where $\mathrm{NCC}(u, v)$ is the normalized cross-correlation value. The normalized cross-correlation values of the two images in different regions are compared with a preset threshold; values exceeding the threshold are selected, and the region with the largest value is found by maximum comparison. This region is taken as the optimal matching position, that is, the short-focus image registration region to be extracted.
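By way of non-limiting illustration, the sliding-window matching described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `ncc_match` and the default threshold of 0.8 are hypothetical, and the mean of the background is taken over the region currently under the template (the common zero-mean NCC variant), which may differ from the exact form used in this embodiment.

```python
import numpy as np

def ncc_match(background, template, threshold=0.8):
    """Return ((u, v), score) of the best NCC match, or (None, score) if the
    best score does not exceed `threshold` (hypothetical helper)."""
    bg = background.astype(np.float64)
    tp = template.astype(np.float64)
    H, W = bg.shape
    h, w = tp.shape
    tp_zero = tp - tp.mean()                    # template minus its mean
    tp_norm = np.sqrt((tp_zero ** 2).sum())
    best_score, best_pos = -1.0, None
    for v in range(H - h + 1):                  # offset in the y direction
        for u in range(W - w + 1):              # offset in the x direction
            patch = bg[v:v + h, u:u + w]
            p_zero = patch - patch.mean()       # assumption: local patch mean
            denom = tp_norm * np.sqrt((p_zero ** 2).sum())
            if denom == 0:                      # flat region, score undefined
                continue
            score = (p_zero * tp_zero).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (u, v)
    if best_pos is None or best_score <= threshold:
        return None, best_score
    return best_pos, best_score
```

The explicit double loop keeps the correspondence with the offsets $u$ and $v$ visible; a practical system would more likely use an optimized library routine for the same computation.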
[0073] The Swin Transformer model also includes a supplementary unit, which fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information that is partially missing at the edge of the telephoto image. It fuses the features of the telephoto image and the short-focus image, and uses the complete small target feature information of the short-focus image to supplement the small target feature information that is partially missing at the edge of the telephoto image.
[0074] In this process, the edge regions of the telephoto image are padded with several zero-valued pixels (the padding width depends on the size of the small target), as shown by the edge padding in Figure 1, to complete the information of small targets whose features are partially missing at the edges of the telephoto image. Missing features in edge regions are a particularly prominent problem when detecting small obstacles ahead on the track in a telephoto image. Because of the inherent characteristics of telephoto lenses, details at image edges are often blurred or lost, which makes small targets harder to recognize. To address this, the supplementary unit of this embodiment applies an image preprocessing method: the edge regions of the telephoto image are padded with $p$ pixels of value 0. Assume the original image is $I$ with size $H \times W$ and the padded image is $I'$ with size $(H + 2p) \times (W + 2p)$, where $p$ is determined by the size of the small target to be identified, as shown in formulas (4-6) and (4-7).
[0075] $I = \begin{bmatrix} I(1,1) & \cdots & I(1,W) \\ \vdots & \ddots & \vdots \\ I(H,1) & \cdots & I(H,W) \end{bmatrix}$ (4-6)
[0076] $I' = \begin{bmatrix} 0_{p \times p} & 0_{p \times W} & 0_{p \times p} \\ 0_{H \times p} & I & 0_{H \times p} \\ 0_{p \times p} & 0_{p \times W} & 0_{p \times p} \end{bmatrix}$ (4-7)
[0077] Formula (4-6) represents the original image matrix and formula (4-7) represents the padded image matrix, in which the padded area has pixel value 0. The padding operation can be described by formula (4-8):
[0078] $I'(i, j) = \begin{cases} I(i - p,\, j - p), & p < i \le H + p \ \text{and}\ p < j \le W + p \\ 0, & \text{otherwise} \end{cases}$ (4-8)
[0079] That is, the pixel values in the padded areas at the top, bottom, left, and right of the image are all 0, while the pixel values in the original image area are unchanged: outside the padded area, each pixel of the padded image $I'$ equals the corresponding pixel of the original image $I$.
[0080] The core idea of this pixel-padding method is to compensate for the feature information missing due to optical system limitations by expanding the edge areas of the image. Specifically, in this embodiment, a padding region several pixels wide is added to the top, bottom, left, and right sides of the image, and the pixel values in these regions are set to 0. This operation expands the effective range of the image without introducing additional noise and preserves the integrity of the original image data; it also provides more contextual information for subsequent obstacle detection algorithms, thereby improving the ability to recognize small targets in the edge regions of telephoto images.
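By way of non-limiting illustration, the zero padding of the telephoto image can be sketched with NumPy as follows. The function name `pad_edges` and the parameter name `p` are assumptions made for the example; the sketch pads only the two spatial axes, so a color image with a trailing channel axis keeps its channel count.

```python
import numpy as np

def pad_edges(img: np.ndarray, p: int) -> np.ndarray:
    """Pad `p` zero-valued pixels on the top, bottom, left, and right of an
    image of shape (H, W) or (H, W, C) (hypothetical helper)."""
    # Pad the two spatial axes only; leave any channel axis untouched.
    pad_width = ((p, p), (p, p)) + ((0, 0),) * (img.ndim - 2)
    return np.pad(img, pad_width, mode="constant", constant_values=0)
```

For an input of size $H \times W$ the result has size $(H + 2p) \times (W + 2p)$, with the original pixels unchanged in the central region.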
[0081] The supplementary unit also implements feature fusion of the long-focus and short-focus images. Specifically, feature fusion between the long-focus and short-focus images is performed based on the Swin Transformer architecture. The complete small target feature information from the short-focus image supplements the feature information of small targets that are partially missing at the edges of the long-focus image, achieving fusion of the same region in the long-focus and short-focus images and detection of small targets at a distance. In the feature fusion part, the Swin Transformer is introduced to achieve long-distance dependency modeling and multimodal feature interaction through a self-attention mechanism, which better handles the fusion of global information with long-distance dependencies. The self-attention mechanism calculates the correlation between different positions in the feature sequence, dynamically assigns weights, focuses on small target regions with missing features in the long-focus image, and extracts relevant information from the short-focus image to supplement them. In the model shown in Figure 1, the main function of the head layer is to map the features processed by the neck layer to the final output space, generating the network's final prediction result. The head layer is the topmost part of the model and is typically a classifier or regressor responsible for a specific task such as classification, object detection, or semantic segmentation. The structure of the head layer varies with the task: in image classification the head layer might use a softmax classifier, while in object detection it might contain bounding box regressors and classifiers. Furthermore, the Swin Transformer's positional encoding mechanism embeds spatial location information into the feature sequence, enabling the model to accurately perceive the source and spatial location of each feature, thereby achieving efficient feature alignment and integration between modalities. The specific process of this step is as follows:
[0082] The first step extracts and preprocesses features from the telephoto and short-focus images. Convolutional feature extraction is performed on both images to obtain their feature maps $F_l$ and $F_s$. Each feature map has size $H \times W \times C$, where $H$ and $W$ are the height and width of the image and $C$ is the number of channels.
[0083] The second step concatenates the image features and encodes their positions. To combine the features of the two images with different focal lengths, the feature sequences of the two images, $F_l$ and $F_s$, are spliced into a new sequence $F$, as shown in formula (4-9):
[0084] $F = \mathrm{concat}(F_l, F_s)$ (4-9)
[0085] where concat is the splicing operation; each feature sequence has one entry per pixel, $HW$ entries in total for the image, and $C$ feature channels per pixel. A position encoding $P$, of the same size as the spliced sequence $F$, is added to embed the location of each feature, so that the Swin Transformer can perceive the spatial information of each location. The input after adding the position encoding is shown in formula (4-10):
[0086] $X = F + P$ (4-10)
[0087] where $X$ is the input to the Swin Transformer, containing all feature and location information.
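By way of non-limiting illustration, the splicing and position encoding of the second step can be sketched as follows. This is a sketch under stated assumptions: the function name `build_fusion_input` is hypothetical, the two feature maps are flattened row by row and spliced along the sequence axis (giving $2HW$ entries of $C$ channels), and the position encoding is simply passed in as an array, whereas in a trained model it would typically be a learned parameter.

```python
import numpy as np

def build_fusion_input(feat_long, feat_short, pos_enc):
    """Splice two (H, W, C) feature maps into one sequence and add a position
    encoding of matching shape (hypothetical helper)."""
    H, W, C = feat_long.shape
    assert feat_short.shape == (H, W, C), "feature maps must share a shape"
    seq_long = feat_long.reshape(H * W, C)     # one row per pixel
    seq_short = feat_short.reshape(H * W, C)
    # Assumption: splice along the sequence axis, giving shape (2*H*W, C).
    spliced = np.concatenate([seq_long, seq_short], axis=0)
    assert pos_enc.shape == spliced.shape, "P must match the spliced sequence"
    return spliced + pos_enc
```

With this layout, the first $HW$ rows of the result come from the telephoto feature map and the remaining $HW$ rows from the short-focus feature map, so the position encoding can distinguish both the source and the location of each feature.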
[0088] The third step uses an $N \times N$ sliding window to divide the input into multiple non-overlapping local windows, where $N$ is the size of the local sliding window. For each local window, the local window attention is calculated; its $Q$, $K$, and $V$ are computed as shown in formulas (4-11) to (4-13):
[0089] $Q = XW^{Q}$ (4-11)
[0090] $K = XW^{K}$ (4-12)
[0091] $V = XW^{V}$ (4-13)
[0092] where $X$ is the input feature sequence of the local window, $W^{Q}$, $W^{K}$, and $W^{V}$ are weight matrices learned during model training, and $d_k$ is the dimension of the query, key, and value. Next, the self-attention weight matrix $A$ is calculated, as shown in formula (4-14):
[0093] $A = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)$ (4-14)
[0094] Here $A$ represents the correlation between every two positions in the input feature sequence of the local window. The weights are normalized so that the attention distribution at each location reflects its attention to other locations. The self-attention weights $A$ are then used to weight the value matrix $V$ to obtain the updated feature representation $Z$, as shown in formula (4-15):
[0095] $Z = AV$ (4-15)
[0096] This process assigns different weights to each location, thereby enhancing the features of small target regions in the telephoto image while supplementing them with detailed information from the short-focus image.
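By way of non-limiting illustration, single-head attention within non-overlapping local windows can be sketched as follows. This is a sketch under stated assumptions: the names `window_partition` and `window_attention` are hypothetical, the features are assumed to be arranged on an $H \times W$ grid whose sides are multiples of the window size `n`, and multi-head splitting, relative position bias, and the output projection used in practice are omitted.

```python
import numpy as np

def window_partition(x, n):
    """Split an (H, W, C) map into non-overlapping n*n windows,
    returning an array of shape (num_windows, n*n, C)."""
    H, W, C = x.shape
    x = x.reshape(H // n, n, W // n, n, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, n * n, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, Wq, Wk, Wv, n):
    """Scaled dot-product attention computed independently in each window."""
    windows = window_partition(x, n)           # (num_windows, n*n, C)
    Q, K, V = windows @ Wq, windows @ Wk, windows @ Wv
    d_k = Q.shape[-1]
    # Attention weights: each row sums to 1 over the positions of its window.
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    return A @ V, A                            # updated features and weights
```

Because attention is computed only among the `n*n` positions of a window, the cost grows linearly with the number of windows rather than quadratically with the total number of positions.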
[0097] In the fourth step, the Swin Transformer layer recalculates the standard multi-head self-attention (MSA) with shifted windows. The layer consists of window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), each followed by a multilayer perceptron (MLP) that uses Gaussian error linear units (GELU) for nonlinearity. A LayerNorm layer is applied before each MSA and each MLP, and each module uses residual connections to build the Transformer's multi-head attention mechanism. The result is then recalibrated to a feature map of the same size as the input and added as supplementary information to the original modality branch. Finally, BiFPN is used for multi-scale feature fusion to obtain multi-scale feature maps, as in the NECK layer in Figure 1.
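By way of non-limiting illustration, the window shift that distinguishes SW-MSA from W-MSA can be sketched as a cyclic shift of the feature map by half a window before partitioning. This is a sketch only: the function names are hypothetical, and the attention mask that a full Swin Transformer applies so that positions wrapped around from opposite borders do not attend to each other is omitted.

```python
import numpy as np

def window_partition(x, n):
    """Split an (H, W, C) map into non-overlapping n*n windows."""
    H, W, C = x.shape
    x = x.reshape(H // n, n, W // n, n, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, n * n, C)

def shifted_window_partition(x, n):
    """Cyclically shift the map by n // 2 along both spatial axes, then
    partition it, so that new windows straddle the old window borders."""
    shift = n // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, n)

def reverse_shift(x, n):
    """Undo the cyclic shift after attention has been computed."""
    shift = n // 2
    return np.roll(x, shift=(shift, shift), axis=(0, 1))
```

Alternating unshifted and shifted partitions lets information flow between neighbouring windows while each attention computation remains local.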
[0098] The Swin Transformer model in this embodiment also includes a classification unit (the HEAD layer in Figure 1), which processes the detection results using non-maximum suppression (NMS); see the HEAD module in Figure 1 for details. This ensures that the final obstacle detection result is clearly displayed on the telephoto image. The specific process of NMS is as follows: (1) Sort all detected target boxes by confidence, with the highest-confidence bounding box first. (2) Starting from the highest-confidence bounding box, traverse the remaining bounding boxes in turn. (3) For the currently traversed bounding box, if its overlap with the previously retained bounding boxes (i.e., the IOU value calculated using formulas (4-16) and (4-17) below) is greater than a certain threshold, it is suppressed.
[0099] (4-16)
[0100] (4-17)
[0101] The quantities in formulas (4-16) and (4-17) are: the intersection area of the target bounding boxes of the telephoto and short-focus images; the normalized width and the normalized height of the target bounding box in the short-focus image; the normalized width and the normalized height of the target bounding box in the telephoto image; and the intersection-over-union (IOU) of the target boxes of the telephoto and short-focus images. If the IOU is greater than a certain threshold (e.g., 0.5), the box is suppressed and not retained. (4) Continue with the next target box and repeat the above process until all target boxes have been processed. The target boxes that remain at the end are those of the distant small targets captured by the telephoto image, which has been fused with the short-focus image to supplement the edge feature information missing from the telephoto image.
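By way of non-limiting illustration, steps (1) to (4) of NMS can be sketched as follows. This is a minimal sketch: the function names `iou` and `nms`, the corner format `(x1, y1, x2, y2)` for boxes, and the use of the standard intersection-over-union definition (intersection area divided by the sum of the two box areas minus the intersection area) are assumptions made for the example and are not taken from formulas (4-16) and (4-17).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by non-maximum suppression."""
    # (1) Sort boxes by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    # (2)-(4) Traverse in order; suppress a box if it overlaps a kept box.
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

The threshold of 0.5 matches the example value given above; other values trade off duplicate removal against the risk of suppressing distinct nearby targets.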
[0102] Finally, in this embodiment, the fusion module performs decision-level fusion of the target detection results of the short focal length wide-field-of-view image, obtained with the YOLOvx target detection model, and the long-range target detection results of the long focal length image described above. First, to eliminate redundancy and duplication in the detection results, non-maximum suppression (NMS) is also applied to the detection results of the short focal length field of view. NMS compares the confidence of the detection boxes in overlapping areas, removes the low-confidence detection boxes in the short focal length image, and retains the most representative target boxes. Then, the long-distance small target detection results obtained from the long focal length image in the previous step are fused at the decision level with the detection results of the short focal length wide-field-of-view image. Following the principle of NMS, when the same target is detected in both the long and short focal length images, the detection boxes are filtered by confidence and the box with the higher confidence is kept. In this step, the short focal length detection results compensate for the blind spots outside the long focal length lens caused by the narrow field of view of the long focal length camera.
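By way of non-limiting illustration, the decision-level fusion can be sketched as follows. This is a sketch under assumptions that are not stated in this embodiment: detections are tuples `(x1, y1, x2, y2, score)`; telephoto boxes are mapped into short-focus pixel coordinates using a hypothetical registration offset `(u, v)`, measured in the enlarged short-focus image, and the magnification factor `scale`; and edge padding of the telephoto image is ignored. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes whose first four entries are
    (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def to_short_frame(det, offset, scale):
    """Map a telephoto detection into short-focus pixel coordinates.
    Assumption: `offset` is the registered region's top-left corner in the
    enlarged short-focus image, and `scale` is the magnification factor."""
    x1, y1, x2, y2, score = det
    u, v = offset
    return ((u + x1) / scale, (v + y1) / scale,
            (u + x2) / scale, (v + y2) / scale, score)

def fuse_detections(long_dets, short_dets, offset, scale, iou_threshold=0.5):
    """Merge both detection lists; where boxes overlap, keep the one with
    the higher confidence."""
    merged = [to_short_frame(d, offset, scale) for d in long_dets]
    merged += list(short_dets)
    merged.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for d in merged:
        if all(iou(d, k) <= iou_threshold for k in kept):
            kept.append(d)
    return kept
```

In this sketch, short-focus detections that fall outside the telephoto field of view never overlap a mapped telephoto box, so they are retained, which corresponds to compensating for the telephoto blind spots.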
[0103] In summary, the autonomous obstacle detection method for urban rail trains based on a long- and short-focus image fusion model provided in Embodiment 2 makes effective use of the image information that the long-focus and short-focus cameras capture of the same area from different perspectives (a consequence of their different installation positions). The region of the short-focus camera image that corresponds to the same area is obtained by the normalized cross-correlation matching method, and this short-focus region is enlarged by bilinear interpolation to the same size as the long-focus camera image. A specified number of pixels are padded at the edges of the long-focus image to ensure the integrity of the small target image: even if the information of some areas is lost, the overall matrix format is maintained and complete loss of image areas is avoided. The Swin Transformer based long- and short-focus dual-path fusion network fuses the images of the same area at long and short focal lengths and fuses and detects long-distance small target information. This solves the problem of missed obstacle detection caused by incomplete small target features in the edge areas of the long-focus image, and effectively completes the detection of abnormal intrusion obstacles at a distance of no less than 200 meters and of small targets as small as 40cm*40cm. In the decision-level fusion of short-focus wide-field-of-view target detection, based on the YOLOvx model, and long-focus small target detection, non-maximum suppression (NMS) is first applied to the detection results of the short-focus image to remove redundancy and retain high-confidence bounding boxes. Then, following the NMS principle, the same target detected in both the short-focus and long-focus images is filtered by bounding box confidence, and the detection box with the higher confidence is selected. This fusion strategy effectively compensates for the detection blind spots caused by the narrow field of view of long-focus cameras, ensuring the ability to detect obstacles at long distances with a wide field of view.
[0104] The method described in this embodiment was tested for target detection and recognition on a real subway tunnel dataset. It effectively detected a moving pedestrian at a distance of 200 meters, a 40cm x 40cm box, and incomplete distant small targets (boxes) with missing features, as shown in Figures 2 to 4. The detection accuracy (True Positive) at a distance of 150 m is 85%, 78%, and 82%, respectively, which basically meets the needs of autonomous obstacle target recognition in urban rail transit.
[0105] Embodiment 3
[0106] Embodiment 3 provides a non-transitory computer-readable storage medium for storing computer instructions which, when executed by a processor, implement the autonomous obstacle detection method for urban rail trains as described above. The method includes:
[0107] Acquire telephoto and short-focus images captured in the urban rail train scene;
[0108] A pre-trained long-range small target detection model is used to process acquired telephoto and short-focus images to obtain long-range small target detection results. The long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit. The preprocessing unit performs bilinear interpolation to enlarge the short-focus image. The registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method. The supplementation unit fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information where some features are missing at the edges of the telephoto image. It also fuses the features of the telephoto and short-focus images, using the complete small target feature information from the short-focus image to supplement the small target feature information where some features are missing at the edges of the telephoto image. The classification unit processes the feature-fused image to obtain the long-range small target detection results.
[0109] The short focal length image is processed using a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image;
[0110] The target detection results of short focal length wide field of view images are fused with the target detection results of distant small targets at the decision level to obtain the final target detection result.
[0111] Embodiment 4
[0112] Embodiment 4 provides a computer device including a memory and a processor, wherein the processor and the memory communicate with each other, and the memory stores program instructions executable by the processor. The processor calls the program instructions to execute the autonomous obstacle detection method for urban rail trains as described above, the method including:
[0113] Acquire telephoto and short-focus images captured in the urban rail train scene;
[0114] A pre-trained long-range small target detection model is used to process acquired telephoto and short-focus images to obtain long-range small target detection results. The long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit. The preprocessing unit performs bilinear interpolation to enlarge the short-focus image. The registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method. The supplementation unit fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information where some features are missing at the edges of the telephoto image. It also fuses the features of the telephoto and short-focus images, using the complete small target feature information from the short-focus image to supplement the small target feature information where some features are missing at the edges of the telephoto image. The classification unit processes the feature-fused image to obtain the long-range small target detection results.
[0115] The short focal length image is processed using a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image;
[0116] The target detection results of short focal length wide field of view images are fused with the target detection results of distant small targets at the decision level to obtain the final target detection result.
[0117] Embodiment 5
[0118] Embodiment 5 provides an electronic device including a processor, a memory, and a computer program, wherein the processor is connected to the memory and the computer program is stored in the memory. When the electronic device is running, the processor executes the computer program stored in the memory to cause the electronic device to execute instructions implementing the autonomous obstacle detection method for urban rail trains as described above, the method including:
[0119] Acquire telephoto and short-focus images captured in the urban rail train scene;
[0120] A pre-trained long-range small target detection model is used to process acquired telephoto and short-focus images to obtain long-range small target detection results. The long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit. The preprocessing unit performs bilinear interpolation to enlarge the short-focus image. The registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method. The supplementation unit fills the edge regions of the telephoto image with several pixels of 0 to complete the small target information where some features are missing at the edges of the telephoto image. It also fuses the features of the telephoto and short-focus images, using the complete small target feature information from the short-focus image to supplement the small target feature information where some features are missing at the edges of the telephoto image. The classification unit processes the feature-fused image to obtain the long-range small target detection results.
[0121] The short focal length image is processed using a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image;
[0122] The target detection results of short focal length wide field of view images are fused with the target detection results of distant small targets at the decision level to obtain the final target detection result.
[0123] In summary, the method provided by this invention addresses, to the greatest extent possible, the two major problems of obstacle detection technology based on long and short focal length images: short detection distance and narrow detection field of view. Using long focal length images to detect small targets at a distance greatly improves the system's early warning capability and shortens reaction time. For example, when a train is traveling at high speed, an obstacle detection system that detects obstacles sufficiently far ahead gives the train control system ample time to brake or take other evasive measures, which reduces the risk of accidents. First, a novel edge pixel padding method for long focal length images is introduced. Padding the image edges with several pixels of value 0 solves the problem of edge feature loss caused by the narrow field of view of long focal length cameras and ensures the integrity of the small target image. Even if information in some areas of a distant small target is lost, padding maintains the overall matrix format and enhances the reliability of detection. The same field-of-view region of the short focal length image, enlarged by bilinear interpolation, supplements the incomplete small target information, avoiding the loss of small target features that occurs when image edges are ignored in long focal length detection of distant targets. Second, a dual-path image fusion method for long-distance small targets is proposed. The method uses bilinear interpolation to upscale the short focal length camera image to match the resolution of the telephoto camera, enabling detailed reconstruction of the same region. The upscaled image is then combined with the telephoto image and deeply fused using a dual-path fusion network based on the Swin Transformer. This network fuses images of the same region at long and short focal lengths and detects long-distance small targets. It effectively solves the problem of missed obstacle detection caused by missing small target features in the edge regions of telephoto images, while reducing computational complexity to linear in image size. The dual-path fusion strategy significantly improves the accuracy of long-distance small target detection and avoids the reliance of traditional methods on a single viewpoint from either a telephoto or a short focal length camera. Traditional methods are limited by the narrow field of view of telephoto cameras for distant targets and rely on the wide-angle characteristics of short focal length cameras for close-range targets, so image information within the same scene is insufficiently fused and used. With the above strategy, long-distance small targets are captured efficiently and as a whole, enhancing the comprehensiveness and accuracy of the detection system. Finally, for the detection and recognition of short-range wide-field-of-view targets, mature deep learning algorithms such as YOLOv5 and SSD are employed. Combining the high-resolution detection of distant small targets by the telephoto camera with the wide field of view of the short focal length camera achieves accurate detection of long-distance small targets within a wide field of view. This fusion method meets the requirement of detecting abnormal encroaching obstacles at distances exceeding 200 meters and enables the detection of small targets with a minimum size of 40cm*40cm, significantly improving the efficiency and accuracy of autonomous obstacle detection for urban rail trains.
[0124] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0125] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0126] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0127] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, such that a series of operational steps are performed on the computer or other programmable equipment to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0128] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that, based on the technical solutions disclosed in the present invention, various modifications or variations that can be made by those skilled in the art without creative effort should be included within the scope of protection of the present invention.
Claims
1. A method for autonomous obstacle detection of urban rail trains based on long and short focal length image fusion, characterized in that it comprises:
acquiring telephoto and short-focus images captured in the urban rail train scene;
processing the acquired telephoto and short-focus images with a pre-trained long-range small target detection model to obtain long-range small target detection results, wherein the long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit; the preprocessing unit enlarges the short-focus image by bilinear interpolation; the registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method; the supplementation unit pads the edge regions of the telephoto image with several pixels of value 0 in order to complete the small target information for which some features are missing at the edges of the telephoto image, and performs feature fusion with the short-focus image so that the complete small target feature information of the short-focus image supplements the small target feature information that is partially missing at the edges of the telephoto image, which includes: introducing a self-attention mechanism to calculate the correlation between different positions in the feature sequence, dynamically allocating weights, focusing on the small target regions with missing features in the telephoto image, and extracting relevant information from the short-focus image to supplement them; and introducing a positional encoding mechanism to embed spatial position information into the feature sequence, determining the source and spatial position of each feature, and achieving feature alignment and integration between modalities; and the classification unit processes the feature-fused image to obtain the long-range small target detection results;
processing the short-focus image with a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image; and
fusing the target detection results of the short focal length wide field of view image with the long-range small target detection results at the decision level to obtain the final target detection result, which includes: applying non-maximum suppression (NMS) to the long-range small target detection results to eliminate redundant and duplicate detections; applying NMS to the detection boxes of the short-focus image, comparing the confidence of the detection boxes in overlapping areas, removing the boxes with low confidence, and retaining the most representative target boxes; and applying NMS to the same target detected in both the telephoto and short-focus images, filtering according to the confidence of the target boxes and retaining the detection boxes with higher confidence.
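By way of illustration only, registration by normalized cross-correlation can be sketched as an exhaustive template search. The sketch assumes single-channel images stored as lists of rows, uses the zero-mean form of the correlation, and treats the telephoto view as the template searched for inside the enlarged short-focus image; the function names and sample values are hypothetical.

```python
import math


def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-size 2-D patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A flat patch has no defined correlation; treat it as no match.
    return num / den if den > 0 else 0.0


def find_registration_region(search, template):
    """Return ((row, col), score) of the best-matching window in `search`."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)
    for r in range(len(search) - th + 1):
        for c in range(len(search[0]) - tw + 1):
            patch = [row[c:c + tw] for row in search[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score


# Hypothetical enlarged short-focus image and a telephoto template that shows
# the same content with a different gain and offset (2 * value + 5).
enlarged_short_focus = [
    [1, 2, 3, 4],
    [2, 1, 9, 2],
    [3, 2, 4, 8],
    [1, 3, 2, 1],
]
telephoto_template = [[23, 9], [13, 21]]
position, score = find_registration_region(enlarged_short_focus, telephoto_template)
```

Because the correlation is normalized, the match at row 1, column 2 is found even though the two cameras are assumed here to differ in brightness and contrast.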
2. The autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion according to claim 1, characterized in that convolutional feature extraction is performed on the telephoto and short-focus images respectively to obtain their feature maps; the feature maps of the telephoto and short-focus images are concatenated and position-encoded; the concatenated, position-encoded features are divided into multiple non-overlapping local windows using a sliding window; local window attention is computed, the self-attention weight matrix is calculated, and the value matrix is weighted by the self-attention weights to obtain the updated feature representation; standard multi-head self-attention is then computed on the shifted windows, and the result is restored to a feature map of the same size as the input and added to the original modality branch as supplementary information; and a weighted bidirectional feature pyramid network is used to achieve multi-scale feature fusion, finally obtaining feature maps of multiple sizes.
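By way of illustration only, the window partitioning, window attention, and window shifting recited in claim 2 can be sketched on a small grid of feature vectors. The sketch makes several simplifying assumptions that are not stated in the claim: a single attention head, identity query/key/value projections, a grid whose sides are divisible by the window size, and no attention masking after the shift. All function names are hypothetical.

```python
import math


def window_partition(fmap, m):
    """Split an H x W grid of feature vectors into non-overlapping m x m windows.

    Assumption: H and W are multiples of m.
    """
    h, w = len(fmap), len(fmap[0])
    windows = []
    for r in range(0, h, m):
        for c in range(0, w, m):
            windows.append([fmap[r + i][c + j] for i in range(m) for j in range(m)])
    return windows


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (illustrative only)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token in the window.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weight the value vectors by the attention weights.
        out.append([sum(wt * v[ch] for wt, v in zip(weights, tokens)) for ch in range(d)])
    return out


def cyclic_shift(fmap, s):
    """Roll the grid up and left by s cells before the shifted-window pass."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[(r + s) % h][(c + s) % w] for c in range(w)] for r in range(h)]


# Hypothetical 4 x 4 grid of 2-channel features (already concatenated and
# position-encoded in the sense of the claim).
fmap = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
windows = window_partition(fmap, 2)
updated = [self_attention(win) for win in windows]
shifted_windows = window_partition(cyclic_shift(fmap, 1), 2)
```

Attention is computed only among the tokens of one window, so for a fixed window size the cost grows in proportion to the number of windows rather than with the square of the number of tokens.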
3. The autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion according to claim 1, characterized in that, in the classification unit, all detected target boxes are sorted by confidence in descending order; starting from the target box with the highest confidence, the remaining target boxes are traversed in sequence; if the overlap between the currently traversed target box and a previously retained target box exceeds a given threshold, the current target box is suppressed and not retained; traversal continues with the next target box until all target boxes have been processed; and the target boxes that finally remain are the target boxes of the distant small targets captured in the telephoto image, with the missing edge feature information of the telephoto image supplemented by the short-focus image.
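By way of illustration only, the suppression procedure of claim 3 can be sketched as greedy non-maximum suppression. The sketch assumes axis-aligned boxes given as `(x1, y1, x2, y2)`, measures overlap by intersection over union, and uses a threshold of 0.5; the claim itself does not fix the overlap measure or the threshold value, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, confidence) pairs."""
    # Sort by confidence, highest first.
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in ordered:
        # Retain the box only if it does not overlap any retained box too much.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept


# Hypothetical detections: the first two boxes cover the same target.
detections = [
    ((12, 12, 52, 52), 0.8),
    ((10, 10, 50, 50), 0.9),
    ((100, 100, 140, 140), 0.7),
]
kept = nms(detections)
```

Here the 0.8-confidence box overlaps the 0.9-confidence box with an intersection over union of about 0.82 and is suppressed, leaving two target boxes.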
4. The autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion according to claim 1, characterized in that padding the edge regions of the telephoto image with several pixels of value 0, in order to complete the small target information for which some features are missing at the edges of the telephoto image, includes: padding the edge regions of the original telephoto image with pixels of value 0 to obtain a padded image of larger size, wherein the number of padding pixels is determined based on the size of the small target being identified.
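By way of illustration only, the padding of claim 4 can be sketched as follows. The sketch assumes that the same number of zero-valued pixels, here called `p`, is added on all four sides, so that an image of height H and width W becomes (H + 2p) by (W + 2p); the symbol `p`, the symmetric layout, and the function name are assumptions made for this example.

```python
def zero_pad(img, p):
    """Surround a 2-D image (list of rows) with a border of p zero-valued pixels.

    Assumption: the same border width p is used on all four sides.
    """
    width = len(img[0])
    blank = [0] * (width + 2 * p)
    padded = [blank[:] for _ in range(p)]              # top border
    for row in img:
        padded.append([0] * p + list(row) + [0] * p)   # left and right borders
    padded.extend(blank[:] for _ in range(p))          # bottom border
    return padded


# Hypothetical 2 x 2 telephoto image padded by one pixel on each side.
telephoto = [[1, 2], [3, 4]]
padded = zero_pad(telephoto, 1)
```

The border width would in practice be chosen from the expected pixel size of the small target, so that a target cut off at the image edge still fits within the padded matrix.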
5. An autonomous obstacle detection system for urban rail trains based on long and short focal length image fusion, characterized in that it comprises:
an acquisition module, used to acquire telephoto and short-focus images captured in the urban rail train scene;
a first detection module, used to process the acquired telephoto and short-focus images with a pre-trained long-range small target detection model to obtain long-range small target detection results, wherein the long-range small target detection model includes a preprocessing unit, a registration unit, a supplementation unit, and a classification unit; the preprocessing unit enlarges the short-focus image by bilinear interpolation; the registration unit obtains the registration regions of the telephoto and short-focus images based on a normalized cross-correlation matching method; the supplementation unit pads the edge regions of the telephoto image with several pixels of value 0 in order to complete the small target information for which some features are missing at the edges of the telephoto image, and performs feature fusion between the telephoto and short-focus images so that the complete small target feature information of the short-focus image supplements the small target feature information that is partially missing at the edges of the telephoto image, which includes: introducing a self-attention mechanism to calculate the correlation between different positions in the feature sequence, dynamically allocating weights, focusing on the small target regions with missing features in the telephoto image, and extracting relevant information from the short-focus image to supplement them; and introducing a positional encoding mechanism to embed spatial position information into the feature sequence, determining the source and spatial position of each feature, and achieving feature alignment and integration between modalities; and the classification unit processes the feature-fused image to obtain the long-range small target detection results;
a second detection module, used to process the short-focus image with a short focal length wide field of view target detection model to obtain the target detection results of the short focal length wide field of view image; and
a fusion module, used to fuse the target detection results of the short focal length wide field of view image with the long-range small target detection results at the decision level to obtain the final target detection result, which includes: applying non-maximum suppression (NMS) to the long-range small target detection results to eliminate redundant and duplicate detections; applying NMS to the detection boxes of the short-focus image, comparing the confidence of the detection boxes in overlapping areas, removing the boxes with low confidence, and retaining the most representative target boxes; and applying NMS to the same target detected in both the telephoto and short-focus images, filtering according to the confidence of the target boxes and retaining the detection boxes with higher confidence.
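By way of illustration only, the last step of the fusion module, which keeps the higher-confidence box when the same target is detected by both models, can be sketched as follows. The sketch assumes that the boxes of both detectors have already been mapped into one shared pixel coordinate frame and that overlap is measured by intersection over union with a threshold of 0.5; these assumptions, the dictionary layout, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(long_range, wide_fov, iou_threshold=0.5):
    """Merge two detection lists, keeping the higher-confidence box per target.

    Each detection is a dict with 'box' and 'conf'; boxes are assumed to be
    expressed in the same coordinate frame.
    """
    tagged = [dict(d, source="long_range") for d in long_range]
    tagged += [dict(d, source="wide_fov") for d in wide_fov]
    tagged.sort(key=lambda d: d["conf"], reverse=True)
    fused = []
    for det in tagged:
        # A lower-confidence box that overlaps a retained one is the same target.
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in fused):
            fused.append(det)
    return fused


# Hypothetical outputs of the first and second detection modules.
long_range = [{"box": (100, 100, 120, 120), "conf": 0.85}]
wide_fov = [
    {"box": (101, 101, 121, 121), "conf": 0.40},  # same distant target, weaker
    {"box": (300, 200, 400, 320), "conf": 0.90},  # nearby large target
]
final = fuse_detections(long_range, wide_fov)
```

In this example the distant target is reported once, from the long-range model, and the nearby target is reported from the wide field of view model.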
6. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions which, when executed by a processor, implement the autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion as described in any one of claims 1-4.
7. A computer device, characterized in that it comprises a memory and a processor that communicate with each other, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion as described in any one of claims 1-4.
8. An electronic device, characterized in that it comprises a processor, a memory, and a computer program, wherein the processor is connected to the memory and the computer program is stored in the memory; when the electronic device is running, the processor executes the computer program stored in the memory to cause the electronic device to execute instructions for implementing the autonomous obstacle detection method for urban rail trains based on long and short focal length image fusion as described in any one of claims 1-4.