A high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation

By using a multi-frame visual feature aggregation method, multi-scale feature maps of railway surveillance videos are acquired and fused, which solves the problem of poor adaptability of existing detection methods to environmental changes and improves the accuracy and reliability of railway perimeter intrusion detection.

CN122200527A · Pending · Publication Date: 2026-06-12 · BEIJING JIAOTONG UNIV +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING JIAOTONG UNIV
Filing Date
2024-12-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing railway perimeter intrusion detection methods are poorly adaptable to environmental changes and are easily affected by obstruction and light interference, leading to frequent false alarms and missed alarms, and failing to effectively ensure railway operation safety.

Method used

A multi-frame visual feature aggregation method is adopted to acquire real-time image frames captured by monitoring equipment and extract feature maps at different scales. Feature maps of multiple historical image frames are acquired by combining preset quality conditions, and feature alignment and aggregation are performed. Multi-scale feature maps are then fused to determine the railway perimeter intrusion identification results.

Benefits of technology

By fusing multi-frame information, the accuracy and fault tolerance of detection are improved, the adaptability to environmental changes is enhanced, false alarms and missed alarms are reduced, and the safety of railway operations is ensured.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a railway perimeter intrusion high-confidence recognition method based on multi-frame visual feature aggregation. The method acquires real-time image frames in a railway monitoring video captured by a monitoring device, extracts real-time feature maps of different scales of the real-time image frames, acquires historical feature maps of different scales corresponding to multiple historical image frames meeting preset quality conditions when the quality of the real-time image frames meets preset aggregation recognition conditions, and performs feature alignment on the real-time feature maps and the historical feature maps. After alignment, the real-time feature maps and the aligned historical feature maps of multiple frames at each scale are aggregated to obtain aggregated feature maps at each scale. The aggregated feature maps at each scale are fused, and railway image fusion features of different scales are output to determine a recognition result of railway perimeter intrusion, so that the detection accuracy is improved through multi-frame and multi-scale information fusion.

Description

Technical Field

[0001] This application relates to the field of railway system protection technology, and more specifically, to a high-reliability identification method for railway perimeter intrusion based on multi-frame visual feature aggregation.

Background Technology

[0002] Due to the length of railway lines, the diversity of surrounding environments, and insufficient perimeter protection, railway systems face potential threats from various sources, such as large boulders, pedestrians, and wild animals. The intrusion of these foreign objects can threaten railway operational safety and potentially lead to serious accidents. To ensure train operational safety, it is necessary to promptly detect and prevent railway foreign object intrusion incidents.

Summary of the Invention

[0003] In view of this, the purpose of this application is to provide a high-reliability identification method for railway perimeter intrusion based on multi-frame visual feature aggregation, which can solve the problem that single-frame image detection methods fail when affected by occlusion and light interference, and improve the detection accuracy.

[0004] This application provides a high-reliability method for railway perimeter intrusion identification through multi-frame visual feature aggregation, the method comprising:

[0005] Acquire real-time image frames from railway monitoring videos captured by monitoring equipment, and extract real-time feature maps of different scales from the real-time image frames;

[0006] When the quality of the real-time image frame meets the preset aggregation and recognition conditions, historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions are obtained.

[0007] The historical feature maps of each scale and the real-time feature map are aligned to obtain the aligned historical feature maps of each scale.

[0008] The real-time feature map of each scale and the aligned multi-frame historical feature map are aggregated to obtain the aggregated feature map of each scale;

[0009] The aggregated feature maps at each scale are fused together, and railway image fusion features at different scales are output. Based on the railway image fusion features at different scales, the identification result of railway perimeter intrusion is determined.

[0010] In some embodiments, in the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, after acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, the method further includes:

[0011] When the quality of the real-time image frame does not meet the preset aggregation and recognition conditions, the recognition result of the railway perimeter intrusion is determined based on the real-time image frame.

[0012] In some embodiments, in the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation, acquiring historical feature maps of different scales corresponding to multiple historical image frames that meet preset quality conditions comprises:

[0013] The latest first historical image frame is scored based on the trained image scoring module to determine the quality score of the latest first historical image frame.

[0014] When the quality score of the latest first historical image frame is greater than a first preset threshold of the memory bank, the historical feature map of the latest first historical image frame is added to the memory bank; wherein the storage capacity of the memory bank is the target number of frames, and when the memory bank is full, the historical feature map of the earliest stored historical image frame is removed and the historical feature map of the latest first historical image frame is stored; the historical feature maps in the memory bank include historical feature maps of different scales;

[0015] When the quality score of the latest first historical image frame is less than or equal to the first preset threshold, the memory bank is not updated;

[0016] Obtain historical feature maps at different scales corresponding to historical image frames of the target frame number in the memory bank.

[0017] In some embodiments, in the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, after not updating the memory bank when the quality score of the historical image frame is less than or equal to the first preset threshold, the method further includes:

[0018] Record the number and quality score of consecutive historical image frames whose quality score is less than or equal to a first preset threshold.

[0019] When the number of consecutive historical image frames reaches a preset threshold, the comparison threshold of the memory bank is determined as a second preset threshold based on the quality score of the consecutive historical image frames.

[0020] The memory bank is updated based on the second preset threshold until the quality score of the latest second historical image frame is greater than the first preset threshold, at which point the comparison threshold of the memory bank is restored to the first preset threshold.
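The memory bank update strategy in [0013] to [0020] can be sketched as follows. This is only an illustrative sketch: the names `MemoryBank`, `offer`, and `patience` are hypothetical, and the text does not say how the second preset threshold is derived from the recorded quality scores, so the mean of the consecutive low scores is used here purely as an assumption.

```python
from collections import deque
from statistics import mean


class MemoryBank:
    """FIFO store of per-frame feature maps, gated by a quality score."""

    def __init__(self, capacity, first_threshold, patience):
        # deque(maxlen=...) drops the earliest stored entry when full.
        self.frames = deque(maxlen=capacity)
        self.first_threshold = first_threshold
        self.threshold = first_threshold  # comparison threshold in force
        self.patience = patience          # consecutive low frames before relaxing
        self.low_scores = []              # scores of consecutive low-quality frames

    def offer(self, features, score):
        """Try to store one frame's feature maps; return True if stored."""
        if score > self.first_threshold:
            # Quality recovered: restore the first threshold and store.
            self.threshold = self.first_threshold
            self.low_scores.clear()
            self.frames.append(features)
            return True
        if self.threshold == self.first_threshold:
            # Normal mode: do not update, but record the low-quality streak.
            self.low_scores.append(score)
            if len(self.low_scores) >= self.patience:
                # ASSUMPTION: second threshold = mean of the recorded scores.
                self.threshold = mean(self.low_scores)
            return False
        # Relaxed mode: compare against the second threshold.
        if score > self.threshold:
            self.frames.append(features)
            return True
        return False
```

With `capacity=2`, `first_threshold=0.6`, and `patience=3`, three consecutive frames scoring 0.5, 0.4, and 0.45 are rejected and relax the comparison threshold to about 0.45, so a following frame scoring 0.5 is stored. The first frame that scores above 0.6 restores the original threshold.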

[0021] In some embodiments, in the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, the image scoring module is trained based on the following method:

[0022] An image scoring training set is constructed for the image scoring module; the image scoring training set includes single-frame sample images from railway monitoring videos captured by monitoring equipment; the single-frame sample images are marked with the true bounding boxes of the target objects;

[0023] The target object in the single-frame sample image of the image scoring training set is identified by the single-frame image target detection model, and the detection bounding box of the target object is determined.

[0024] Based on the comparison results of the ground truth bounding box and the detected bounding box of a single frame sample image, the score label of the single frame sample image is determined, and the image scoring training set is updated.

[0025] The image scoring module is trained based on the updated image scoring training set to obtain a trained image scoring module.

[0026] In some embodiments, in the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, aligning the multi-frame historical feature maps and the real-time feature maps at each scale to obtain multi-frame aligned historical feature maps at each scale comprises:

[0027] For each scale, the historical feature map and the real-time feature map are concatenated to obtain the concatenated feature map.

[0028] The concatenated feature map is fed into a deformable convolutional block to determine the offset between the historical feature map and the real-time feature map at this scale.

[0029] The historical feature map is processed based on the offset to obtain a historical feature map at that scale that is aligned with the real-time feature map.
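The idea of estimating an offset and then warping the historical feature map can be illustrated with a deliberately simplified stand-in. The method described above predicts offsets with a deformable convolutional block; the sketch below does not implement that. It instead searches for a single global integer shift by brute force, which is an assumption made only to keep the example self-contained. The function names `shift`, `estimate_offset`, and `align` are hypothetical.

```python
def shift(fmap, dy, dx, fill=0.0):
    """Translate a 2-D map by (dy, dx); cells with no source get `fill`."""
    h, w = len(fmap), len(fmap[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out


def estimate_offset(historical, realtime, max_shift=2):
    """Return the integer (dy, dx) minimising the summed absolute difference."""
    best, best_cost = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            moved = shift(historical, dy, dx)
            cost = sum(
                abs(a - b)
                for row_m, row_r in zip(moved, realtime)
                for a, b in zip(row_m, row_r)
            )
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


def align(historical, realtime, max_shift=2):
    """Warp the historical map onto the real-time map using the best shift."""
    dy, dx = estimate_offset(historical, realtime, max_shift)
    return shift(historical, dy, dx)
```

A learned deformable block would produce per-location offsets rather than one global shift, so this sketch only conveys the two-stage structure (estimate offset, then resample the historical map).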

[0030] In some embodiments, the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, wherein aggregating the real-time feature map at each scale and the aligned multi-frame historical feature map to obtain the aggregated feature map at each scale includes:

[0031] For the real-time feature map and the aligned multi-frame historical feature maps at each scale, the multi-frame historical feature maps are aggregated based on the first weight of each historical feature map at that scale to obtain the aggregated historical feature map at that scale; wherein the closer in time a historical feature map is to the real-time feature map, the greater its first weight;

[0032] The difference between the aggregated historical feature map and the real-time feature map, the difference between the real-time feature map and the aggregated historical feature map, the aggregated historical feature map, and the real-time feature map are concatenated to obtain the concatenated feature map at this scale.

[0033] The concatenated feature map at this scale is processed to determine the second weight of each pixel in it; the second weight represents the contribution of each pixel to the aggregated feature map at this scale.

[0034] The concatenated feature map is processed based on the second weight of each pixel to obtain the aggregated feature map at this scale.
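The first two aggregation steps ([0031] and [0032]) can be sketched as below. Feature maps are represented as flat lists of floats for brevity. The softmax over negative frame age is an assumption: the text only requires that frames closer in time receive a larger first weight. The per-pixel second weight of [0033] and [0034] comes from processing that the text does not specify, so it is left out. All function names are hypothetical.

```python
import math


def temporal_weights(ages, temperature=1.0):
    """Softmax over negative frame age: more recent frames weigh more.

    ASSUMPTION: the source does not give the weighting function.
    """
    exps = [math.exp(-age / temperature) for age in ages]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_history(history, ages):
    """Weighted sum of aligned historical maps (each a flat list of floats)."""
    weights = temporal_weights(ages)
    size = len(history[0])
    return [
        sum(w * fmap[i] for w, fmap in zip(weights, history))
        for i in range(size)
    ]


def build_concat_input(agg_hist, realtime):
    """Stack the four maps named in [0032] along a leading channel axis."""
    hist_minus_rt = [h - r for h, r in zip(agg_hist, realtime)]
    rt_minus_hist = [r - h for h, r in zip(agg_hist, realtime)]
    return [hist_minus_rt, rt_minus_hist, agg_hist, realtime]
```

Here `ages` holds how many frames back each historical map lies, so an age of 1 is the frame immediately before the real-time frame.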

[0035] In some embodiments, the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, wherein fusing the aggregated feature maps at each scale and outputting railway image fusion features at different scales includes:

[0036] The third weight of each channel of the aggregated feature map at each scale is determined, and the aggregated feature maps at different scales are fused based on the weight of each channel to output the fused features of railway images at different scales.

[0037] The third weight represents the contribution of the channel.

[0038] In some embodiments, in the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation, determining the third weight of each channel of the aggregated feature map at each scale, and fusing the aggregated feature maps at different scales based on the third weight of each channel to output railway image fusion features at different scales, comprises:

[0039] The aggregated feature maps at each scale are fused using a PAN structure, and a third weight is applied to the channels of the aggregated feature maps at each scale through a channel attention module pre-configured in the PAN structure; wherein the third weight represents the contribution of the channel.

[0040] The PAN structure extracts and outputs railway image fusion features at different scales from convolutional layers of different scales.
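The effect of a per-channel third weight can be illustrated with a minimal channel-gating sketch. The text does not describe the internals of the channel attention module, so the design below (global average pooling followed by a sigmoid, with no learned layers) is an assumption chosen for simplicity, and the names `channel_weights` and `reweight` are hypothetical.

```python
import math


def channel_weights(fmap):
    """Return one weight in (0, 1) per channel.

    `fmap` is a list of channels, each a 2-D list of floats.
    ASSUMPTION: weight = sigmoid(global average of the channel).
    """
    weights = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)               # global average pooling
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate
    return weights


def reweight(fmap):
    """Scale every channel by its own weight."""
    weights = channel_weights(fmap)
    return [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, fmap)
    ]
```

In a trained module the gate would normally be computed by learned layers, so that channels with a larger contribution are emphasised before the multi-scale fusion.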

[0041] In some embodiments, a railway perimeter intrusion identification system is also provided, the system including field devices, communication devices, and a server; the field devices include: field control devices, monitoring devices, and alerting devices;

[0042] The monitoring device is connected to the server via a communication device; the server is connected to the field control device, and the field control device is connected to the alerting device.

[0043] The monitoring equipment is used to capture railway monitoring videos and send the captured railway monitoring videos to the server;

[0044] The server is used to receive railway monitoring videos captured by monitoring equipment, and execute the steps of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation to determine the identification result of railway perimeter intrusion; when the identification result is abnormal, it sends a prompt signal to the field control equipment.

[0045] The field control device is used to receive the prompt signal and control the alerting device to operate based on the prompt signal.

[0046] This application provides a high-reliability method for identifying railway perimeter intrusion based on multi-frame visual feature aggregation. The method acquires real-time image frames from railway monitoring videos captured by monitoring equipment and extracts real-time feature maps at different scales from these frames. When the quality of the real-time image frames meets preset aggregation and identification conditions, historical feature maps at different scales corresponding to multiple historical image frames that meet the preset quality conditions are acquired. The historical feature maps at each scale are aligned with the real-time feature maps to obtain aligned historical feature maps at each scale. The real-time feature maps at each scale and the aligned historical feature maps at each scale are aggregated to obtain aggregated feature maps at each scale. The aggregated feature maps at each scale are fused, and railway image fusion features at different scales are output. The railway perimeter intrusion identification result is determined based on the railway image fusion features at different scales. Thus, by fusing multi-frame information, the problem of single-frame image detection methods failing due to occlusion and light interference is solved. Simultaneously, information fusion based on multi-scale features enriches the information contained in the feature maps, greatly improving detection accuracy.

Attached Figure Description

[0047] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 shows a flowchart of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in an embodiment of this application;

[0049] Figure 2 shows a flowchart of the multi-scale feature fusion described in an embodiment of this application;

[0050] Figure 3 shows a flowchart of training the image scoring module described in an embodiment of this application;

[0051] Figure 4 shows a schematic diagram of the image scoring module described in an embodiment of this application;

[0052] Figure 5 shows a flowchart of an example of the memory bank update strategy described in an embodiment of this application;

[0053] Figure 6 shows a flowchart of another example of the memory bank update strategy described in an embodiment of this application;

[0054] Figure 7 shows a flowchart of the inter-frame feature alignment method according to an embodiment of this application;

[0055] Figure 8 shows a flowchart of the multi-frame feature aggregation method according to an embodiment of this application;

[0056] Figure 9 shows the overall flowchart of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in this application, as applied to a railway foreign object intrusion identification system;

[0057] Figure 10 shows a schematic block diagram of the railway perimeter intrusion identification system described in an embodiment of this application.

Detailed Implementation

[0058] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the accompanying drawings in this application are for illustrative and descriptive purposes only and are not intended to limit the scope of protection of this application. Furthermore, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of this application. It should be understood that the operations in the flowcharts may not be implemented in sequence, and steps without logical contextual relationships may be reversed or implemented simultaneously. In addition, those skilled in the art, guided by the content of this application, may add one or more other operations to the flowcharts, or remove one or more operations from the flowcharts.

[0059] Furthermore, the described embodiments are merely some, not all, of the embodiments of this application. The components of the embodiments of this application described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0060] It should be noted that the term "comprising" will be used in the embodiments of this application to indicate the presence of the features declared thereafter, but does not exclude the addition of other features.

[0061] Due to the length of railway lines, the diversity of surrounding environments, and insufficient perimeter protection, railway systems face potential threats from various sources, such as large boulders, pedestrians, and wild animals. The intrusion of these foreign objects can threaten railway operational safety and potentially lead to serious accidents. To ensure train operational safety, it is necessary to promptly detect and prevent railway foreign object intrusion incidents.

[0062] In some existing technologies, foreign object intrusion into railway perimeters can be detected based on video detection. However, existing video detection methods are poorly adaptable to environmental changes and have poor generalization ability. Weather changes or environmental changes may have a significant impact on the detection results, leading to frequent false alarms and missed alarms in existing detection methods.

[0063] Based on this, this application provides a high-reliability identification method for railway perimeter intrusion based on multi-frame visual feature aggregation. The method acquires real-time image frames from railway monitoring videos captured by monitoring equipment and extracts real-time feature maps at different scales from the real-time image frames. When the quality of the real-time image frames meets preset aggregation identification conditions, historical feature maps at different scales corresponding to multiple historical image frames that meet the preset quality conditions are acquired. The multi-frame historical feature maps and the real-time feature maps at each scale are aligned to obtain multi-frame historical feature maps aligned at each scale. The real-time feature maps and the aligned multi-frame historical feature maps at each scale are aggregated to obtain aggregated feature maps at each scale. The aggregated feature maps at each scale are fused, and railway image fusion features at different scales are output. The identification result of railway perimeter intrusion is determined based on the railway image fusion features at different scales. Thus, by fusing multi-frame information, the problem of single-frame image detection methods failing due to occlusion and light interference is solved. Simultaneously, information fusion based on multi-scale features enriches the information contained in the feature maps, greatly improving detection accuracy.

[0064] Please refer to Figure 1, which shows a flowchart of the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in an embodiment of this application. As shown in Figure 1, the method includes the following steps S101-S105:

[0065] S101. Acquire real-time image frames from railway monitoring videos captured by monitoring equipment, and extract real-time feature maps of different scales from the real-time image frames;

[0066] S102. When the quality of the real-time image frame meets the preset aggregation recognition conditions, acquire historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions.

[0067] S103. Align the multi-frame historical feature map and the real-time feature map at each scale to obtain the multi-frame historical feature map after each scale alignment.

[0068] S104. Aggregate the real-time feature map of each scale and the aligned multi-frame historical feature map to obtain the aggregated feature map of each scale.

[0069] S105. Fuse the aggregated feature maps at each scale and output the railway image fusion features at different scales. Determine the identification result of railway perimeter intrusion based on the railway image fusion features at different scales.

[0070] In step S101, real-time image frames from railway monitoring videos captured by monitoring equipment are acquired, and real-time feature maps of different scales of the real-time image frames are extracted.

[0071] The monitoring equipment refers to monitoring equipment installed on railways; the monitoring equipment includes cameras, infrared cameras, etc.

[0072] The monitoring equipment needs to be deployed according to the railway layout and monitoring requirements, for example, by deploying multiple camera nodes at certain intervals.

[0073] The real-time image frame in the railway monitoring video is the latest image frame, that is, the current image frame. It is defined relative to the processing progress of the identification method: if the previous n frames have already been identified, then frame n+1 is the real-time image frame.

[0074] The real-time feature maps at different scales represent real-time image frames with different resolutions or granularities. For example, at a larger scale, the focus is on a wide-ranging railway scene and large objects (such as pedestrians), while at a smaller scale, the focus may be on details on the tracks and small objects (such as stones).

[0075] Specifically, the real-time feature maps of the real-time image frame at different scales are extracted by using a convolutional network.

[0076] Specifically, after feature extraction of the real-time image frame through the backbone network of the convolutional network, there will be shallow feature maps and deep feature maps. The features after multiple convolutions contain rich semantic information, but due to the low resolution, the spatial information of the object is not well preserved. On the other hand, although the low-level features have less semantic information, they have rich spatial information of the object due to the high resolution.

[0077] Please refer to Figure 2, which shows a flowchart of the multi-scale feature fusion described in an embodiment of this application. As shown in Figure 2, real-time feature maps at different scales are obtained from the feature maps output at each stage of the backbone network. Fusing feature maps at different scales enriches the information contained in the feature maps and improves detection accuracy.

[0078] In step S102, when the quality of the real-time image frame meets the preset aggregation recognition conditions, historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions are obtained.

[0079] In some embodiments, after acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, the method further includes:

[0080] When the quality of the real-time image frame does not meet the preset aggregation and recognition conditions, the recognition result of the railway perimeter intrusion is determined based on the real-time image frame.

[0081] After acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, it is determined whether the quality of the real-time image frames meets the preset aggregation and recognition conditions. If it does, the aggregation and recognition steps described in steps S101-S105 are executed; otherwise, the real-time image frames are directly processed to determine the recognition result of railway perimeter intrusion.
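The branching described in [0080] and [0081] amounts to a simple dispatch, sketched below. The condition itself is passed in as a callable because it depends on the image scoring module and its threshold; all names here are hypothetical stand-ins, not part of the described method.

```python
def identify_intrusion(frame, meets_aggregation_condition,
                       detect_with_aggregation, detect_single_frame):
    """Route one real-time frame to the aggregation or single-frame path."""
    if meets_aggregation_condition(frame):
        # Condition met: run the multi-frame aggregation path.
        return detect_with_aggregation(frame)
    # Condition not met: identify directly from the real-time frame.
    return detect_single_frame(frame)


# Toy usage with stand-in callables (frames are plain dicts here).
result = identify_intrusion(
    {"id": 7, "needs_aggregation": True},
    meets_aggregation_condition=lambda f: f["needs_aggregation"],
    detect_with_aggregation=lambda f: ("aggregated", f["id"]),
    detect_single_frame=lambda f: ("single", f["id"]),
)
```

Keeping the condition as a separate callable means the scoring logic and threshold can change without touching the routing.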

[0082] The aggregation recognition method described in S102-S105 is executed by calling the feature aggregation module.

[0083] Using the feature aggregation module continuously during video detection increases computational load and reduces detection speed. Therefore, the feature aggregation module is only invoked when image quality is low and the scene is difficult to detect. If image quality is high, railway perimeter intrusion detection is performed directly to improve detection speed.

[0084] In some embodiments, a target detection quality scoring module is provided. The main function of this module is to score the image quality of the input feature map in order to determine whether the quality of the real-time image frame meets the preset aggregation recognition conditions.

[0085] Specifically, after acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, the real-time image frames are scored based on a trained image scoring module to determine their quality scores.

[0086] When the quality score of the real-time image frame meets the preset aggregation and recognition conditions, the feature aggregation module is invoked to obtain historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions, and multi-frame feature aggregation is performed.

[0087] Specifically, the quality score of the real-time image frame meets the preset aggregation recognition conditions when it is greater than the third preset threshold.

[0088] The image scoring module is trained based on the following method:

[0089] An image scoring training set is constructed for the image scoring module; the image scoring training set includes single-frame sample images from railway monitoring videos captured by monitoring equipment; the single-frame sample images are marked with the true bounding boxes of the target objects;

[0090] The target object in the single-frame sample image of the image scoring training set is identified by the single-frame image target detection model, and the detection bounding box of the target object is determined.

[0091] Based on the comparison results of the ground truth bounding box and the detected bounding box of a single frame sample image, the score label of the single frame sample image is determined, and the image scoring training set is updated.

[0092] The image scoring module is trained based on the updated image scoring training set to obtain a trained image scoring module.

[0093] Please refer to Figure 3, which shows a flowchart of training the image scoring module described in an embodiment of this application; the image scoring module is a QS detection head, and the quality score of the image frame is the QS score.

[0094] As shown in Figure 3, the image scoring module is a simple auxiliary object detection head (QS detection head). It calculates an image quality score for each image based on the difference between the detected bounding boxes in the detection result of the image frame and the ground truth bounding boxes that should exist in the image frame. This score is used to determine whether to invoke the feature aggregation module.

[0095] In summary, the image scoring module mainly consists of two processes. The first process is to design an image quality scoring metric (QS score), which is used to quantify an image and obtain its image quality score. The second process is to use the image quality score obtained in the first process to train the image scoring module.

[0096] First, the image quality score is determined by the detected bounding boxes in the results of the single-frame image object detection model and the ground truth bounding boxes in the training set. Here, the detected bounding boxes of the single-frame image object detection model are divided into four types: positive, multi-positive, near-positive, and negative.

[0097] The detected bounding box can also be called the predicted bounding box.

[0098] The four types are defined as follows:
Positive: if the bounding box has the maximum intersection over union (IoU) with the ground truth bounding box and the IoU exceeds the positive threshold (t1), the result is marked as positive.
Multi-positive: if the IoU of the bounding box is greater than t1 but is not the maximum value, the result is classified as multi-positive.
Near-positive: if the IoU of the bounding box does not reach the positive threshold but is numerically close to it (e.g., t1 is 0.5 and the IoU is 0.49), the result is designated as near-positive; the IoU of near-positive samples should be greater than the near-positive threshold (t2).
Negative: if the IoU of the predicted bounding box is less than t2, the result is classified as negative.
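The four categories can be illustrated with a short sketch that labels the predicted boxes for a single ground truth box. Several details are assumptions: boxes are taken as `(x1, y1, x2, y2)` tuples, `t2 = 0.4` is an arbitrary illustrative value, an IoU exactly equal to `t1` is treated as near-positive and one exactly equal to `t2` as negative, and matching of predictions to ground truth boxes is assumed to have been done already. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_predictions(gt_box, pred_boxes, t1=0.5, t2=0.4):
    """Label each predicted box relative to one ground truth box."""
    ious = [iou(gt_box, p) for p in pred_boxes]
    # Index of the prediction with the largest IoU (None if no predictions).
    best = max(range(len(ious)), key=ious.__getitem__) if ious else None
    labels = []
    for i, value in enumerate(ious):
        if value > t1:
            labels.append("positive" if i == best else "multi-positive")
        elif value > t2:
            labels.append("near-positive")
        else:
            labels.append("negative")
    return labels
```

For a ground truth box `(0, 0, 10, 10)`, predictions with IoU 1.0, 0.8, 0.45, and 0.0 are labelled positive, multi-positive, near-positive, and negative respectively.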

[0099] After defining these four types of predicted bounding boxes, the weighted samples, ws, are obtained by calculating according to formula (1):

[0100]

[0101] Here, ws(p) denotes the weighted sample: ws(p=1) is the positive sample weighting and ws(p=0) is the negative sample weighting. L is the set of all manually labeled ground truth boxes, and l is one ground truth box in L; i indexes the i-th predicted box of the network; IoU_i is the IoU value calculated between the i-th predicted bounding box and its corresponding ground truth bounding box l; C_{l,i} is the confidence score of the i-th predicted bounding box, where the subscript l,i indicates that the ground truth bounding box matched to predicted box i is l; t1 is the positive threshold and t2 is the near-positive threshold. I_P, I_NR, I_M, and I_N are the switching (on/off) functions corresponding to the positive, near-positive, multi-positive, and negative predicted boxes, respectively. When p=1, I_P = I_NR = I_M = 1 and I_N = 0, and the positive sample weighting is calculated; when p=0, I_N = 1 and I_P = I_NR = I_M = 0, and the negative sample weighting is calculated. ws is defined in formula (1) so that ws(p=1) and ws(p=0) can be used to calculate the QS score. The four IoU_i ranges under the second summation symbol "∑" in each term of formula (1) correspond to the four types of predicted bounding boxes defined above: positive (max), multi-positive (t1, 1), near-positive (t2, t1), and negative (0, t2).

[0102] After the weighted sample ws is obtained through the above steps, the harmonic mean is used to balance the different samples, borrowing the idea of the F1-score. This requires two quantities: wp, the weighted precision, and wr, the weighted recall. Their calculation formulas are given in (2) and (3):

[0103]

[0104] In formula (3), total_gt_sample is the total number of ground truth proposals in an image, wp represents weighted precision, and wr represents weighted recall. After calculating these two values, the image quality score can be obtained, and its calculation formula (4) is as follows:

[0105]

[0106] Here, QS represents the image quality score; ε represents a preset parameter, which is a very small value used to prevent the denominator from being 0. In this way, the image quality score is calculated, completing the first step of the image scoring module.
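Formulas (2) to (4) are not reproduced in this text, so the following is only a sketch of the F1-style harmonic mean that the description refers to. It assumes that wp and wr have already been computed and that ε is simply added to the denominator; the function name `qs_score` and the default value of ε are illustrative choices, not values taken from the application.

```python
def qs_score(wp, wr, eps=1e-6):
    """Harmonic-mean style score of weighted precision and weighted recall.

    Assumed form (F1-like): 2 * wp * wr / (wp + wr + eps).
    eps keeps the denominator non-zero when wp = wr = 0.
    """
    return 2.0 * wp * wr / (wp + wr + eps)
```

As with the F1-score, this form stays low unless both wp and wr are reasonably high, which is the balancing effect the harmonic mean is chosen for.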

[0107] The second step uses the image quality scores obtained in the first step to train the image scoring module. As shown in Figure 3, the training process consists of four steps: (1) training a single-frame image object detector; (2) obtaining the detection results of the training images; (3) quantifying the QS score; and (4) using the training images and QS scores to train the QS predictor (the image scoring module).

[0108] The first of the four training steps is to train a single-frame image object detector (i.e., a single-frame image object detection model) on the applicable training dataset. For example, the basic detector YOLOX can be selected and trained on the railway foreign object intrusion image dataset. After training, the detection results of the training images are obtained, which is step (2). With the detection results and the ground truth of the training set, the QS score can then be quantified according to the first process described above, which is step (3). At this point the image scoring training set has been constructed, and only step (4) remains.

[0109] Specifically, during the training phase, the railway foreign object intrusion image dataset requires the railway foreign object intrusion monitoring system to collect a large number of video samples covering various possible situations and scenarios. Then, these video datasets are decomposed frame by frame, with each frame treated as an image and labeled to obtain the railway foreign object intrusion video dataset.

[0110] The image scoring training set for the image scoring module was also determined based on the railway foreign object intrusion image dataset.

[0111] Please refer to Figure 4. The QS detector (the image scoring module) is introduced first. It is a lightweight auxiliary head with four layers: a convolutional layer with a 3x3 kernel, an adaptive average pooling layer with a 7x7 kernel, and two fully connected layers. For an image frame x, the QS score is calculated by formula (5):

[0112] y = qs(f(x)), (5);

[0113] Where f(·) represents the feature extraction process of the backbone network in front of the QS detector, qs(·) represents the score operation of the QS detector; y ranges from 0 to 1, y represents the QS score; x represents the image frame. Then, a smooth L1 loss is used to optimize the QS detector, as shown in the following formula (6).

[0114]

[0115] Where qsloss(z) represents the loss function of the QS detector, z = gt - 10y; gt is the ground truth, representing the true QS score of image frame x, and y represents the predicted QS score of image frame x.
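Formula (6) is not reproduced in this text. As a reference point, the commonly used smooth L1 loss can be sketched as below, applied to z = gt - 10y as defined above. The transition point `beta = 1.0` and the function names are assumptions of this sketch and may differ from the form used in the application.

```python
def smooth_l1(z, beta=1.0):
    """Standard smooth L1: quadratic near zero, linear for large |z|."""
    a = abs(z)
    if a < beta:
        return 0.5 * z * z / beta
    return a - 0.5 * beta


def qs_loss(gt, y):
    """Loss for one frame: gt is the true QS score, y the prediction in [0, 1]."""
    z = gt - 10.0 * y  # residual as defined in the text
    return smooth_l1(z)
```

The quadratic region keeps gradients small for nearly correct predictions, while the linear region limits the influence of frames whose predicted score is far off.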

[0116] When the quality of the real-time image frame meets the preset aggregation recognition conditions, historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions are obtained.

[0117] In some optional embodiments, the historical feature maps of different scales corresponding to the multiple historical image frames that meet the preset quality conditions are pre-stored in a preset memory bank, the storage capacity of which is the target number of frames.

[0118] In this way, when performing real-time railway perimeter intrusion identification, it is only necessary to obtain historical feature maps of different scales corresponding to the target frame number of historical image frames from the memory bank, which can greatly reduce the amount of computation, improve the detection speed, and ensure the real-time performance of the detection.

[0119] In other words, the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in this application improves fault tolerance by aggregating multiple frames. To this end, a dictionary, MemoryBank, is set up to store feature maps of different scales from previous frames for use in the next feature aggregation operation. For example, the capacity of the MemoryBank is set to 5 frames, and it is continuously updated to maintain a high correlation between the feature maps in the array and the currently detected image.

[0120] In other words, the multi-frame historical image that meets the preset quality conditions refers to a preset number of historical image frames that meet the preset quality conditions.

[0121] In some embodiments, the multi-frame historical image that meets the preset quality conditions is a historical image frame with a quality score greater than a first preset threshold.

[0122] The step of obtaining historical feature maps at different scales corresponding to multiple historical image frames that meet preset quality conditions includes:

[0123] The latest first historical image frame is scored based on the trained image scoring module to determine the quality score of the latest first historical image frame.

[0124] When the quality score of the historical image frame is greater than a first preset threshold of the memory bank, the historical feature map of the latest first historical image frame is updated in the memory bank; wherein, the storage capacity of the memory bank is the target number of frames, and when the preset storage capacity of the memory bank is full, the historical feature map of the earliest stored historical image frame is popped out, and the historical feature map of the latest first historical image frame is stored; the historical feature maps in the memory bank include historical feature maps of different scales;

[0125] When the quality score of the historical image frame is less than or equal to the first preset threshold, the preset memory is not updated;

[0126] Obtain historical feature maps at different scales corresponding to historical image frames of the target frame number in the memory bank.

[0127] It should be noted that after the trained image scoring module scores real-time image frames, the scores obtained can also improve the effect of feature aggregation. The feature maps of images with higher scores have clearer information. When these high-quality feature maps are updated in the memory, better aggregation results can be obtained when feature aggregation is performed, thereby enhancing the detection effect.

[0128] In other words, scoring the latest first historical image frame based on the trained image scoring module and scoring the latest first historical image frame as a real-time image frame are the same process.

[0129] Similarly, the historical feature maps at different scales corresponding to historical image frames are also real-time feature maps at different scales extracted when they are analyzed as real-time image frames.

[0130] Image quality score is related to the quality of image feature maps. Therefore, this metric can be combined with the update strategy. When the image quality score is high, it is selected to be updated into the memory bank. When the image quality score is low, it means that the quality of the image feature maps is poor, so the memory bank is not updated. This allows for the pre-selection of multiple historical image frames with preset quality conditions.

[0131] The following provides a detailed example of how to update the memory bank.

[0132] Please refer to Figure 5, which shows a flowchart of an example of the memory bank update strategy described in an embodiment of this application. As shown in Figure 5, there are feature maps of 7 images, and the memory bank can store 5 feature maps. A first preset threshold t is set to determine which images can be updated into the memory bank, and then detection begins. Images 1 through 5 have scores higher than the first preset threshold t, so they are updated into the memory bank one by one, at which point the memory bank is full. The sixth image, after being input into the image scoring module, receives a score of 0.32, which is lower than t, so it is not updated into the memory bank. The seventh image receives a score of 0.85, which is higher than t, so it is updated into the memory bank. Because the memory bank is already full, the earliest feature map, i.e., the feature map of the first image, is popped from the memory bank.
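The seven-image example can be reproduced with a fixed-capacity first-in, first-out queue. This is a minimal sketch: frame numbers stand in for the stored multi-scale feature maps, the threshold t = 0.5 and the scores of images 1 to 5 are hypothetical (the text only states that they exceed t), and the helper names are illustrative.

```python
from collections import deque


def update_memory_bank(bank, frame_id, score, t):
    """Store frame_id if its quality score exceeds t; return True if stored.

    bank is a deque created with maxlen equal to the bank capacity, so
    appending to a full bank discards the earliest entry automatically.
    """
    if score > t:
        bank.append(frame_id)
        return True
    return False


def run_example():
    bank = deque(maxlen=5)  # capacity of 5 feature maps, as in the text
    t = 0.5                 # hypothetical first preset threshold
    # Scores of images 1-5 are hypothetical; 0.32 and 0.85 come from the text.
    scores = [0.70, 0.66, 0.91, 0.58, 0.77, 0.32, 0.85]
    for frame_id, score in enumerate(scores, start=1):
        update_memory_bank(bank, frame_id, score, t)
    return list(bank)
```

After the seventh image, the bank holds images 2, 3, 4, 5, and 7: image 6 was rejected and image 1 was popped to make room.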

[0133] The memory bank update strategy described in this application effectively ensures the quality of the feature maps in the memory bank, and also ensures that the feature maps in the memory bank can be updated over time, thus guaranteeing the high similarity between the feature maps in the memory bank and the feature maps of the detected images.

[0134] In some embodiments, the high-confidence railway perimeter intrusion identification method based on multi-frame visual feature aggregation also provides another memory update strategy. When the quality score of the historical image frame is less than or equal to a first preset threshold, and the preset memory is not updated, the method further includes:

[0135] Record the number and quality score of consecutive historical image frames whose quality score is less than or equal to a first preset threshold.

[0136] When the number of consecutive historical image frames reaches a preset threshold, the comparison threshold of the memory bank is determined as a second preset threshold based on the quality score of the consecutive historical image frames.

[0137] The memory bank is updated based on the second preset threshold until the latest second historical image frame is greater than the first preset threshold, at which point the comparison threshold of the memory bank is restored to the first preset threshold.

[0138] The following is a detailed example of another method for updating the memory bank;

[0139] Please refer to Figure 6, which shows a flowchart of another example of a memory bank update strategy described in this application embodiment. This strategy improves on the previous one to ensure generalization. First, the first five frames of the initial test are all updated into the memory bank regardless of their scores, to ensure that the memory bank holds enough feature maps. Then the improved memory bank update process begins. Compared with the previously introduced strategy, three variables are added in addition to the normal first preset threshold t: con_dif_frames, temp_threshold, and a threshold array. con_dif_frames records the number of consecutive images below the threshold, temp_threshold is a temporary threshold, and the threshold array records the scores of the images counted by con_dif_frames.

[0140] When a score is less than the first preset threshold t, con_dif_frames = con_dif_frames + 1, and the threshold array records the score. When the value of con_dif_frames exceeds the set value, this indicates that the scene has changed and that it would be unreasonable to keep the memory bank unchanged. At this time, the average score of the threshold array is assigned to temp_threshold as a temporary threshold, and con_dif_frames starts counting again. Subsequent updates are compared against temp_threshold until the score of a frame exceeds the normal threshold t, at which point the comparison threshold is restored to t.

[0141] For details, please refer to Figure 6. First, the array contains five stored feature maps. The set value for con_dif_frames, the variable that records the number of consecutive images below the threshold, is 5, and the normal score threshold t is set to 0.5. The subsequent five images, after passing through the quality detection module, yield scores of 0.23, 0.19, 0.20, 0.22, and 0.20 respectively. None of these five images meets the threshold, so con_dif_frames is incremented by 1 five times, reaching a value of 5, and the threshold array records the quality scores of these five images. The sixth image yields a quality score of 0.21. This score does not meet the normal requirement for updating into the memory bank, but con_dif_frames is now 6, which is greater than 5. The average score of the threshold array is therefore assigned to temp_threshold as a temporary threshold, that is, temp_threshold is 0.208. The feature map of the sixth image can thus be updated into the memory bank, and con_dif_frames starts counting again. Subsequent updates are compared against temp_threshold until the score of a frame exceeds the normal score threshold, at which point the comparison threshold is restored from the second preset threshold to the first preset threshold t.
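The adaptive strategy can be sketched as a small class built around the variables named in the text (con_dif_frames, temp_threshold, and the threshold array). Several details are assumptions of this sketch: the class and parameter names are illustrative, frame numbers stand in for feature maps, and the temporary threshold is averaged over the scores recorded before the triggering frame, which reproduces the value 0.208 from the example.

```python
from collections import deque


class AdaptiveMemoryBank:
    def __init__(self, capacity=5, t=0.5, max_low_frames=5):
        self.bank = deque(maxlen=capacity)    # oldest entry is dropped when full
        self.t = t                            # first preset threshold
        self.max_low_frames = max_low_frames  # set value for con_dif_frames (>= 1)
        self.seen = 0                         # frames processed so far
        self.con_dif_frames = 0               # consecutive frames scoring <= t
        self.threshold = []                   # scores of those frames
        self.temp_threshold = None            # second (temporary) threshold

    def update(self, frame_id, score):
        """Process one scored frame; return True if it was stored."""
        self.seen += 1
        if self.seen <= self.bank.maxlen:
            # Warm-up: the first frames are stored regardless of score.
            self.bank.append(frame_id)
            return True
        if score > self.t:
            # Good frame: restore the normal threshold and reset the counters.
            self.con_dif_frames = 0
            self.threshold.clear()
            self.temp_threshold = None
            self.bank.append(frame_id)
            return True
        self.con_dif_frames += 1
        if self.con_dif_frames > self.max_low_frames:
            # Too many consecutive low scores: assume the scene has changed.
            self.temp_threshold = sum(self.threshold) / len(self.threshold)
            self.con_dif_frames = 0
            self.threshold.clear()
        else:
            self.threshold.append(score)
        if self.temp_threshold is not None and score > self.temp_threshold:
            self.bank.append(frame_id)
            return True
        return False
```

Feeding five warm-up frames followed by scores 0.23, 0.19, 0.20, 0.22, 0.20, and 0.21 rejects the first five low-scoring frames, sets temp_threshold to about 0.208 on the sixth, and stores that sixth frame.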

[0142] In step S103, the multi-frame historical feature maps and the real-time feature maps at each scale are aligned to obtain multi-frame historical feature maps aligned at each scale.

[0143] In step S104, the real-time feature map of each scale and the aligned multi-frame historical feature map are aggregated to obtain the aggregated feature map of each scale.

[0144] Step S103 achieves feature alignment, and step S104 achieves feature aggregation. This application embodiment designs an MFFA module for feature alignment and feature aggregation, thereby realizing the design of the entire multi-frame feature aggregation network (MFFA network). As shown in Figure 2, in addition to the core multi-frame feature aggregation algorithm, the network also includes a multi-scale feature fusion module and an SE attention mechanism, which can improve the detection accuracy.

[0145] The MFFA module in this application embodiment implements inter-frame feature alignment. Specifically, the step of aligning the multi-frame historical feature maps and the real-time feature maps at each scale to obtain multi-frame historical feature maps aligned at each scale includes:

[0146] For each scale, the historical feature map and the real-time feature map are concatenated to obtain the concatenated feature map.

[0147] The concatenated feature map is fed into a deformable convolutional block to determine the offset between the historical feature map and the real-time feature map at this scale.

[0148] The historical feature map is processed based on the offset to obtain a historical feature map at that scale that is aligned with the real-time feature map.

[0149] Please refer to Figure 7, which shows a flowchart of the inter-frame feature alignment method described in this application embodiment. Specifically, deformable convolution is used to learn the offset between frames to achieve feature alignment. Unlike ordinary deformable convolution, which operates on a single feature map, the deformable convolution used for multi-frame feature alignment operates on two frames, as shown in Figure 7. First, for the current image frame I_t and another frame I_s, a convolutional network extracts the feature map f_t of I_t and the feature map f_s of I_s, and the two feature maps f_t and f_s are concatenated to obtain the concatenated feature f_cat. Then f_cat is fed into a deformable convolutional block. The operation here is similar to ordinary deformable convolution, except that an additional convolutional kernel is first defined to learn, from f_cat, the offset m of f_s relative to I_t; f_s is then interpolated based on this offset, and a normal convolution yields the output. Multiple deformable convolution blocks can be set here, resulting in a more accurate final offset.

[0150] Assuming there are multiple deformable convolutional blocks, a final offset m0 is obtained after the operations of these blocks. The offset m0 and the feature map f_s are then used to perform deformable convolution: f_s is first interpolated based on m0 and then passed through a normal convolution, yielding the feature map f_{t+s} that is aligned with f_t. The next step, feature aggregation, can then be performed.
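The interpolation step alone can be illustrated in one dimension. The sketch below resamples a historical feature vector at positions shifted by a per-position offset, using linear interpolation and clamping at the borders. It is a deliberately simplified assumption: in the described method the offsets are learned by a convolutional kernel from f_cat and the features are two-dimensional, neither of which is modelled here, and the name `warp_1d` is illustrative.

```python
def warp_1d(feature, offsets):
    """Resample `feature` at position i + offsets[i] with linear interpolation.

    Sampling positions outside the feature are clamped to the nearest border.
    """
    n = len(feature)
    out = []
    for i, m in enumerate(offsets):
        pos = min(max(i + m, 0.0), n - 1)  # clamp the sampling position
        lo = int(pos)                      # floor, since pos >= 0
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * feature[lo] + frac * feature[hi])
    return out
```

For instance, shifting `[0, 10, 20, 30]` by a constant offset of 1 moves each value one position to the left and repeats the border value at the end.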

[0151] In some optional embodiments, the aggregation of the real-time feature map at each scale and the aligned multi-frame historical feature map to obtain the aggregated feature map at each scale includes:

[0152] For the real-time feature map and the aligned multi-frame historical feature maps at each scale, the multi-frame historical feature maps are aggregated based on the first weight of each historical feature map at that scale to obtain the aggregated historical feature map at that scale; wherein the closer in time a historical feature map is to the real-time feature map, the greater its first weight;

[0153] The difference between the aggregated historical feature map and the real-time feature map, the difference between the real-time feature map and the aggregated historical feature map, the aggregated historical feature map, and the real-time feature map are concatenated to obtain the concatenated feature map at this scale.

[0154] The concatenated feature map at this scale is processed to determine the second weight of each pixel in it; the second weight represents the contribution of each pixel to the aggregated feature map at this scale.

[0155] The concatenated feature map is processed based on the second weight of each pixel in it to obtain the aggregated feature map at this scale.

[0156] Please refer to Figure 8, which shows a flowchart of the multi-frame feature aggregation method according to an embodiment of this application. As shown in Figure 8, the process of multi-frame feature aggregation is as follows. The real-time feature map of the current real-time image frame is f_t, and the multi-frame historical feature maps obtained from the memory bank are f_s1, f_s2, f_s3, f_s4, and f_s5. The five historical feature maps are first aligned with the real-time feature map f_t to obtain five aligned historical feature maps f_{t+s1}, f_{t+s2}, f_{t+s3}, f_{t+s4}, and f_{t+s5}. Because they come from different times, the five historical feature maps differ in their relevance to the feature map of the current frame: a feature map closer in time to the current frame should in theory be more relevant to it. Therefore, when the feature maps are aggregated, the influence weight of each feature map should be distinguished.

[0157] For example, since the differences in relevance are not particularly large over such a short period of time, the weights are set to 0.5, 0.6, 0.7, 0.8, and 0.9 respectively according to the time distance.

[0158] First, the five historical feature maps in the array are weighted and averaged using weighting factors to obtain an aggregated feature map of the five images, as shown in formula (7).

[0159]

[0160] Here, f_{t+sj} denotes the j-th aligned historical feature map, and w_j denotes the first weight of f_{t+sj}.
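Formula (7) is not reproduced in this text. A weighted average consistent with the description can be sketched as follows, using the example weights 0.5 to 0.9 ordered from the oldest to the most recent frame. Normalizing by the sum of the weights is an assumption of this sketch, as are the function name and the representation of each feature map as a flat list of values.

```python
def aggregate_history(aligned_maps, weights):
    """Weighted average of aligned historical feature maps (flat lists).

    Assumption: the weighted sum is normalized by the sum of the weights.
    """
    total = sum(weights)
    size = len(aligned_maps[0])
    return [
        sum(w * fmap[k] for w, fmap in zip(weights, aligned_maps)) / total
        for k in range(size)
    ]


# Example weights from the text, oldest frame first.
FIRST_WEIGHTS = [0.5, 0.6, 0.7, 0.8, 0.9]
```

With this normalization, five identical feature maps aggregate to the same map, and more recent frames pull the result toward their values.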

[0161] Next, the aggregated feature map f_out obtained above is aggregated with the real-time feature map f_t of the current frame. This aggregation uses adaptive information fusion; the specific process is shown in Figure 8.

[0162] The difference between the aggregated historical feature map and the real-time feature map at this scale, the difference between the real-time feature map and the aggregated historical feature map, the aggregated historical feature map, and the real-time feature map are concatenated to obtain the concatenated feature map at this scale. Specifically, to make full use of the inter-frame temporal information, f_out - f_t, f_t - f_out, f_t, and f_out are concatenated to obtain f_multi.

[0163] The concatenated feature map at this scale is then processed to determine the second weight of each pixel in it. Specifically, two convolutional layers compress the number of channels of f_multi and fully fuse the information. To improve the model's generalization ability, the softmax function is used to generate the final adaptive weights w(f_t, f_out). This is essentially an attention mechanism: the generated adaptive weights represent the contribution of each point to the fused image. The adaptive weights w(f_t, f_out) are the second weights of the pixels in the concatenated feature map at this scale.

[0164] The concatenated feature map is processed based on the second weight of each pixel in it to obtain the aggregated feature map at this scale, thereby realizing the inter-frame fusion of multiple features at this scale.
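The adaptive fusion can be illustrated with a simplified per-pixel sketch. The construction of f_multi follows the text; everything after it is an assumption made for illustration: the two convolutional layers are replaced by a caller-supplied `score_fn` that returns two logits per pixel, and the fused value is taken as the softmax-weighted combination of f_t and f_out. The text does not spell out this final step, so the application's actual processing may differ.

```python
import math


def concat_multi(f_t, f_out):
    """Per-pixel 4-vector [f_out - f_t, f_t - f_out, f_t, f_out] (f_multi)."""
    return [[o - t, t - o, t, o] for t, o in zip(f_t, f_out)]


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def adaptive_fuse(f_t, f_out, score_fn):
    """Fuse two flat feature maps using per-pixel softmax weights.

    score_fn is a hypothetical stand-in for the two convolutional layers:
    it maps the 4-vector of one pixel to two logits (for f_t and f_out).
    """
    fused = []
    for t, o, feats in zip(f_t, f_out, concat_multi(f_t, f_out)):
        w_t, w_out = softmax(score_fn(feats))
        fused.append(w_t * t + w_out * o)
    return fused
```

When `score_fn` returns equal logits, both sources contribute equally and the result is the plain average of f_t and f_out; a learned scoring function would instead shift weight toward the more reliable source at each pixel.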

[0165] In step S105, the aggregated feature maps of each scale are fused and railway image fusion features of different scales are output. Based on the railway image fusion features of different scales, the identification result of railway perimeter intrusion is determined.

[0166] The process of fusing the aggregated feature maps at each scale and outputting fused railway image features at different scales includes:

[0167] The third weight of each channel of the aggregated feature map at each scale is determined, and the aggregated feature maps at different scales are fused based on the weight of each channel to output the fused features of railway images at different scales.

[0168] The third weight represents the contribution of the channel.

[0169] The process of determining the third weight of each channel of the aggregated feature map at each scale, and fusing aggregated feature maps at different scales based on the weights of each channel to output railway image fusion features at different scales includes:

[0170] The aggregated feature maps at each scale are fused using a PAN structure, and a third weight is applied to the channels of the aggregated feature maps at each scale through a channel attention module pre-configured in the PAN structure; wherein the third weight represents the contribution of the channel.

[0171] The PAN structure extracts and outputs railway image fusion features at different scales from convolutional layers of different scales.

[0172] Please refer to Figure 2. In this embodiment, multi-scale feature fusion uses a PAN structure to fuse the feature maps output from each stage of the backbone network. A channel attention module is added to apply a weight to each channel of the feature map; the magnitude of the weight represents the contribution of the channel and thus measures the importance of different channels. The channel attention module works as follows: first, a weight is assigned to each channel of the feature map; then, the feature map is multiplied by the weight obtained for each channel to obtain the enhanced information, which is added to the original feature map to obtain the enhanced feature map.
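The multiply-then-add behaviour of the channel attention module can be sketched as below. The residual step in `apply_channel_attention` follows the description directly. The way the weights are produced in `channel_weights` (a sigmoid of each channel's mean, with no fully connected layers) is a simplifying assumption of this sketch, and both function names are illustrative.

```python
import math


def channel_weights(feature_map):
    """One weight per channel: sigmoid of the channel's mean activation.

    Simplification: a full SE block would pass the pooled values through
    fully connected layers before the sigmoid.
    """
    weights = []
    for channel in feature_map:
        mean = sum(channel) / len(channel)
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    return weights


def apply_channel_attention(feature_map, weights):
    """Scale each channel by its weight and add the result to the original."""
    return [
        [x + w * x for x in channel]
        for channel, w in zip(feature_map, weights)
    ]
```

Here a feature map is a list of channels, each a flat list of values; a channel with weight near 1 is roughly doubled, while a channel with weight near 0 passes through almost unchanged.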

[0173] The identification result of railway perimeter intrusion is determined based on the railway image fusion features at different scales. Specifically, the identification result is determined based on the foreign object features at different scales to identify whether there are foreign objects within the railway perimeter. If there are, the identification result is determined to be abnormal; otherwise, the identification result is determined to be normal.

[0174] Please refer to Figure 9, which shows the overall flowchart of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in this application, as applied to a railway foreign object intrusion identification system. Applying the algorithm in the railway foreign object intrusion monitoring system involves two stages. First, the model is trained using a previously constructed railway foreign object intrusion video dataset and railway foreign object intrusion image dataset to obtain a trained weight file. Then, the weight file is applied in the testing stage to process real-time railway videos and monitor railway safety. Specifically, in the testing stage (or application stage), for a newly acquired image frame, the image scoring module first determines its quality score to decide whether the frame meets the preset aggregation identification conditions, that is, whether to perform an aggregation operation. If so, the MFFA module is called to acquire multiple historical image frames from the memory bank for aggregation, resulting in an aggregated image frame. This aggregated image frame is then input into the detection network, where a foreign object determination algorithm determines whether the boundary has been intruded. If aggregation is not performed, the new image frame is updated into the memory bank and input directly into the detection network, where the foreign object determination algorithm determines whether the boundary has been intruded. If the detection result indicates boundary intrusion, a notification is issued.

[0175] In some embodiments, a corresponding railway perimeter intrusion identification system is deployed based on the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation and multi-frame feature fusion. Because the method can improve detection accuracy and speed, a fast and accurate railway foreign object intrusion monitoring system is realized.

[0176] Please refer to Figure 10, which shows a schematic block diagram of a railway perimeter intrusion identification system according to an embodiment of this application. The system includes field devices, communication devices, and a server; the field devices include field control devices, monitoring devices, and alerting devices;

[0177] The monitoring device is connected to the server via a communication device; the server is connected to the field control device, and the field control device is connected to the notification device.

[0178] The monitoring equipment is used to capture railway monitoring videos and send the captured railway monitoring videos to the server;

[0179] The server is used to receive railway monitoring videos captured by monitoring equipment, and execute the steps of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation described in this application embodiment to determine the identification result of railway perimeter intrusion; when the identification result is abnormal, it sends a prompt signal to the field control equipment.

[0180] The field control device is used to receive the prompt signal and control the prompt device to operate based on the prompt signal.

[0181] In some embodiments, the railway perimeter intrusion identification system further includes terminal equipment in the control system.

[0182] The terminal device is connected to the server to receive target information sent by the server.

[0183] As shown in Figure 10, the identification system includes railway field equipment, a server, and communication equipment, including a router, that connects the two.

[0184] In terms of railway field equipment, this mainly includes monitoring equipment and some field control equipment. Monitoring equipment consists of surveillance cameras at various monitoring nodes in the railway, used to collect real-time video information from the field. Field control equipment includes relays and data processing modules. Relays are used to control the power supply to other field equipment, and data processing modules receive and process video sensor information to prepare for information transmission.

[0185] Communication equipment, including routers, is responsible for transmitting the collected data to the monitoring center or relevant personnel. It typically uses wireless communication technologies, such as Wi-Fi, cellular networks, or dedicated wireless communication protocols.

[0186] The backend server is the core of the system, responsible for processing and analyzing video data. It uses the multi-frame visual feature aggregation-based high-reliability railway perimeter intrusion identification method described in this application embodiment to perform real-time judgment on the video data to detect intrusion behavior or other anomalies. Once an anomaly is detected, a prompt message is sent to the on-site control equipment, which then controls the prompting device to emit audible and visual alerts to remind relevant personnel.

[0187] In some embodiments, the system can also simultaneously send abnormal situations to terminal devices such as computers and mobile phones, displaying them on the screen for further manual handling. The hardware components of the entire system work together to form a complete railway perimeter intrusion detection system, which can effectively improve railway safety and respond promptly to potential security risks.

[0188] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and devices described above can be referred to the corresponding processes in the method embodiments, and will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple modules or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the displayed or discussed mutual coupling or direct coupling or communication connection can be through some communication interfaces; the indirect coupling or communication connection of devices or modules can be electrical, mechanical, or other forms.

[0189] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0190] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0191] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part of it that contributes over the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a platform server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, and optical disks.

[0192] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A high-reliability method for railway perimeter intrusion identification based on multi-frame visual feature aggregation, characterized in that the method includes: acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, and extracting real-time feature maps of different scales from the real-time image frames; when the quality of the real-time image frame meets the preset aggregation and recognition conditions, obtaining historical feature maps of different scales corresponding to multiple historical image frames that meet the preset quality conditions; aligning the multi-frame historical feature maps and the real-time feature map at each scale to obtain multi-frame aligned historical feature maps at each scale; aggregating the real-time feature map and the aligned multi-frame historical feature maps at each scale to obtain the aggregated feature map at each scale; fusing the aggregated feature maps at each scale and outputting railway image fusion features at different scales; and determining the identification result of railway perimeter intrusion based on the railway image fusion features at different scales.

2. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 1, characterized in that, after acquiring real-time image frames from railway monitoring videos captured by monitoring equipment, the method further includes: when the quality of the real-time image frame does not meet the preset aggregation and recognition conditions, determining the identification result of railway perimeter intrusion based on the real-time image frame.
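The branch between claim 1 (multi-frame aggregation) and claim 2 (single-frame recognition) can be illustrated with a minimal sketch. Everything named here is hypothetical: `identify_frame`, `mean_aggregate`, and `threshold_detect` are illustrative stand-ins, the feature maps are reduced to flat lists of numbers, and the condition is passed in as a predicate because the claims do not state what form the preset aggregation and recognition conditions take. Falling back to the single-frame path when no history is available is also an assumption.

```python
from typing import Callable, List, Sequence

# Toy stand-in: one flat list of floats per scale.
MultiScaleMaps = List[List[float]]


def identify_frame(
    quality: float,
    realtime_maps: MultiScaleMaps,
    history: Sequence[MultiScaleMaps],
    meets_condition: Callable[[float], bool],
    aggregate: Callable[[MultiScaleMaps, Sequence[MultiScaleMaps]], MultiScaleMaps],
    detect: Callable[[MultiScaleMaps], str],
) -> str:
    """Choose between multi-frame aggregation and single-frame recognition."""
    if meets_condition(quality) and history:
        # Claim 1 path: combine the real-time maps with history, then detect.
        return detect(aggregate(realtime_maps, history))
    # Claim 2 path: recognise from the real-time frame alone.
    return detect(realtime_maps)


def mean_aggregate(current: MultiScaleMaps,
                   history: Sequence[MultiScaleMaps]) -> MultiScaleMaps:
    """Hypothetical aggregator: element-wise mean over current and history."""
    frames = [current, *history]
    return [
        [sum(f[s][i] for f in frames) / len(frames) for i in range(len(current[s]))]
        for s in range(len(current))
    ]


def threshold_detect(maps: MultiScaleMaps) -> str:
    """Hypothetical detector: reports an intrusion if any activation exceeds 0.5."""
    return "intrusion" if any(v > 0.5 for scale in maps for v in scale) else "clear"
```

In this toy setting, a weak real-time response of 0.4 is reported as clear on its own, but as an intrusion once it is averaged with stronger historical responses of 0.8 and 0.9.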

3. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 1, characterized in that the step of obtaining historical feature maps of different scales corresponding to multiple historical image frames that meet preset quality conditions includes: scoring the latest first historical image frame with the trained image scoring module to determine the quality score of the latest first historical image frame; when the quality score of the latest first historical image frame is greater than a first preset threshold of the memory bank, storing the historical feature map of the latest first historical image frame in the memory bank, wherein the storage capacity of the memory bank is the target number of frames, and when the memory bank is full, the historical feature map of the earliest stored historical image frame is popped out and the historical feature map of the latest first historical image frame is stored, and the historical feature maps in the memory bank include historical feature maps of different scales; when the quality score of the latest first historical image frame is less than or equal to the first preset threshold, not updating the memory bank; and obtaining the historical feature maps of different scales corresponding to the target number of historical image frames in the memory bank.
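The fixed-capacity, first-in-first-out memory bank of claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the class name `MemoryBank`, its method names, and the representation of multi-scale feature maps as a dictionary of flat lists are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical representation: scale name -> flattened feature map.
ScaleMaps = Dict[str, List[float]]


class MemoryBank:
    """Keeps the feature maps of the most recent frames that passed the quality gate."""

    def __init__(self, target_frames: int, first_threshold: float) -> None:
        self.first_threshold = first_threshold
        # A deque with maxlen discards its oldest entry automatically when full.
        self._frames: Deque[ScaleMaps] = deque(maxlen=target_frames)

    def offer(self, quality_score: float, maps: ScaleMaps) -> bool:
        """Store the maps only if the score exceeds the threshold; report whether stored."""
        if quality_score > self.first_threshold:
            self._frames.append(maps)
            return True
        return False  # score <= threshold: the bank is left unchanged

    def history(self) -> List[ScaleMaps]:
        """Return the stored maps, oldest first."""
        return list(self._frames)
```

With a capacity of two and a threshold of 0.5, offering scores 0.9, 0.5, 0.7, and 0.8 in turn stores the first frame, rejects the second (equal to the threshold), and ends with only the last two accepted frames in the bank.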

4. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 3, characterized in that, after the memory bank is not updated when the quality score of the historical image frame is less than or equal to the first preset threshold, the method further includes: recording the number and the quality scores of consecutive historical image frames whose quality scores are less than or equal to the first preset threshold; when the number of consecutive historical image frames reaches a preset threshold, determining a second preset threshold based on the quality scores of the consecutive historical image frames and using it as the comparison threshold of the memory bank; and updating the memory bank based on the second preset threshold until the quality score of the latest second historical image frame is greater than the first preset threshold, at which point the comparison threshold of the memory bank is restored to the first preset threshold.
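The threshold relaxation in claim 4 can be sketched as a small state machine. The claim does not say how the second preset threshold is derived from the recorded quality scores, so this sketch assumes the mean of the consecutive low scores; the class name `AdaptiveThreshold` and the parameter `patience` (the preset number of consecutive low-quality frames) are likewise hypothetical.

```python
from typing import List


class AdaptiveThreshold:
    """Decides whether a frame may update the memory bank."""

    def __init__(self, first_threshold: float, patience: int) -> None:
        self.first_threshold = first_threshold
        self.patience = patience        # preset number of consecutive low frames
        self.current = first_threshold  # comparison threshold currently in force
        self.relaxed = False
        self._low_scores: List[float] = []

    def admit(self, score: float) -> bool:
        """Return True if the frame should be written to the memory bank."""
        if score > self.first_threshold:
            # Quality has recovered: restore the first threshold and reset the streak.
            self.current = self.first_threshold
            self.relaxed = False
            self._low_scores.clear()
            return True
        # Low-quality frame: it can only pass while the threshold is relaxed.
        admitted = self.relaxed and score > self.current
        self._low_scores.append(score)
        if not self.relaxed and len(self._low_scores) >= self.patience:
            # Assumption: second threshold = mean of the recorded low scores.
            self.current = sum(self._low_scores) / len(self._low_scores)
            self.relaxed = True
        return admitted
```

With a first threshold of 0.6 and a patience of 3, the scores 0.4, 0.5, 0.3 are all rejected and relax the threshold to about 0.4. A following 0.45 is then admitted, 0.35 is rejected, and 0.7 restores the first threshold.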

5. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 3, characterized in that the image scoring module is trained as follows: constructing an image scoring training set for the image scoring module, wherein the image scoring training set includes single-frame sample images from railway monitoring videos captured by monitoring equipment, and each single-frame sample image is annotated with the ground-truth bounding boxes of the target objects; identifying the target objects in the single-frame sample images of the image scoring training set with a single-frame image target detection model to determine the detected bounding boxes of the target objects; determining the score label of each single-frame sample image based on the comparison between its ground-truth bounding boxes and its detected bounding boxes, and updating the image scoring training set; and training the image scoring module on the updated image scoring training set to obtain the trained image scoring module.
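One plausible way to turn the comparison of ground-truth and detected bounding boxes into a score label is intersection over union (IoU). The claim does not name a metric, so IoU, the function names `iou` and `score_label`, the `(x1, y1, x2, y2)` box format, and the handling of images without any target are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_label(truth: List[Box], detected: List[Box]) -> float:
    """Mean best-match IoU over ground-truth boxes; a missed object contributes 0."""
    if not truth:
        # Assumption: an empty scene scores 1 only if nothing was detected.
        return 1.0 if not detected else 0.0
    best = [max((iou(t, d) for d in detected), default=0.0) for t in truth]
    return sum(best) / len(truth)
```

Under this sketch, a frame in which the single-frame detector localises every target well receives a label near 1, while a frame with missed or badly localised targets receives a label near 0.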

6. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 1, characterized in that the step of aligning the multi-frame historical feature maps and the real-time feature map at each scale to obtain multi-frame aligned historical feature maps at each scale includes: for each scale, concatenating the historical feature map and the real-time feature map to obtain a concatenated feature map; feeding the concatenated feature map into a deformable convolution block to determine the offset between the historical feature map and the real-time feature map at that scale; and processing the historical feature map based on the offset to obtain a historical feature map at that scale that is aligned with the real-time feature map.
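The last step of claim 6, processing the historical feature map with the predicted offsets, can be illustrated with a deliberately simplified sketch. It is not a deformable convolution: the offsets are supplied by hand instead of being predicted by a learned block, and the map is a single-channel grid that is resampled by bilinear interpolation. The names `bilinear` and `warp` and the `(dy, dx)` offset convention are assumptions.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # single-channel feature map, rows x columns


def bilinear(feature: Grid, y: float, x: float) -> float:
    """Sample the map at a fractional position; positions outside the map count as 0."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feature[yy][xx]
    return value


def warp(history: Grid, offsets: List[List[Tuple[float, float]]]) -> Grid:
    """Resample the historical map at (row + dy, col + dx) for every output position."""
    return [
        [bilinear(history, r + dy, c + dx) for c, (dy, dx) in enumerate(row)]
        for r, row in enumerate(offsets)
    ]
```

If an object response sits at column 1 in the historical map and at column 2 in the real-time map, a uniform offset of `(0.0, -1.0)` moves the historical response to column 2, so that the two maps line up before aggregation.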

7. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 1, characterized in that the step of aggregating the real-time feature map and the aligned multi-frame historical feature maps at each scale to obtain the aggregated feature map at each scale includes: for the real-time feature map and the aligned multi-frame historical feature maps at each scale, aggregating the multi-frame historical feature maps based on the first weight of each historical feature map at that scale to obtain the aggregated historical feature map at that scale, wherein the shorter the temporal distance of a historical feature map, the greater its first weight; concatenating the difference between the aggregated historical feature map and the real-time feature map, the difference between the real-time feature map and the aggregated historical feature map, the aggregated historical feature map, and the real-time feature map to obtain the concatenated feature map at that scale; processing the concatenated feature map at that scale to determine the second weight of each pixel in it, wherein the second weight represents the contribution of that pixel to the aggregated feature map at that scale; and processing the concatenated feature map based on the second weight of each pixel to obtain the aggregated feature map at that scale.
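The first two steps of claim 7 can be sketched as follows. The claim only requires that temporally closer historical maps receive larger first weights; the softmax over negative temporal distance used here, the temperature `tau`, the measurement of distance from the real-time frame, and the function names are all assumptions. Feature maps are flattened to plain lists, and the learned per-pixel second weight is left out because the claim does not specify how it is computed.

```python
import math
from typing import List, Sequence

Vector = List[float]  # flattened single-scale feature map


def temporal_weights(distances: Sequence[float], tau: float = 1.0) -> List[float]:
    """Softmax over -distance / tau, so that nearer frames get larger weights."""
    exps = [math.exp(-d / tau) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_history(history: Sequence[Vector], distances: Sequence[float]) -> Vector:
    """Weighted sum of the aligned historical maps using the first weights."""
    weights = temporal_weights(distances)
    return [sum(w * h[i] for w, h in zip(weights, history))
            for i in range(len(history[0]))]


def concat_inputs(realtime: Vector, aggregated: Vector) -> List[Vector]:
    """The four groups named in the claim, in the order they are listed."""
    return [
        [a - r for a, r in zip(aggregated, realtime)],  # aggregated minus real-time
        [r - a for a, r in zip(aggregated, realtime)],  # real-time minus aggregated
        list(aggregated),
        list(realtime),
    ]
```

Keeping both signed differences mirrors the wording of the claim; a downstream block would then map the four groups to one weight per pixel.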

8. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 1 or 7, characterized in that the step of fusing the aggregated feature maps at each scale and outputting railway image fusion features at different scales includes: determining the third weight of each channel of the aggregated feature map at each scale, and fusing the aggregated feature maps at different scales based on the third weight of each channel to output the railway image fusion features at different scales, wherein the third weight represents the contribution of the channel.

9. The high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation according to claim 8, characterized in that the step of determining the third weight of each channel of the aggregated feature map at each scale, and fusing the aggregated feature maps at different scales based on the third weight of each channel to output railway image fusion features at different scales, includes: fusing the aggregated feature maps at each scale using a PAN structure, and applying the third weight to the channels of the aggregated feature map at each scale through a channel attention module pre-configured in the PAN structure, wherein the third weight represents the contribution of the channel; and extracting and outputting, by the PAN structure, the railway image fusion features at different scales from convolutional layers of different scales.
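The channel weighting described in claims 8 and 9 can be illustrated with a stripped-down channel attention sketch. It assumes a squeeze-and-excitation style computation in which each channel is globally average-pooled and the result is passed through a sigmoid. The learned layers that a real attention module would place between those two steps, and the PAN fusion itself, are omitted; the function names are hypothetical.

```python
import math
from typing import List

Channel = List[List[float]]  # one channel of a feature map, rows x columns


def channel_weights(feature: List[Channel]) -> List[float]:
    """Global-average-pool each channel, then squash the mean with a sigmoid."""
    weights = []
    for channel in feature:
        mean = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    return weights


def reweight(feature: List[Channel]) -> List[Channel]:
    """Scale every value of a channel by that channel's weight."""
    weights = channel_weights(feature)
    return [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, feature)
    ]
```

In this simplified form a channel with a stronger average response receives a larger weight, which stands in for the "contribution of the channel" that the third weight is said to represent.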

10. A high-reliability railway perimeter intrusion identification system based on multi-frame visual feature aggregation, characterized in that the system includes field devices, communication devices, and a server; the field devices include a field control device, a monitoring device, and an alerting device; the monitoring device is connected to the server via the communication devices; the server is connected to the field control device, and the field control device is connected to the alerting device; the monitoring device is used to capture railway monitoring videos and send the captured railway monitoring videos to the server; the server is used to receive the railway monitoring videos captured by the monitoring device and execute the steps of the high-reliability railway perimeter intrusion identification method based on multi-frame visual feature aggregation as described in any one of claims 1 to 9 to determine the identification result of railway perimeter intrusion, and to send a prompt signal to the field control device when the identification result indicates an abnormality; and the field control device is used to receive the prompt signal and control the alerting device to operate based on the prompt signal.