Railway over-the-horizon intrusion detection method based on multi-modal retrieval enhancement
A multimodal retrieval-enhanced method for railway beyond-line-of-sight intrusion detection is constructed. Using a RAG vector database and a retrieval-enhanced generative reasoning mechanism, it addresses false detections, missed detections, and poor interpretability in railway beyond-line-of-sight intrusion detection, thereby improving the reliability and adaptability of detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING JIAOTONG UNIV
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-12
AI Technical Summary
Existing railway over-the-horizon intrusion detection technologies frequently result in false positives and false negatives in complex environments. They lack an interpretable output mechanism, and the knowledge is difficult to accumulate and continuously update, leading to insufficient detection stability and adaptability.
A multimodal retrieval-enhanced railway over-the-horizon intrusion detection method is constructed. Historical image/video data is processed through a cloud-based RAG vector database and index metadata. The retrieval-enhanced generation (RAG) inference mechanism is used to obtain Top-K similar historical scene evidence and perform evidence-enhanced inference to output interpretable intrusion detection results.
It improves the reliability and interpretability of detection in beyond-line-of-sight scenarios, reduces false alarm rates, achieves rapid adaptation and continuous update capabilities, and supports long-term stable operation across multiple lines and multiple camera positions.
Figure CN122200532A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of railway intrusion detection technology, and in particular to a railway over-the-horizon intrusion detection method based on multimodal retrieval enhancement.
Background Technology
[0002] Along railway lines (main line, stations, sections, bridge and tunnel entrances, and areas near level crossings, etc.), there are risks such as unauthorized entry by personnel, animal intrusion, falling rocks / floating objects / foreign objects encroaching on the boundary, and construction machinery encroaching on the boundary. This is especially true in scenarios beyond line-of-sight (beyond the train driver's / station staff's line of sight, or in areas hundreds of meters to several kilometers ahead). In the field of railway intrusion / boundary violation detection and intelligent alarm, existing technologies mainly focus on "single-modal / single-model detection," "multi-sensor fusion," "perimeter linkage," and "point cloud intrusion rule judgment."
[0003] In the field of railway beyond-line-of-sight intrusion detection, retrieval-enhanced generative solutions have emerged in recent years. One solution proposes an automatic detection and early warning method for foreign object intrusion on railway tracks based on IoT technology, achieving intrusion identification and early warning through background learning and real-time monitoring of trackside and onboard video surveillance systems. Another solution proposes a machine vision-based method and system for detecting small foreign objects between railway tracks, achieving small foreign object discrimination through track area feature extraction and convolutional neural network modeling. Yet another solution proposes a machine vision-based method and system for detecting small foreign objects on railway track surfaces, determining the foreign object detection area and achieving anomaly identification through continuous frame acquisition, filtering, and feature extraction. Still another solution proposes a method for detecting foreign object intrusion in rail transit, improving the accuracy of small-scale target detection in complex scenarios through improvements such as a lightweight backbone of the YOLOv8 network, feature fusion, and a dynamic detection head.
[0004] Regarding perimeter monitoring and multi-source linkage, some solutions propose a high-speed railway perimeter intrusion risk monitoring method and system based on multi-target recognition. This system achieves comprehensive monitoring and visualization of perimeter intrusion risks by deploying a distributed perimeter monitoring network, recording anomalies, and analyzing the data at a processing center. Other solutions propose a railway perimeter intrusion monitoring method and system based on optical-visual fusion and deep learning. This system utilizes the fusion of sensor fiber optics and video information to achieve intrusion detection and alarm. Still other solutions propose an all-fiber perimeter security monitoring device that can be linked with video to detect and locate intrusions in the protected area and trigger early warnings and video linkage when intrusions are detected. Finally, a solution proposes a railway platform intrusion monitoring system and method based on artificial intelligence. This system improves platform security by detecting and identifying targets in designated areas of the platform and triggering alarm linkages.
[0005] In the area of 3D measurement and intrusion detection, some solutions propose an intelligent track intrusion detector based on 3D laser scanning equipment, which uses 3D scanning to detect intrusions; others propose an intelligent monitoring and alarm system for foreign object intrusion, which uses lidar to scan the track surface and combines neural network classification / dynamic and static recognition to achieve foreign object intrusion alarms; still others propose track intrusion detection methods, devices, and electronic equipment, which determine the current intrusion status by using pre-acquired boundary frame data and laser point cloud measurement data, and can overlay point cloud and image data to determine the type and location of intrusion facilities; and still others propose a railway construction site area intrusion detection method and system based on 3D point clouds, which achieves intrusion alarms in dangerous construction areas through point cloud preprocessing, area division, moving target detection and tracking, and intrusion rule discrimination.
[0006] In the area of knowledge augmentation and generative reasoning, some solutions propose a Retrieval-augmented Generation (RAG) framework that introduces retrieval components to construct augmented prompts and drive the output of the generative model, thereby improving the reliability of the generated results. Other solutions propose pre-training and fine-tuning methods for retrieval-augmented language models, enhancing the model's ability to invoke external knowledge through the collaborative work of the knowledge retrieval unit and the language model. Still others propose a multimodal data retrieval augmentation generation method and system based on RAG technology, constructing a "retrieval-augmentation-generation" processing flow for multimodal data.
[0007] The disadvantages of the existing railway beyond-line-of-sight intrusion detection schemes mentioned above include:
[0008] (1) False alarms and missed alarms for small targets beyond line of sight remain prominent issues. Most solutions still rely on a single detection network or fixed rules, which lack stability and robustness for combined conditions such as "small targets at long distances + strong background interference + night / rain, snow and fog". As a result, false alarms and missed alarms are likely to occur in the field.
[0009] (2) Lack of an interpretable output mechanism of "conclusion-evidence-recommendation". Most solutions output detection boxes / categories or rule trigger results, which makes it difficult to provide verifiable comparative evidence and reasons for confidence. They also lack the ability to automatically generate disposal suggestions and risk classifications, so alarms require manual secondary analysis, which puts heavy pressure on on-duty staff, slows response, and makes auditing and review difficult.
[0010] (3) Knowledge is difficult to accumulate and continuously update, and the cost of scene migration and iteration is high. The handling rules, typical cases, and route characteristics are scattered in regulations and human experience, making it difficult to form a searchable knowledge base with the detection system. When new routes / new locations / new weather conditions appear, existing models often need to be retrained or adjusted, lacking a lightweight iteration mechanism that can be enhanced by adding cases incrementally.
Summary of the Invention
[0011] This invention provides a railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement, thereby improving the detection reliability, interpretability, and continuous adaptability in railway beyond-line-of-sight scenarios.
[0012] To achieve the above objectives, the present invention adopts the following technical solution.
[0013] A railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement includes:
[0014] The cloud-based system constructs a RAG vector database and index metadata based on historical image / video data uploaded from trackside / edge nodes, a list of resources to be added to the database, and cloud-based database construction parameters.
[0015] Obtain real-time triggered Chinese intrusion detection questions / queries, and, based on the index metadata, encode the Chinese intrusion detection questions / queries and perform L2 normalization to obtain query vectors;
[0016] The query vector is used to perform a similarity search in the RAG vector database, and a Top-K similar historical scene evidence image set and a verifiable evidence package are obtained based on the search results;
[0017] Based on the Top-K similar historical scene evidence image set and verifiable evidence package, evidence enhancement reasoning is performed using the railway intrusion detection Prompt and structured templates to obtain railway beyond-line-of-sight intrusion detection results.
[0018] Preferably, the cloud-based construction of the RAG vector database and index metadata based on historical image / video data uploaded by trackside / edge nodes, a list of resources to be added to the database, and cloud-based database construction parameters includes:
[0019] The trackside / edge nodes upload historical scene image / video data and a list of resources to be added to the cloud. The cloud then generates a cloud catalog based on the historical image / video data, the list of resources to be added, and the cloud database creation parameters.
[0020] The CLIP encoder is initialized in the cloud. The CLIP encoder reads historical scene images one by one, opens the historical scene images using PIL, converts the historical scene images to RGB, forms a batch, calls the encoding function, generates image feature vectors, and performs L2 normalization on the image feature vectors. For each historical scene image, the system constructs a vector node, which includes: the normalized image feature vector, the cloud image path, and the minimum available metadata field.
[0021] A RAG vector database is built in the cloud, and all vector nodes are written into the RAG vector database. After the RAG vector database is built, the system writes the index metadata of the RAG vector database under QDRANT_PATH. This index metadata includes: the number of images added to the database, the CLIP model path used for database construction, the original image directory, the persistent directory and name of the vector database.
[0022] Preferably, the step of obtaining the real-time triggered Chinese intrusion detection question / query involves performing L2 normalization on the Chinese intrusion detection question / query based on the index metadata to obtain a query vector, including:
[0023] Obtain real-time triggered Chinese intrusion detection questions / queries, translate the Chinese intrusion detection questions / queries into English, and generate an English search query;
[0024] The index metadata of the RAG vector database is loaded, and the English retrieval query is segmented by the retrieval encoder based on the index metadata. Then, CLIP is called to obtain the text feature vector, and L2 normalization is performed on the text feature vector to encode the online retrieval query into a query vector in the same space as the image vector.
[0025] Preferably, the step of performing a similarity search on the query vector in the RAG vector database, and obtaining a Top-K similar historical scene evidence image set and a verifiable evidence package based on the search results, includes:
[0026] The query vector is encoded, and a retrieval tool is constructed based on the encoded query vector. The data container collection of the retrieval tool is specified. The retrieval tool is used to perform a similarity search on the RAG vector database using the encoded query vector. ImageNode image type nodes are filtered one by one, and the following fields are extracted to form the search results: score: similarity score; image_path / file_name / image_id / captions: image path and metadata. The Top-K similar historical scene evidence image set is obtained based on the search results. The system saves the evidence results and metadata of each search to form a verifiable evidence package.
[0027] The system creates a timestamp directory for each query, with the directory name containing the timestamp and query keywords.
[0028] Preferably, the step of using the railway intrusion detection Prompt and structured template to perform evidence enhancement reasoning based on the Top-K similar historical scene evidence image set and verifiable evidence package to obtain railway beyond-line-of-sight intrusion detection results includes:
[0029] Load the visual language model, select the dtype according to the device, use the corresponding tokenizer, and construct a railway intrusion detection prompt. Using the railway intrusion detection prompt, call `model.chat(...)` through the visual language model to perform question-and-answer reasoning on each similar historical scene evidence image. Use `VLM_QA_TEMPLATE` to output the real-time triggered Chinese intrusion detection question / query corresponding to the railway beyond-line-of-sight intrusion detection result. This result includes: conclusion, evidence, and suggested actions. If an intrusion is detected, trigger an alarm / link to the platform; if no intrusion is detected, continue monitoring.
[0030] As can be seen from the technical solutions provided by the embodiments of the present invention above, the present invention can improve the detection reliability and interpretability of beyond-line-of-sight scenarios and enhance the adaptability and availability of the system in long-term operation with multiple lines and multiple cameras by constructing a multimodal resource library of "intrusion sample library / non-intrusion control library" and adopting the retrieval-enhanced generation (RAG) inference mechanism.
[0031] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and will become apparent from the description or may be learned by practice of the invention.
Attached Figure Description
[0032] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0033] Figure 1 is a schematic diagram of the architecture of a RAG provided in an embodiment of the present invention;
[0034] Figure 2 is a schematic diagram illustrating the implementation principle of a railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement;
[0035] Figure 3 is a flowchart of a railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement provided in an embodiment of the present invention;
[0036] Figure 4 is a schematic diagram illustrating the establishment of a RAG vector database in the cloud, as provided in an embodiment of the present invention;
[0037] Figure 5 is a schematic diagram illustrating how to perform a problem query on a train, as provided in an embodiment of the present invention.
Detailed Implementation
[0038] Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0039] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or couplings. The term “and / or” as used herein includes any and all combinations of one or more of the associated listed items.
[0040] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless defined as herein.
[0041] To facilitate understanding of the embodiments of the present invention, the following will provide further explanation and description with reference to the accompanying drawings and several specific embodiments. These embodiments do not constitute a limitation on the embodiments of the present invention.
[0042] A schematic diagram of the architecture of a RAG provided in this embodiment of the invention is shown in Figure 1. RAG (Retrieval-Augmented Generation) is a knowledge augmentation technology paradigm that combines "information retrieval" with "large model generation." It allows the system to retrieve evidence relevant to the current question from an external resource repository before generating an answer, and then complete reasoning and response under the constraints of that evidence. RAG includes components such as a retrieval unit, a vectorized encoder, a vector database, and a generator (large model). Compared to end-to-end generation methods that rely solely on parameter memory, RAG can significantly alleviate hallucinated output, improve result traceability, and support rapid incremental updates for new knowledge / scenes.
[0043] Based on the RAG architecture shown in Figure 1, the implementation principle of the railway beyond-line-of-sight intrusion detection method based on multimodal retrieval-enhanced generation is shown in Figure 2. The method constructs a multimodal resource library at the center and writes historical images into a vector database after vectorization encoding to form a searchable index. When an alarm is triggered or a person initiates a query, the method first retrieves Top-K similar historical scenes as evidence through text / image semantic vector retrieval and forms a verifiable evidence package. Then the "current query + retrieved evidence" is input into the visual language model, which, under the constraints of a fixed structured template, outputs the intrusion conclusion, key evidence description, risk level, and suggested actions. The results are used for alarm linkage and for review and feedback. By incrementally updating the resource library and vector index, the method achieves "adding samples is enhancement", thereby improving the detection reliability, interpretability, and long-term adaptability in beyond-line-of-sight scenarios.
[0044] The processing flow of the railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement proposed in this embodiment of the invention is shown in Figure 3. The process includes resource repository establishment, evidence retrieval, evidence enhancement response, and closed-loop update, specifically including the following processing steps:
[0045] Step S10: Build an offline image resource library and a RAG vector database in the cloud (cloud-based library building)
[0046] Input data: Historical image / video frames uploaded by trackside / edge nodes and a list of resources to be added to the database (image path, camera position / line / time and other metadata).
[0047] Output data: cloud-based offline image resource library; cloud-based RAG vector database (collection + vector index + ImageNode / embedding / payload); and index meta information index_meta.json, which includes clip_model, collection, and qdrant_path.
[0048] The connection between the steps is as follows: The RAG vector database (collection / index / vector data) output by S10 + index_meta.json serves as the consistency basis and input for loading the index and encoder in the S20 online retrieval terminal.
[0049] This invention addresses railway over-the-horizon intrusion operations by using an offline image resource library that is "capable of long-term operation and incremental updates" and a RAG vector database as the infrastructure for subsequent "evidence retrieval - structured reasoning - alarm closed loop".
[0050] Step S20: Online query vector generation (retrieval terminal / RAG retrieval service)
[0051] Obtain real-time triggered Chinese intrusion detection questions / queries. Based on the aforementioned index metadata (index_meta.json), encode these questions / queries and perform L2 normalization to obtain a query vector, which is a CLIP text embedding. This step can also output English search queries for alignment and recording.
[0052] The query vector output in this step serves as the input for similarity retrieval in S30; and the clip_model / collection loaded in S20 is the same as that in S10, ensuring consistency in the subsequent retrieval space.
[0053] This invention embeds the strategy of "Chinese question → optional English query → unified vector space retrieval" into the railway intrusion retrieval link, and links it with subsequent evidence package storage and structured judgment in a closed loop.
[0054] Step S30: Vector database similarity retrieval and evidence package generation (Top-K recall / evidence storage)
[0055] The query vector (and optional English query) output in step S20 is retrieved in the RAG vector database output in S10. Based on the retrieval results, the following are obtained: a set of Top-K similar historical scene evidence images (score + image_path / image_id / captions, etc.); and a verifiable "evidence package" (timestamp directory + copy of evidence image + results.txt / results.json, etc.).
[0056] The Top-K evidence image set and structured information of the evidence package output in this step are used as input for S40 evidence-enhanced reasoning (VLM question answering).
[0057] This invention structures the search results into evidence packages (verifiable / traceable) and uses them as evidence input and audit basis for subsequent judgments, thereby enhancing interpretability and closed-loop capability.
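The evidence package described in step S30 (timestamp directory, copies of the evidence images, and a results file) can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names, the timestamp format, and the 40-character keyword limit are assumptions made for the example and are not specified in the source.

```python
import json
import re
import shutil
from datetime import datetime
from pathlib import Path


def make_evidence_dir(base_dir, query, now=None):
    """Create <base_dir>/<timestamp>_<keywords>/ for one query (hypothetical layout)."""
    now = now or datetime.now()
    # Keep word characters only so the query is safe to use in a directory name.
    keywords = re.sub(r"\W+", "_", query).strip("_")[:40]
    path = Path(base_dir) / f"{now:%Y%m%d_%H%M%S}_{keywords}"
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_evidence_package(base_dir, query, results, now=None):
    """Copy evidence images into the package and write results.json next to them."""
    package = make_evidence_dir(base_dir, query, now)
    for item in results:
        src = Path(item["image_path"])
        if src.is_file():  # skip paths that are not reachable locally
            shutil.copy2(src, package / src.name)
    payload = {"query": query, "results": results}
    (package / "results.json").write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return package
```

Copying by file name alone would overwrite images that share a name across cameras; a production version would likely prefix the camera or line identifier.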
[0058] Step S40: Based on the above Top-K similar historical scene evidence image set and verifiable evidence package, use the railway intrusion detection prompt and structured template (VLM_QA_TEMPLATE) to perform evidence enhancement reasoning and obtain the railway beyond-line-of-sight intrusion detection result. The detection result includes structured output + alarm linkage + closed-loop information, specifically including: yes / no / uncertain, evidence description, risk level, suggested action, alarm linkage command / record, and operation log (which can be traced back by associating with the S30 evidence package).
[0059] The logs / review conclusions output in this step can be fed back to support incremental database creation / index updates in S10, forming a closed loop. This invention improves the interpretability, traceability, and continuous adaptability of railway beyond-line-of-sight intrusion scenarios through a closed-loop approach of "evidence retrieval → evidence package → structured template constraint output → alarm linkage → log feedback update".
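The structured output of step S40 and the resulting alarm decision can be sketched as a small data type. The field names, the free-text risk level, and the routing of an "uncertain" verdict to manual review are assumptions made for illustration; the source only states that an intrusion triggers an alarm / platform linkage and that no intrusion means monitoring continues.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class Verdict(str, Enum):
    YES = "yes"
    NO = "no"
    UNCERTAIN = "uncertain"


@dataclass
class DetectionResult:
    verdict: Verdict
    evidence: str            # description of the key evidence
    risk_level: str          # hypothetical free-text level, e.g. "high"
    suggested_action: str
    evidence_package: str    # path of the S30 evidence package, for traceability

    def to_record(self):
        """Return a JSON-serializable dict suitable for an operation log."""
        record = asdict(self)
        record["verdict"] = self.verdict.value
        return record


def next_step(result):
    """Map a structured result to the follow-up action."""
    if result.verdict is Verdict.YES:
        return "trigger_alarm"
    if result.verdict is Verdict.UNCERTAIN:
        return "manual_review"  # assumption: not stated in the source
    return "continue_monitoring"
```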
[0060] The above step S10 specifically includes: constructing an offline image resource library and a RAG vector database.
[0061] A schematic diagram of establishing a RAG vector database in the cloud, provided in an embodiment of the present invention, is shown in Figure 4. This invention relies on the computing power and network conditions of edge acquisition nodes (such as private networks, 5G, Wi-Fi, or wired leased lines), and achieves unified retrieval and inference orchestration through a central platform.
[0062] The specific processing flow for creating the RAG vector database in the cloud includes:
[0063] S10a. First, the trackside / edge nodes upload to the cloud the historical image / video data (or extracted frames) and a list of resources to be added to the database (metadata such as image path, camera position / line / time, etc.). The cloud then forms a cloud directory based on the historical image / video data, the list of resources to be added to the database, and cloud database creation parameters (such as image extension whitelist, maximum truncation number, whether to reset collection, etc.).
[0064] The cloud scans the cloud directory, specifically including:
[0065] It supports recursive scanning of subdirectories and filtering of image extensions (such as .jpg / .png / .webp / ...).
[0066] The scanned image paths are sorted to ensure a stable database creation order within the same dataset;
[0067] Supports truncation by maximum quantity for sampling database building or debugging.
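The three scanning behaviors above (recursive traversal with an extension whitelist, sorted output, and optional truncation) can be sketched with the Python standard library. The function name, the parameter names, and the exact whitelist are assumptions for illustration; the source names only .jpg / .png / .webp as examples.

```python
from pathlib import Path

# Hypothetical whitelist; the source lists .jpg / .png / .webp only as examples.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def scan_images(root, extensions=IMAGE_EXTENSIONS, max_images=None):
    """Recursively collect image paths under root, sorted, optionally truncated."""
    root = Path(root)
    paths = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    # Truncation supports sampled database builds or debugging runs.
    if max_images is not None:
        paths = paths[:max_images]
    return paths
```

Sorting before truncation keeps the selected subset stable across runs on the same dataset.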
[0068] S10b. Initialize the cloud-based Contrastive Language–Image Pre-training (CLIP) encoder (ensuring that the image vector and the text query vector are in the same semantic space, thus enabling similarity retrieval).
[0069] The database creation side uses a local CLIP model in Transformers format as the image encoder. During loading, `local_files_only` is enabled to use local model files and avoid network dependencies.
[0070] Load the CLIP main model;
[0071] Loading processor (responsible for converting PIL images into tensor inputs);
[0072] Move the model to the specified device and set it to eval inference mode.
[0073] S10c. Batch Image Vectorization Encoding (converts each historical scene image into a comparable semantic vector (embedding) and normalizes it, serving as the core index content for writing into the RAG vector database)
[0074] The database creation process involves batch processing and vectorization of images, with the following core steps:
[0075] Read the image files one by one, open them with PIL and convert them to RGB;
[0076] Form a batch, call the encoding function to generate image feature vectors, and perform L2 normalization on them;
[0077] The vector representation (embedding list) corresponding to each image is obtained and written into the vector library.
[0078] For example, a historical image shows "Nighttime, pedestrians entering the vicinity of the track 1km away." After CLIP encoding, a high-dimensional vector is obtained, which expresses the semantics of the image: features such as nighttime, person, track, intrusion, and distance are reflected in the vector space, facilitating similarity matching with the query vector.
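The batching and L2 normalization in S10c can be sketched independently of the encoder. In the sketch below, `encode_batch` is a hypothetical callable standing in for image loading (PIL open plus RGB conversion) and the CLIP forward pass; it is not the CLIP API. Only the batching loop and the normalization are shown.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(items, encode_batch, batch_size=32):
    """Encode items batch by batch and return one normalized vector per item.

    encode_batch is a stand-in for the real encoder: it takes a list of
    items and returns one raw feature vector (list of floats) per item.
    """
    embeddings = []
    for batch in batched(items, batch_size):
        embeddings.extend(l2_normalize(v) for v in encode_batch(batch))
    return embeddings
```

Because every stored vector has unit length, the dot product between a query vector and an image vector equals their cosine similarity.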
[0079] S10d. Construct the data structure for the database (ImageNode + metadata payload, ensuring that after retrieval and recall, not only can the similarity score be obtained, but the specific image can also be located).
[0080] For each image, the system constructs a "vector node," which contains:
[0081] Embedding: The image vector obtained from S10c;
[0082] image_path: Cloud image path (or cloud file ID);
[0083] metadata: The minimum available metadata field.
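The vector node described above can be sketched as a small data class. The class name, the helper `build_node`, and the three metadata keys (line, camera, captured_at) are assumptions for illustration, chosen from the camera position / line / time metadata mentioned for the resource list; the source does not define the minimum metadata fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VectorNode:
    embedding: List[float]                 # normalized image feature vector
    image_path: str                        # cloud image path or cloud file ID
    metadata: Dict[str, str] = field(default_factory=dict)

    def to_payload(self):
        """Return the non-vector fields stored alongside the embedding."""
        return {"image_path": self.image_path, **self.metadata}


def build_node(embedding, image_path, line, camera, captured_at):
    """Assemble a node with a hypothetical minimal metadata set."""
    metadata = {"line": line, "camera": camera, "captured_at": captured_at}
    return VectorNode(embedding=embedding, image_path=image_path, metadata=metadata)
```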
[0084] S10e. Write to the cloud-based RAG vector database and persist the index (write all "vector nodes" to the RAG vector database in the cloud, establish vector indexes and collections, enabling the search end to quickly recall Top-K similar scenarios based on similarity; at the same time, ensure index persistence, supporting long-term operation and incremental updates.)
[0085] Deploy / connect to the RAG vector database in the cloud;
[0086] Create a vector collection;
[0087] Optional: If RESET_COLLECTION=True, then delete the collection with the same name before creating the database to avoid mixing old and new data;
[0088] Write the vector nodes into the vector library and create an index;
[0089] Persist the index and vector data to a disk directory.
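The reset and persistence semantics of S10e can be illustrated with a plain JSON file standing in for the vector database. This is not Qdrant's API and builds no vector index; the function name and file layout are assumptions. It only shows how a reset flag prevents old and new data from mixing while a normal write appends incrementally.

```python
import json
from pathlib import Path


def write_collection(store_dir, collection, nodes, reset_collection=False):
    """Persist nodes to <store_dir>/<collection>.json and return the record count.

    Stand-in for a vector database write: with reset_collection=True the
    existing records are discarded first; otherwise new nodes are appended.
    """
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{collection}.json"
    existing = []
    if path.exists() and not reset_collection:
        existing = json.loads(path.read_text(encoding="utf-8"))
    records = existing + list(nodes)
    path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return len(records)
```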
[0090] S10f. Save index metadata (for loading by online search terminals)
[0091] After the database is created, the system writes index_meta.json to QDRANT_PATH, saving the information required by the online retrieval end, such as:
[0092] num_images: The number of images added to the database;
[0093] clip_model: The path to the CLIP model used for database creation;
[0094] image_dir: The original image directory;
[0095] qdrant_path: The persistent directory for the vector library;
[0096] collection: collection name.
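Writing and reloading index_meta.json with the five fields listed above can be sketched as follows. The function names and the missing-key check are assumptions added for illustration; the field names are those given in the source.

```python
import json
from pathlib import Path

META_FILENAME = "index_meta.json"
REQUIRED_KEYS = ("num_images", "clip_model", "image_dir", "qdrant_path", "collection")


def save_index_meta(qdrant_path, num_images, clip_model, image_dir, collection):
    """Write index_meta.json under the vector database directory."""
    meta = {
        "num_images": num_images,
        "clip_model": str(clip_model),
        "image_dir": str(image_dir),
        "qdrant_path": str(qdrant_path),
        "collection": collection,
    }
    out = Path(qdrant_path) / META_FILENAME
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
    return out


def load_index_meta(qdrant_path):
    """Load the metadata on the retrieval side and fail early if a key is missing."""
    text = (Path(qdrant_path) / META_FILENAME).read_text(encoding="utf-8")
    meta = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in meta]
    if missing:
        raise KeyError(f"index_meta.json is missing keys: {missing}")
    return meta
```

Loading the same clip_model and collection that were used at build time is what keeps the query vectors and image vectors in one space.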
[0097] The output data of step S10 includes: cloud-based offline image resource library (image files + metadata list);
[0098] Cloud-based RAG vector database (collection + vector index + ImageNode / embedding / payload);
[0099] index_meta.json (num_images / clip_model / image_dir / qdrant_path / collection, etc.).
[0100] Step S20 above specifically includes: online query vector generation and Top-K similarity scene retrieval (retrieval terminal / RAG retrieval service).
[0101] S20a. Load the RAG database index and retrieval encoder (ensuring "consistent vector space, consistent collection, and consistent connection" during retrieval). The online retrieval end first reads index_meta.json and connects to Qdrant based on clip_model and collection to restore the retrieval index.
[0102] S20b. Chinese Question → English Search Query Generation (Improving Recall Quality (especially the alignment of proper nouns, action descriptions, etc. in the English semantic space))
[0103] To improve the performance of Chinese CLIP searches, the system supports translating Chinese queries into English before searching.
[0104] S20c. Text vectorization encoding (encoding online search queries into query vectors in the same space as image vectors, used for similarity retrieval in the RAG database)
[0105] The retrieval end encodes the (English or Chinese) query into a text vector. The core process is as follows:
[0106] Query segmentation / preprocessing;
[0107] Call CLIP to obtain the text feature vector;
[0108] Perform L2 normalization on the text vector.
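The preprocessing and normalization steps above can be sketched as follows. The CLIP text encoder is replaced by an arbitrary callable (text_encoder is a hypothetical stand-in, not a real CLIP API), and the whitespace cleanup is only an assumed form of preprocessing. The L2 normalization itself is standard: each component is divided by the Euclidean norm, so that the dot product of two normalized vectors equals their cosine similarity.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def encode_query(text, text_encoder):
    """Encode a query string into a unit-length query vector.

    text_encoder is any callable mapping a string to a list of floats
    (a hypothetical stand-in for the text branch of the CLIP model).
    """
    cleaned = " ".join(text.split())  # assumed preprocessing: collapse whitespace
    return l2_normalize(text_encoder(cleaned))
```

For example, l2_normalize([3.0, 4.0]) yields a vector of length 1 pointing in the same direction, approximately [0.6, 0.8].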
[0109] For example, the Chinese question asks (shown here in translation): "Is there a person trespassing on the railway track about 1 km ahead?" The optional English search query is: "Is there a person trespassing on the railway track about 1 km ahead?" The "query vector" is the high-dimensional vector obtained by encoding this sentence; it expresses semantics such as "person / trespassing / railway / distance ahead" and is compared with the historical image vectors.
[0110] The above step S30 specifically includes:
[0111] S30a. Vector database similarity retrieval: recall the Top-K similar historical scenarios.
[0112] The retrieval system uses the encoded query vector for similarity retrieval to recall Top-K image evidence.
[0113] Construct the retriever and specify the collection (a data container used to store CLIP embeddings and their corresponding metadata (payload)).
[0114] Perform a similarity search to obtain the Top-K results;
[0115] Filter ImageNode type nodes one by one and extract the following fields to form the recall results: score: similarity score; image_path / file_name / image_id / captions: image path and metadata.
[0116] For the query above, the RAG database may recall similar scenarios as follows:
[0117] Similar scene #1: line_A / cam_03 / 2025-12-02_20:31.jpg (score=0.82)
[0118] Similar scene #2: line_B / cam_11 / 2025-11-18_07:12.jpg (score=0.77)
[0119] Similar scene #3: line_A / cam_03 / 2025-10-09_19:05.jpg (score=0.71)
[0120] These "similar scenes" are essentially historical images plus metadata. They are recalled by vector retrieval because of their semantic similarity and serve as evidence input for subsequent judgments.
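The recall logic of S30a can be sketched with a brute-force, in-memory search. This is only an illustration of the ranking and filtering: in the described system the search is delegated to the vector database through a retriever, and the dictionary layout of each node used here is a hypothetical simplification. The sketch assumes all embeddings and the query vector are already L2-normalized, so the dot product is the cosine similarity.

```python
def top_k_image_evidence(query_vec, nodes, k=3):
    """Return the k best-scoring image nodes as recall results.

    nodes: list of dicts with a 'type', an L2-normalized 'embedding',
    and optional metadata fields (hypothetical in-memory layout).
    """
    hits = []
    for node in nodes:
        if node.get("type") != "ImageNode":  # keep image-type nodes only
            continue
        score = sum(q * e for q, e in zip(query_vec, node["embedding"]))
        hits.append({
            "score": score,
            "image_path": node.get("image_path"),
            "file_name": node.get("file_name"),
            "image_id": node.get("image_id"),
            "captions": node.get("captions"),
        })
    hits.sort(key=lambda hit: hit["score"], reverse=True)  # highest similarity first
    return hits[:k]
```

A linear scan like this is practical only for small libraries; an approximate nearest-neighbor index in the vector database serves the same purpose at scale.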
[0121] S30b. Save the retrieval results to disk in structured form (for easy verification / traceability).
[0122] The system saves the evidence results of each retrieval to disk, forming a verifiable "evidence package":
[0123] Create a timestamp directory for each query (the directory name contains the timestamp and query keywords);
[0124] Copy the recalled images to the output directory. The filenames should include information such as rank, score, and image_id to facilitate manual comparison and review.
[0125] Simultaneously, results.txt and results.json are generated, recording: Chinese questions, English search queries, search models, collections, Top-K, similarity scores, VLM answers, etc.
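The "evidence package" of S30b can be sketched as follows, using only the Python standard library. The function name, the exact directory and file naming patterns, and the JSON keys are assumptions made for illustration; the sketch writes results.json only and omits results.txt and the VLM answers, which would be added in the same way.

```python
import json
import re
import shutil
from datetime import datetime
from pathlib import Path


def save_evidence_package(out_root, question_zh, query_en, hits, now=None):
    """Copy recalled images into a timestamped directory and write results.json."""
    now = now or datetime.now()
    # Turn the query into a short, filesystem-safe keyword string.
    keywords = re.sub(r"[^\w]+", "_", query_en).strip("_")[:40]
    pkg_dir = Path(out_root) / f"{now:%Y%m%d_%H%M%S}_{keywords}"
    pkg_dir.mkdir(parents=True, exist_ok=True)
    for rank, hit in enumerate(hits, start=1):
        src = Path(hit["image_path"])
        # File name carries rank, score, and image_id for manual comparison.
        dst = pkg_dir / f"rank{rank:02d}_score{hit['score']:.2f}_{hit['image_id']}{src.suffix}"
        shutil.copy2(src, dst)
    record = {
        "question_zh": question_zh,
        "query_en": query_en,
        "top_k": len(hits),
        "results": hits,
    }
    (pkg_dir / "results.json").write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return pkg_dir
```

Copying the images rather than linking to them keeps each package self-contained, so a reviewer can inspect it even if the source library is later reorganized.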
[0126] The above step S40 specifically includes: evidence enhancement reasoning and intrusion detection question answering (reasoning / generation module).
[0127] S40a. Load the Vision-Language Model (VLM)
[0128] When VLM is enabled, the system loads a local visual language model to perform Chinese intrusion detection question answering on the recalled images:
[0129] Load a VLM (for example, MiniCPM-V);
[0130] Select the dtype according to the device (e.g., bfloat16 for CUDA and float16 for MPS);
[0131] Use the corresponding tokenizer;
[0132] During inference, call model.chat(...) to get the answer.
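The device-dependent dtype choice in S40a can be sketched as a small pure function. It returns dtype names as strings so that it runs without any deep learning framework installed; the function name is hypothetical, and the float32 fallback for other devices is an assumption that is not stated in the description.

```python
def select_dtype_name(device):
    """Map a device string to the dtype name used when loading the VLM."""
    if device.startswith("cuda"):
        return "bfloat16"  # stated in the description for CUDA devices
    if device == "mps":
        return "float16"   # stated in the description for MPS devices
    return "float32"       # assumption: fallback for CPU and other devices
```

With PyTorch, the returned name could be resolved to an actual dtype with getattr(torch, select_dtype_name(device)) before the model is loaded.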
[0133] S40b. Construct a railway intrusion detection prompt and perform question answering on each evidence image (standardized prompts and structured templates constrain the judgment for each evidence image to the form "Conclusion - Evidence - Risk Level - Recommended Action").
[0134] The system performs question-and-answer reasoning once for each recalled image:
[0135] Using the VLM_QA_TEMPLATE unified output structure, the model output should be:
[0136] 1. Conclusion (Yes / No / Uncertain)
[0137] 2. Evidence (location / category / appearance, etc.)
[0138] 3. Risk level (low / medium / high)
[0139] 4. Recommended actions (alert / review / continue monitoring)
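The four-part output structure can be sketched as a prompt template plus a parser. The content of VLM_QA_TEMPLATE is not given in the description, so EXAMPLE_QA_TEMPLATE below is a hypothetical template written for illustration, and the parser assumes the model answers with one numbered line per field.

```python
import re

# Hypothetical template; the actual VLM_QA_TEMPLATE text is not disclosed.
EXAMPLE_QA_TEMPLATE = (
    "Question: {question}\n"
    "Answer in exactly four numbered lines:\n"
    "1. Conclusion (Yes / No / Uncertain)\n"
    "2. Evidence (location / category / appearance, etc.)\n"
    "3. Risk level (low / medium / high)\n"
    "4. Recommended actions (alert / review / continue monitoring)\n"
)

FIELDS = {1: "conclusion", 2: "evidence", 3: "risk_level", 4: "recommended_action"}


def parse_vlm_answer(answer):
    """Extract the four numbered fields from a model answer; missing ones stay None."""
    parsed = {name: None for name in FIELDS.values()}
    for line in answer.splitlines():
        # Matches "N. Label: value" or "N) value"; the label part is optional.
        match = re.match(r"\s*([1-4])[.)]\s*(?:[^:：]*[:：])?\s*(.+)", line)
        if match:
            parsed[FIELDS[int(match.group(1))]] = match.group(2).strip()
    return parsed
```

Fields that cannot be parsed remain None, which lets downstream logic treat a malformed answer as uncertain rather than as a negative result.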
[0140] S40c. Structured output for alarm linkage and system closed-loop.
[0141] If an intrusion is detected → trigger an alarm / link to the platform;
[0142] If no intrusion is detected → continue monitoring;
[0143] At the same time, the "retrieval evidence + reasoning conclusion + timestamp + device / region information" will be written into the log as a basis for subsequent review and model / library updates.
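The closed-loop step in S40c can be sketched as a decision function plus a log record builder. The function names, action labels, and record keys are hypothetical. Routing an uncertain or unparsable conclusion to manual review is an assumption suggested by the "review" option among the recommended actions; the description itself only specifies the alarm and continue-monitoring branches.

```python
import json
from datetime import datetime, timezone


def decide_action(conclusion):
    """Map the VLM conclusion to a system action."""
    text = (conclusion or "").strip().lower()
    if text == "yes":
        return "alarm"                # intrusion detected: trigger alarm / platform linkage
    if text == "no":
        return "continue_monitoring"  # no intrusion detected
    return "manual_review"            # assumption: uncertain or missing conclusion


def build_log_record(parsed, evidence, device, region, now=None):
    """Serialize retrieval evidence, reasoning conclusion, timestamp, and device / region info."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "device": device,
        "region": region,
        "conclusion": parsed.get("conclusion"),
        "action": decide_action(parsed.get("conclusion")),
        "evidence": evidence,
    }
    return json.dumps(record, ensure_ascii=False)
```

Writing one JSON line per decision keeps the log easy to search when a case is later reviewed or used to update the resource library.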
[0144] The key information exchange steps of this invention can be summarized in the following format:
[0145] The system's interaction process is as follows. In the offline phase (cloud-based database construction), the database construction end receives historical images collected by edge cameras and uploaded to the cloud. After scanning the cloud image directory, it uses a Contrastive Language–Image Pre-training (CLIP) model to extract and normalize a semantic feature vector for each image. It encapsulates the "image vector + cloud image path / filename / number and other metadata" into entries and writes them to the RAG (Retrieval-Augmented Generation) database, which can be implemented, for example, on a vector database such as Qdrant. It simultaneously generates an index_meta.json file recording index metadata such as model path / version, collection name, and data size for loading by the online retrieval end. In the online phase, when an intrusion detection query is triggered by the train control console / center side, the system receives the Chinese question and (optionally) uses the m2m418 translation model to convert it into an English retrieval query (e.g., a Chinese question meaning "Is there a person trespassing on the railway track about 1 km ahead?" is translated into the English query "Is there a person trespassing on the railway track about 1 km ahead?"). Then, the text encoder of the same CLIP model used for database construction encodes and normalizes the query to obtain a query vector (i.e., the semantic vector representation of the question). The system then calls the retrieval module (e.g., LlamaIndex accessing the RAG database) to perform a similarity retrieval in the specified collection, recalling Top-K similar historical scene images together with their similarity scores and metadata as "evidence" (similar scenes include: pedestrians approaching the track at night, construction workers encroaching on the track in the early morning, animals entering the track, etc.). Next, the system calls the Vision-Language Model (VLM; e.g., MiniCPM-V-2.6) on each evidence image using the Chinese "Railway Intrusion Detection" prompt template. The model outputs structured results: whether an intrusion has occurred, key evidence, risk level, and suggested actions. The retrieved evidence images are copied to a timestamped output directory. At the same time, the system saves results.txt / results.json containing the Chinese question, English search query, Top-K, similarity scores, and VLM answers for subsequent review, alarm linkage, and closed-loop updates.
[0146] In summary, this invention constructs a multimodal resource library consisting of an "intrusion sample library / non-intrusion control library" and adopts a retrieval-enhanced generation (RAG) inference mechanism. When an alarm is raised, the system first retrieves similar evidence and only then generates an interpretation, which significantly reduces false alarms and hallucinations in complex environments and improves the reliability and interpretability of detection in beyond-line-of-sight scenarios. Evidence retrieval based on vector retrieval and metadata filtering enables rapid localization of forward segments and robust cross-scenario identification. A differential calibration and multi-evidence consistency fusion strategy based on "intrusion evidence + non-intrusion control evidence" reduces disturbances. Based on structured output templates, the system can simultaneously provide intrusion conclusions, evidence descriptions, risk levels, and suggested actions, which facilitates rapid decision-making by dispatchers and forms a traceable evidence chain, thus improving the auditability and review efficiency of security management. Furthermore, by feeding manual review results back into the resource library and incrementally updating the vector index, the system achieves a continuous iteration capability in which detection is enhanced as soon as a sample enters the library. This reduces the costs of frequent retraining and long-term deployment and improves the system's adaptability and availability in long-term operation across multiple lines and multiple locations.
[0147] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of one embodiment, and the modules or processes shown in the drawings are not necessarily essential for implementing the present invention.
[0148] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of the present invention.
[0149] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments can be referred to one another. Each embodiment focuses on describing the differences from other embodiments. In particular, since apparatus or system embodiments are basically similar to method embodiments, their description is relatively simple; for relevant parts, refer to the descriptions in the method embodiments. The apparatus and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0150] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A railway beyond-line-of-sight intrusion detection method based on multimodal retrieval enhancement, characterized in that it comprises: The cloud-based system constructs a RAG vector database and index metadata based on historical image / video data uploaded from trackside / edge nodes, a list of resources to be added to the database, and cloud-based database construction parameters; Obtain real-time triggered Chinese intrusion detection questions / queries, and perform L2 normalization on the Chinese intrusion detection questions / queries based on the index metadata to obtain query vectors; The query vector is used to perform a similarity search in the RAG vector database, and a Top-K similar historical scene evidence image set and a verifiable evidence package are obtained based on the search results; Based on the Top-K similar historical scene evidence image set and verifiable evidence package, evidence enhancement reasoning is performed using the railway intrusion detection Prompt and structured templates to obtain railway beyond-line-of-sight intrusion detection results.
2. The method according to claim 1, characterized in that, The cloud-based system constructs a RAG vector database and index metadata based on historical image / video data uploaded from trackside / edge nodes, a list of resources to be added to the database, and cloud-based database construction parameters, including: The trackside / edge nodes upload historical scene image / video data and a list of resources to be added to the cloud. The cloud then generates a cloud catalog based on the historical image / video data, the list of resources to be added, and the cloud database creation parameters. The CLIP encoder is initialized in the cloud. The CLIP encoder reads historical scene images one by one, opens the historical scene images using PIL, converts the historical scene images to RGB, forms a batch, calls the encoding function, generates image feature vectors, and performs L2 normalization on the image feature vectors. For each historical scene image, the system constructs a vector node, which includes: the normalized image feature vector, the cloud image path, and the minimum available metadata field. A RAG vector database is built in the cloud, and all vector nodes are written into the RAG vector database. After the RAG vector database is built, the system writes the index metadata of the RAG vector database under QDRANT_PATH. This index metadata includes: the number of images added to the database, the CLIP model path used for database construction, the original image directory, the persistent directory and name of the vector database.
3. The method according to claim 2, characterized in that, The process of obtaining real-time triggered Chinese intrusion detection questions / queries involves performing L2 normalization on the Chinese intrusion detection questions / queries based on the index metadata to obtain a query vector, including: Obtain real-time triggered Chinese intrusion detection questions / queries, translate the Chinese intrusion detection questions / queries into English, and generate an English search query; The index metadata of the RAG vector database is loaded, and the English retrieval query is segmented by the retrieval encoder based on the index metadata. Then, CLIP is called to obtain the text feature vector, and L2 normalization is performed on the text feature vector to encode the online retrieval query into a query vector in the same space as the image vector.
4. The method according to claim 3, characterized in that, The process of performing a similarity search on the query vector in the RAG vector database, and obtaining a Top-K similar historical scene evidence image set and a verifiable evidence package based on the search results, includes: The query vector is encoded, and a retrieval tool is constructed based on the encoded query vector. The data container collection of the retrieval tool is specified. The retrieval tool is used to perform a similarity search on the RAG vector database using the encoded query vector. ImageNode image type nodes are filtered one by one, and the following fields are extracted to form the search results: score: similarity score; image_path / file_name / image_id / captions: image path and metadata. The Top-K similar historical scene evidence image set is obtained based on the search results. The system saves the evidence results and metadata of each search to form a verifiable evidence package. The system creates a timestamp directory for each query, with the directory name containing the timestamp and query keywords.
5. The method according to claim 4, characterized in that, The method of using the Top-K similar historical scene evidence image set and verifiable evidence package to perform evidence enhancement reasoning with railway intrusion detection Prompt and structured templates to obtain railway beyond-line-of-sight intrusion detection results includes: Load the visual language model, select the dtype according to the device, use the corresponding tokenizer, and construct a railway intrusion detection prompt. Using the railway intrusion detection prompt, call `model.chat(...)` through the visual language model to perform question-and-answer reasoning on each similar historical scene evidence image. Use `VLM_QA_TEMPLATE` to output the railway beyond-line-of-sight intrusion detection result corresponding to the real-time triggered Chinese intrusion detection question / query. This result includes: conclusion, evidence, and suggested actions. If an intrusion is detected, trigger an alarm / link to the platform; if no intrusion is detected, continue monitoring.