A multimodal collaborative public transportation hub operation and maintenance method and system
By adopting a multimodal collaborative approach to the operation and maintenance of public transportation hubs, the problems of cross-modal associations relying on external resources, insufficient semantics, and low fusion efficiency in multimodal knowledge bases are solved. This approach achieves deep semantic alignment and dynamic updates of multimodal data, thereby improving operation and maintenance efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG SUPCON INFORMATION TECH CO LTD
- Filing Date
- 2025-12-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for building multimodal knowledge bases based on knowledge graphs suffer from several core defects, including excessive reliance on the completeness of external resources for cross-modal associations, neglect of semantic diversity in image screening leading to missing key perspectives, incorrect matching due to semantic ambiguity in multimodal data fusion, and a lack of dynamic adaptability making it difficult to support real-time updates.
A multimodal collaborative public transportation hub operation and maintenance method is adopted, which includes collecting multi-source heterogeneous data for data preprocessing, constructing a multimodal contrastive learning framework for joint training, dynamically allocating modal weights, constructing a self-attention network, integrating multi-source heterogeneous data and constructing a structured knowledge base and its heterogeneous graph, constructing a cross-modal reasoning link when inputting operation and maintenance query requirements, and updating the heterogeneous graph link in real time.
It achieves deep semantic alignment of cross-modal features, retains high-value samples from multiple scenarios, dynamically adjusts modal contribution, supports online updates, and forms a multimodal collaborative knowledge base that covers accurate associations, semantic integrity, and efficient fusion, thus solving the shortcomings of traditional methods in terms of robustness, comprehensiveness, reasoning efficiency, and adaptability.
Figure CN122243707A_ABST
Description
Technical Field
[0001] This application belongs to the field of intelligent transportation operation and maintenance technology, and in particular relates to a multimodal collaborative public transportation hub operation and maintenance method and system.
Background Technology
[0002] Public transportation hubs are key nodes in urban transportation networks, and their operation and maintenance (O&M) directly affects the efficiency, safety, and passenger experience of the transportation system. Applying knowledge graphs to hub O&M can improve efficiency, reduce costs, and enhance safety and reliability by integrating multiple data elements, constructing O&M knowledge graphs, and implementing intelligent O&M functions. An unstructured multimodal knowledge base can be built by using natural language processing (NLP) for information extraction and text mining of O&M data to construct knowledge graphs, combined with machine learning models for semantic understanding and question-answering reasoning. Unstructured knowledge bases are flexible knowledge management tools designed for storing, managing, and utilizing data that lacks a predefined format or organizational structure, such as text, images, audio, and video. In complex scenarios such as public transportation hub O&M, they can capture and integrate multi-source heterogeneous unstructured information, provide comprehensive support for O&M decisions, and suit the richer knowledge representation and interaction scenarios that arise in hub O&M.
[0003] Existing patent CN119722421A discloses a knowledge graph-based intelligent operation and maintenance system for urban rail transit, comprising: a rail transit fault acquisition module, used to collect fault report information of each fault generated during the operation of the urban rail transit system in real time; a fault element extraction module, which is communicatively connected to the rail transit fault acquisition module and is used to receive fault report information and extract elements from the fault report information one by one to obtain fault element information that can comprehensively cover each fault; a fault matching module, used to match the fault element information of each fault with the knowledge graph of historical faults and output the matching result; and an operation and maintenance module, which evaluates multiple faults based on the matching result output by the fault matching module and provides maintenance plan information. This invention achieves rapid fault identification, accurate location, efficient processing, and intelligent operation and maintenance management.
Summary of the Invention
[0004] Existing technologies for building multimodal knowledge bases based on knowledge graphs generally suffer from several core defects, including excessive reliance on the completeness of external resources for cross-modal associations, neglect of semantic diversity in image screening leading to the loss of key perspectives, incorrect matching due to semantic ambiguity in multimodal data fusion, and lack of dynamic adaptability making it difficult to support real-time updates.
[0005] To address the aforementioned technical problems, the present invention provides a technical solution: a multimodal collaborative operation and maintenance method for public transportation hubs, comprising the following steps:
S1. Collect multiple data modalities from the operation and maintenance of public transportation hubs as multi-source heterogeneous data, and perform data preprocessing to obtain recommended solutions for violations or dangerous behaviors;
S2. Construct a multimodal contrastive learning framework for joint training, and perform deep semantic matching on the multi-source heterogeneous data and the recommended solutions;
S3. Construct a self-attention network and dynamically allocate the weight of each modality in the multi-source heterogeneous data;
S4. Integrate the multi-source heterogeneous data and the recommended solutions to build a structured knowledge base and its heterogeneous graph;
S5. When an operation and maintenance query is input, construct a cross-modal reasoning link and output the final solution plan;
S6. Update the heterogeneous graph links based on new operation and maintenance data.
[0006] Specifically, the multi-source heterogeneous data in S1 includes operation and maintenance data in three modalities: text, images, and videos. Data preprocessing includes structured and unstructured processing. Structured processing includes parsing text data such as operating parameters of public transportation hub equipment and rule manuals. Unstructured processing includes using YOLO models or multimodal large models to identify and analyze violations or dangerous passenger behaviors captured in images and surveillance video data during the operation of public transportation hubs, analyzing historical solutions, and extracting key entities of violations or dangerous passenger behaviors using BERT models. Combined with the information collected from the equipment, the location and time of the event are extracted. During scenario analysis, recommended solutions are also generated for violations or dangerous passenger behaviors within the public transportation hub.
[0007] Specifically, S2 constructs a multimodal contrastive learning framework based on the CLIP-ViT architecture, comprising an input layer, a multimodal encoder, and an output layer. The input layer simultaneously receives three types of heterogeneous data: text, images, and videos. The multimodal encoder is a dual-tower structure consisting of an image encoder (Vision Transformer) and a text encoder (Transformer). The output layer generates a 128-dimensional joint embedding vector, which is used for cross-modal retrieval via cosine similarity to quickly match near-time cases and historical solution records in the historical database.
[0008] Specifically, the image encoder in the multimodal contrastive learning framework segments the input image and embeds positional codes, and uses a multi-layer self-attention mechanism to extract the spatial distribution features of high-temperature areas; the text encoder performs word segmentation and word embedding processing on the alarm text of violations and dangerous behaviors to capture the semantic connections of keywords.
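The image segmentation step can be illustrated with a minimal sketch. It assumes the 16×16 pixel block size given in the detailed embodiment, represents the image as a plain list of rows, and uses a running index as a stand-in for the positional code; the function name `split_into_patches` is hypothetical and not taken from the application.

```python
def split_into_patches(image, patch=16):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch blocks.

    Returns (position_index, flattened_block) pairs in row-major order.
    The position index is only a placeholder for a learned positional code.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append((len(patches), block))
    return patches


# Toy 32 x 32 single-channel image whose pixel value encodes its coordinates.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
patches = split_into_patches(image)
```

In a real Vision Transformer each flattened block would then be linearly projected and summed with its positional embedding before entering the self-attention layers.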
[0009] Specifically, when the multimodal contrastive learning framework in S2 is jointly trained, it adopts a contrastive learning method and uses the InfoNCE loss function to reduce the distance between text-image-video embedding vectors under the same device failure event, while expanding the modal distance between different violations and dangerous behaviors.
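As an illustration of the contrastive objective, the following is a minimal sketch of an InfoNCE loss over paired embeddings. It assumes that `positives[i]` is the embedding from another modality for the same event as `anchors[i]`, that all other items in the batch act as negatives, and that the temperature is 0.07; none of these details are specified in the application.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i] (assumed pairing)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy text embeddings and two candidate sets of image embeddings.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]
misaligned_images = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(text_embeddings, aligned_images)
loss_misaligned = info_nce(text_embeddings, misaligned_images)
```

The loss is small when embeddings of the same event lie close together and large when they are swapped, which is the behaviour the joint training relies on.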
[0010] Specifically, in S3, the self-attention network takes the semantic features of the query scenario as input, calculates the association strength between each modality and the query intent through a multi-head attention layer, and generates a comprehensive embedding vector containing dynamic temporal and static spatial features through weighted fusion.
[0011] Specifically, the node types in the heterogeneous graph in S4 include text nodes, image nodes, and video nodes; the nodes achieve cross-modal associations through dynamic edge relationships, which specifically cover co-occurrence relationships and causal relationships.
[0012] Specifically, when a user inputs an operation and maintenance query request in S5, cross-modal retrieval is performed based on the comprehensive embedding vector output by the self-attention network and the cosine similarity of the vector. Following the modal transformation order of image node → video node → text node, a hidden association path is searched from the heterogeneous graph. The multimodal data and recommended solutions on the association path are then validated in a closed loop. Structured data such as the event occurrence scenario and time are automatically integrated to generate the final solution plan.
[0013] Specifically, S6 categorizes new operational data, updates and replaces new rule-based text data with complete content that is timely and authoritative, and defines new violations and dangerous behaviors; it also performs similarity calculation and comparison analysis on routine operational data, and automatically integrates and classifies highly similar content.
[0014] This invention also provides a multimodal collaborative public transportation hub operation and maintenance system. Using the aforementioned multimodal collaborative public transportation hub operation and maintenance method, it includes a data acquisition and analysis module that collects multi-source heterogeneous data and generates recommended solutions for violations or dangerous behaviors; a joint training module that performs deep semantic matching between the multi-source heterogeneous data collected by the data acquisition and analysis module and the recommended solutions; a modality weight allocation module that dynamically allocates weights to each modality of the multi-source heterogeneous data; a heterogeneous graph construction module that constructs a knowledge base and its heterogeneous graph based on the multi-source heterogeneous data and recommended solutions after deep semantic matching; and a collaborative update module that collects new operation and maintenance data in real time to update the heterogeneous graph.
[0015] The beneficial effects of this invention are as follows: by introducing a vision-language joint embedding model to achieve deep semantic alignment of cross-modal features, combining graph neural networks and community detection algorithms to retain high-value samples from multiple scenarios in image screening, designing a dynamic weight allocation mechanism to adaptively adjust modal contribution based on the query scenario, and using reinforcement learning to optimize the association weights of knowledge entities to support online updates, a multimodal collaborative knowledge base construction scheme covering accurate association, semantic integrity, efficient fusion, and dynamic expansion is finally formed, effectively solving the shortcomings of traditional methods in terms of robustness, comprehensiveness, inference efficiency, and adaptability.
Attached Figure Description
[0016] Figure 1 is a flowchart of the present invention.
[0017] Figure 2 is a system structure diagram of the present invention.
Detailed Implementation
[0018] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0019] Example 1: A multimodal collaborative operation and maintenance method for public transportation hubs which, as shown in Figure 1, includes the following steps:
S1. Collect multiple data modalities from the operation and maintenance of public transportation hubs as multi-source heterogeneous data, and perform data preprocessing to obtain recommended solutions for violations or dangerous behaviors;
S2. Construct a multimodal contrastive learning framework for joint training, and perform deep semantic matching on the multi-source heterogeneous data and the recommended solutions;
S3. Construct a self-attention network and dynamically allocate the weight of each modality in the multi-source heterogeneous data;
S4. Integrate the multi-source heterogeneous data and the recommended solutions to build a structured knowledge base and its heterogeneous graph;
S5. When an operation and maintenance query is input, construct a cross-modal reasoning link and output the final solution plan;
S6. Update the heterogeneous graph links based on new operation and maintenance data.
[0020] This embodiment also provides a multimodal collaborative public transportation hub operation and maintenance system that uses the above-described public transportation hub operation and maintenance method. As shown in Figure 2, the system includes a data acquisition and analysis module, which collects multi-source heterogeneous data and generates recommended solutions for violations or dangerous behaviors; a joint training module, which performs deep semantic matching on the multi-source heterogeneous data and recommended solutions collected by the data acquisition and analysis module; a modality weight allocation module, which dynamically allocates weights to each modality of the multi-source heterogeneous data; a heterogeneous graph construction module, which constructs a knowledge base and its heterogeneous graph based on the multi-source heterogeneous data and recommended solutions after deep semantic matching; and a collaborative update module, which collects new operation and maintenance data in real time to update the heterogeneous graph.
[0021] The multi-source heterogeneous data in S1 includes maintenance data in three modalities: text, images, and videos. Data preprocessing includes structured and unstructured processing. For structured data processing, the system uses OCR (Optical Character Recognition) technology to perform high-precision parsing of documents such as subway equipment parameter tables and various rule manuals, automatically extracting key data such as equipment start-up time and equipment inspection time. For unstructured data processing, for violation detection, effective scene analysis is conducted based on the YOLO model and a multimodal large-scale model to accurately identify behaviors such as "illegal smoking," "climbing over turnstiles," and "strollers on escalators." The system also uses the BERT model to extract key entities such as "smoking" and "strollers," and combines this with equipment information to extract attribute information such as "shooting location" and "shooting time," providing guidance for violation detection and solution generation in the station hall. For intelligent passenger guidance in subway stations, effective scene analysis is conducted based on a multimodal large-scale model to extract key information such as "passenger flow" and "escalator operation status," and combined with equipment information such as "train operation status," providing passengers with comprehensive travel plans and route selections.
[0022] During joint training in S2, the system leverages the CLIP-ViT architecture's multimodal contrastive learning framework to achieve deep semantic matching of text, image, and video data (using the generation of emergency plans for elderly people falling on subway escalators as a specific example): the input layer simultaneously receives three types of heterogeneous data; the text layer imports historical case reports of passenger falls and corresponding emergency plans; the image layer accesses screenshots of passenger falls in the escalator area obtained from monitoring equipment; and the video layer loads video frame sequences before and after the passenger falls. The core model consists of a dual-tower Vision Transformer and a Transformer encoder. The image encoder segments the input image into 16×16 pixel blocks and embeds positional codes, utilizing a multi-layer self-attention mechanism to extract spatial distribution features of high-temperature areas; the text encoder performs word segmentation and word embedding on the alarm text, capturing semantic connections between keywords such as "fall" and "escalator" through the Transformer's contextual modeling function. The two encoders achieve intermodal parameter collaboration through a weight-sharing mechanism. In the training phase, a contrastive learning approach is employed. The InfoNCE loss function reduces the distance between text-image-video embedding vectors for the same equipment malfunction event, while simultaneously expanding the modal distance between different events. For example, the joint embedding vector of the passenger fall image, passenger information, and historical fall prevention plans corresponding to "Someone fell at xx subway station, xx exit" is controlled to be within a neighboring vector space. 
The output layer ultimately generates a 128-dimensional joint embedding vector (e.g., [0.12, -0.05, 0.87, ...]), which can be used for cross-modal retrieval through cosine similarity calculation. When a new input text "A child fell at xx subway station, Exit A" is received, the system can quickly match semantically similar fall cases and solution records in the historical database, providing subway maintenance personnel with more fault diagnosis references.
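The retrieval step described above can be sketched as a cosine-similarity ranking over stored case embeddings. The sketch uses 3-dimensional toy vectors instead of the 128-dimensional embeddings, and the case identifiers and the `retrieve` helper are hypothetical illustrations rather than part of the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, case_embeddings, top_k=2):
    """Return the top_k (case_id, score) pairs ranked by cosine similarity."""
    scored = [(cosine_similarity(query, emb), case_id)
              for case_id, emb in case_embeddings.items()]
    scored.sort(reverse=True)
    return [(case_id, score) for score, case_id in scored[:top_k]]


# Hypothetical historical cases with toy joint embeddings.
cases = {
    "fall_escalator": [0.12, -0.05, 0.87],
    "smoking_waiting_area": [-0.80, 0.40, 0.10],
    "fall_exit_a": [0.10, 0.02, 0.90],
}
query = [0.11, 0.00, 0.88]  # stands in for the embedding of a new fall report
results = retrieve(query, cases)
```

With embeddings that are normalized in advance, the same ranking reduces to a dot product, which is the usual choice when the historical database is large.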
[0023] In the dynamic weight allocation process of S3, the dynamic modal weight allocation mechanism achieves scenario-based fusion of cross-modal information by introducing a self-attention network. This mechanism takes the semantic features of the query scenario as input and calculates the correlation strength between each modality and the query intent through a multi-head attention layer. For example, in the smart Q&A scenario at a train station, when a user enters the text query "Which line from station A to station B has less passenger flow?", the system calculates the attention weight and finds that keywords such as "station" and "line" in the text modality highly match the station and line knowledge base. At this time, the weight of the text modality is automatically increased to 70%. At the same time, by associating visual features such as "passenger flow" and "station status" from multiple surveillance cameras through cross-modal attention heads, the video modality is given a weight of 30% to assist in diagnosis. In the abnormal behavior monitoring scenario, for the query requirement of "contingency plan for elderly people falling on escalators", the system uses the attention mechanism to identify that the temporal features of the elderly person's fall process in the video modality are strongly correlated with the query intent, so the weight of the video modality is set to 60%. Simultaneously, static information (such as time and location) of the fall area in the text modality is assigned a weight of 40%. Finally, a comprehensive embedding vector containing dynamic temporal and static spatial features is generated through weighted fusion and used for subsequent pre-defined retrieval. This mechanism achieves real-time adjustment of modality weights through dynamic gating units. 
Its training process adopts a contrastive learning strategy to minimize the distance difference between embedding vectors of different modality combinations under the same query, while maximizing the difference in modality weight distribution between different queries. This ensures that modality contribution can still be accurately allocated in complex scenarios (such as mixed queries between security checkpoints and waiting areas), providing scene-adaptive semantic representation capabilities for the multimodal joint embedding space.
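The weighting idea can be sketched with a single scaled dot-product attention step: the query embedding is scored against one embedding per modality, the scores are turned into weights with a softmax, and the modality embeddings are fused with those weights. This is a simplification under stated assumptions; it omits the learned projections, multiple heads, and gating units mentioned in the text, and the function `fuse` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(query, modality_embeddings):
    """Return (weights per modality, fused embedding) for one query vector."""
    names = list(modality_embeddings)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, modality_embeddings[name])) / scale
              for name in names]
    weights = dict(zip(names, softmax(scores)))
    fused = [sum(weights[name] * modality_embeddings[name][i] for name in names)
             for i in range(len(query))]
    return weights, fused


# Toy embeddings: the query is deliberately closest to the video modality.
modalities = {
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0, 0.0],
    "video": [0.0, 0.0, 1.0, 0.0],
}
query = [0.5, 0.0, 2.0, 0.0]
weights, fused = fuse(query, modalities)
```

The weights always sum to one, so they can be read as percentages in the same way as the 70%/30% and 60%/40% splits in the examples above.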
[0024] When constructing a structured knowledge base and its heterogeneous graph in S4, a structured knowledge network is built by integrating multi-source heterogeneous data. The network's node types include text nodes (e.g., station regulations, reports of past events), image nodes (e.g., images captured by surveillance cameras, records of violations), and video nodes (e.g., video footage of violations). Nodes are linked across modalities through dynamic edge relationships, specifically encompassing co-occurrence relationships (e.g., "passenger falls - emergency plan generated" appearing simultaneously in emergency reports and images) and causal relationships (e.g., the logical chain of "passenger smoking in waiting area - smoke alarm triggered"). This heterogeneous graph structure not only preserves the original features of single-modal data but also provides interpretable cross-modal association paths for subsequent collaborative reasoning by explicitly modeling multimodal edge relationships.
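One possible in-memory representation of such a heterogeneous graph is sketched below. It assumes three node modalities and two edge relations as named in the text, and stores edges as directed pairs; the class `HeteroGraph`, its method names, and the node identifiers are hypothetical.

```python
from dataclasses import dataclass, field

MODALITIES = {"text", "image", "video"}
RELATIONS = {"co_occurrence", "causal"}


@dataclass
class HeteroGraph:
    modality: dict = field(default_factory=dict)  # node id -> modality
    edges: dict = field(default_factory=dict)     # node id -> [(relation, neighbour id)]

    def add_node(self, node_id, modality):
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        self.modality[node_id] = modality

    def add_edge(self, src, dst, relation):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.modality or dst not in self.modality:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.setdefault(src, []).append((relation, dst))

    def neighbours(self, node_id, relation=None):
        """Neighbours of node_id, optionally filtered by relation type."""
        return [dst for rel, dst in self.edges.get(node_id, [])
                if relation is None or rel == relation]


graph = HeteroGraph()
graph.add_node("img_smoking", "image")
graph.add_node("vid_smoking", "video")
graph.add_node("report_smoke_alarm", "text")
graph.add_edge("img_smoking", "vid_smoking", "co_occurrence")
graph.add_edge("img_smoking", "report_smoke_alarm", "causal")
causal_targets = graph.neighbours("img_smoking", relation="causal")
```

Keeping the relation type on each edge is what makes the association paths interpretable, since a traversal can report why two nodes are linked and not only that they are.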
[0025] When a user inputs an operation and maintenance query in S5, cross-modal retrieval is performed based on the comprehensive embedding vector output by the self-attention network and the cosine similarity of the vector. Following the modal transformation order of image node → video node → text node, a hidden association path is searched from the heterogeneous graph. The multimodal data and recommended solutions on the association path are verified in a closed loop. Structured data such as the event occurrence scenario and time are automatically integrated to generate the final solution plan.
[0026] When updating heterogeneous graph links based on new operation and maintenance data in S6, the new operation and maintenance data is classified, and the new rule-based text data with timeliness and authority is completely updated and replaced, and new violations and dangerous behaviors are defined; similarity calculation and comparison analysis are performed on regular operation and maintenance data, and highly similar content is automatically integrated and classified.
[0027] Example 2: A multimodal collaborative operation and maintenance method for public transportation hubs, comprising the following steps:
S1. Collect multiple data modalities from the operation and maintenance of public transportation hubs as multi-source heterogeneous data, and perform data preprocessing to obtain recommended solutions for violations or dangerous behaviors;
S2. Construct a multimodal contrastive learning framework for joint training, and perform deep semantic matching on the multi-source heterogeneous data and the recommended solutions;
S3. Construct a self-attention network and dynamically allocate the weight of each modality in the multi-source heterogeneous data;
S4. Integrate the multi-source heterogeneous data and the recommended solutions to build a structured knowledge base and its heterogeneous graph;
S5. When an operation and maintenance query is input, construct a cross-modal reasoning link and output the final solution plan;
S6. Update the heterogeneous graph links based on new operation and maintenance data.
[0028] This embodiment also provides a multimodal collaborative public transportation hub operation and maintenance system. Using the aforementioned public transportation hub operation and maintenance method, it includes a data acquisition and analysis module that collects multi-source heterogeneous data and generates recommended solutions for violations or dangerous behaviors; a joint training module that performs deep semantic matching on the multi-source heterogeneous data collected by the data acquisition and analysis module and the recommended solutions; a modality weight allocation module that dynamically allocates weights to each modality of the multi-source heterogeneous data; a heterogeneous graph construction module that constructs a knowledge base and its heterogeneous graph based on the multi-source heterogeneous data and recommended solutions after deep semantic matching; and a collaborative update module that collects new operation and maintenance data in real time to update the heterogeneous graph.
[0029] The multi-source heterogeneous data in S1 includes maintenance data in three modalities: text, images, and videos. Data preprocessing includes structured and unstructured processing. For structured data processing, the system uses OCR (Optical Character Recognition) technology to perform high-precision parsing of documents such as subway equipment parameter tables and various rule manuals, automatically extracting key data such as equipment start-up time and equipment inspection time. For unstructured data processing, for violation detection, effective scene analysis is conducted based on the YOLO model and a multimodal large-scale model to accurately identify behaviors such as "illegal smoking," "climbing over turnstiles," and "strollers on escalators." The system also uses the BERT model to extract key entities such as "smoking" and "strollers," and combines this with equipment information to extract attribute information such as "shooting location" and "shooting time," providing guidance for violation detection and solution generation in the station hall. For intelligent passenger guidance in subway stations, effective scene analysis is conducted based on a multimodal large-scale model to extract key information such as "passenger flow" and "escalator operation status," and combined with equipment information such as "train operation status," providing passengers with comprehensive travel plans and route selections.
[0030] During joint training in S2, the system leverages the CLIP-ViT architecture's multimodal contrastive learning framework to achieve deep semantic matching of text, image, and video data (using the generation of emergency plans for elderly people falling on subway escalators as a specific example): the input layer simultaneously receives three types of heterogeneous data; the text layer imports historical case reports of passenger falls and corresponding emergency plans; the image layer accesses screenshots of passenger falls in the escalator area obtained from monitoring equipment; and the video layer loads video frame sequences before and after the passenger falls. The core model consists of a dual-tower Vision Transformer and a Transformer encoder. The image encoder segments the input image into 16×16 pixel blocks and embeds positional codes, utilizing a multi-layer self-attention mechanism to extract spatial distribution features of high-temperature areas; the text encoder performs word segmentation and word embedding on the alarm text, capturing semantic connections between keywords such as "fall" and "escalator" through the Transformer's contextual modeling function. The two encoders achieve intermodal parameter collaboration through a weight-sharing mechanism. In the training phase, a contrastive learning approach is employed. The InfoNCE loss function reduces the distance between text-image-video embedding vectors for the same equipment malfunction event, while simultaneously expanding the modal distance between different events. For example, the joint embedding vector of the passenger fall image, passenger information, and historical fall prevention plans corresponding to "Someone fell at xx subway station, xx exit" is controlled to be within a neighboring vector space. 
The output layer ultimately generates a 128-dimensional joint embedding vector (e.g., [0.12, -0.05, 0.87, ...]), which can be used for cross-modal retrieval through cosine similarity calculation. When a new input text "A child fell at xx subway station, Exit A" is received, the system can quickly match semantically similar fall cases and solution records in the historical database, providing subway maintenance personnel with more fault diagnosis references.
[0031] In the dynamic weight allocation process of S3, the dynamic modal weight allocation mechanism achieves scenario-based fusion of cross-modal information by introducing a self-attention network. This mechanism takes the semantic features of the query scenario as input and calculates the correlation strength between each modality and the query intent through a multi-head attention layer. For example, in the smart Q&A scenario at a train station, when a user enters the text query "Which line from station A to station B has less passenger flow?", the system calculates the attention weight and finds that keywords such as "station" and "line" in the text modality highly match the station and line knowledge base. At this time, the weight of the text modality is automatically increased to 70%. At the same time, by associating visual features such as "passenger flow" and "station status" from multiple surveillance cameras through cross-modal attention heads, the video modality is given a weight of 30% to assist in diagnosis. In the abnormal behavior monitoring scenario, for the query requirement of "contingency plan for elderly people falling on escalators", the system uses the attention mechanism to identify that the temporal features of the elderly person's fall process in the video modality are strongly correlated with the query intent, so the weight of the video modality is set to 60%. Simultaneously, static information (such as time and location) of the fall area in the text modality is assigned a weight of 40%. Finally, a comprehensive embedding vector containing dynamic temporal and static spatial features is generated through weighted fusion and used for subsequent pre-defined retrieval. This mechanism achieves real-time adjustment of modality weights through dynamic gating units. 
Its training process adopts a contrastive learning strategy to minimize the distance difference between embedding vectors of different modality combinations under the same query, while maximizing the difference in modality weight distribution between different queries. This ensures that modality contribution can still be accurately allocated in complex scenarios (such as mixed queries between security checkpoints and waiting areas), providing scene-adaptive semantic representation capabilities for the multimodal joint embedding space.
[0032] When constructing a structured knowledge base and its heterogeneous graph in S4, a structured knowledge network is built by integrating multi-source heterogeneous data. The network's node types include text nodes (e.g., station regulations, reports of past events), image nodes (e.g., images captured by surveillance cameras, records of violations), and video nodes (e.g., video footage of violations). Nodes are linked across modalities through dynamic edge relationships, specifically encompassing co-occurrence relationships (e.g., "passenger falls - emergency plan generated" appearing simultaneously in emergency reports and images) and causal relationships (e.g., the logical chain of "passenger smoking in waiting area - smoke alarm triggered"). This heterogeneous graph structure not only preserves the original features of single-modal data but also provides interpretable cross-modal association paths for subsequent collaborative reasoning by explicitly modeling multimodal edge relationships.
[0033] When a user inputs an operation and maintenance query in S5, the system performs cross-modal retrieval based on the cosine similarity of the comprehensive embedding vector output by the self-attention network. Following the modal transformation order image node → video node → text node, it searches the heterogeneous graph for a hidden association path. The system then performs closed-loop verification on the multimodal data and recommended solutions along this path and automatically integrates structured data, such as the scenario and time of the event, to generate a final solution. For example, when the input query is "emergency plan for an elderly person falling on an escalator," the system uses a meta-path algorithm to construct the cross-modal reasoning link automatically: first, it retrieves an image of the elderly person falling from the routine surveillance images; then, based on this image, it retrieves the complete video recordings before and after the fall; finally, it combines text information (such as escalator number, time, and location) with historical passenger fall reports to form a complete chain of evidence. Based on closed-loop verification of this multimodal evidence, the system generates the final inference result "emergency plan for elderly people falling" and automatically integrates structured data such as the fall scenario and time, giving staff an actionable solution.
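The image node → video node → text node traversal can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is a pair of plain dictionaries, embeddings are 2-dimensional toy vectors, `metapath_search` is a hypothetical name, and the closed-loop verification step is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def metapath_search(query_vec, nodes, edges,
                    metapath=("image", "video", "text"), top_k=1):
    """Return every path whose node types follow `metapath`.

    Start nodes are the top_k nodes of the first type, ranked by cosine
    similarity to the query; later hops follow graph edges by node type.
    """
    starts = sorted(
        (n for n, info in nodes.items() if info["type"] == metapath[0]),
        key=lambda n: cosine(query_vec, nodes[n]["vec"]),
        reverse=True,
    )[:top_k]
    paths = [[s] for s in starts]
    for wanted_type in metapath[1:]:
        paths = [path + [nxt]
                 for path in paths
                 for nxt in edges.get(path[-1], [])
                 if nodes[nxt]["type"] == wanted_type]
    return paths


# Toy graph (assumed data): one fall incident and one unrelated image.
nodes = {
    "img_fall": {"type": "image", "vec": [0.9, 0.1]},
    "img_smoke": {"type": "image", "vec": [0.1, 0.9]},
    "vid_fall": {"type": "video", "vec": [0.8, 0.2]},
    "txt_report": {"type": "text", "vec": [0.7, 0.3]},
    "txt_plan": {"type": "text", "vec": [0.6, 0.4]},
}
edges = {
    "img_fall": ["vid_fall", "txt_report"],
    "vid_fall": ["txt_report", "txt_plan"],
}
paths = metapath_search([1.0, 0.0], nodes, edges)
```

Each returned path is one candidate chain of evidence; the direct `img_fall` to `txt_report` edge is skipped at the second hop because the meta-path requires a video node there.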
[0034] When updating the heterogeneous graph links based on new operation and maintenance data in S6, the new data is first categorized. New rule-based text data that is timely and authoritative replaces the existing content in full, and new violations and dangerous behaviors are defined. Routine operation and maintenance data undergoes similarity calculation and comparison, and highly similar content is automatically merged and classified. Taking equipment operation and maintenance management at a subway station as an example, when the system receives new data it immediately triggers the link update mechanism and updates the affected node contents in an orderly way, applying a different strategy to each data type. For repetitive or similar text data (such as violation reports generated in daily station operation and passenger complaint records), the system uses natural language processing and a preset similarity algorithm to compare the text contents, then automatically merges and classifies highly similar content. For timely and authoritative text data (such as the latest station management regulations and important notices issued by higher authorities), the system replaces the entire content directly. For new station equipment and facilities (such as newly installed security screening equipment and ticket vending machines) and for special cases such as newly defined violations, the system performs a full information update on the relevant nodes to keep the operation and maintenance data accurate and current. The entire update process is automated, which improves the efficiency and accuracy of subway station operation and maintenance management.
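The differentiated update strategy can be sketched as a small dispatcher. The source does not name the similarity algorithm or a threshold, so this sketch substitutes character-level matching from Python's standard `difflib.SequenceMatcher` and an assumed threshold of 0.8; the record fields (`kind`, `key`, `text`) and the store layout are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for the preset algorithm)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def apply_update(store, record, threshold=0.8):
    """Route one new record to the strategy that matches its category."""
    kind = record["kind"]
    if kind == "authoritative":
        # Regulations and official notices: replace the entire content.
        store["rules"][record["key"]] = record["text"]
        return "replaced"
    if kind == "new_definition":
        # New equipment or newly defined violations: full node update.
        store["definitions"][record["key"]] = record["text"]
        return "defined"
    # Routine data: merge into the most similar existing group, if close enough.
    best_group, best_score = None, 0.0
    for group in store["reports"]:
        score = similarity(group[0], record["text"])
        if score > best_score:
            best_group, best_score = group, score
    if best_group is not None and best_score >= threshold:
        best_group.append(record["text"])
        return "merged"
    store["reports"].append([record["text"]])
    return "new_group"


store = {
    "rules": {"smoking_policy": "Smoking is allowed in designated rooms"},
    "definitions": {},
    "reports": [["Passenger smoking in waiting area near gate 3"]],
}
results = [
    apply_update(store, {"kind": "routine",
                         "text": "Passenger smoking in waiting area near gate 5"}),
    apply_update(store, {"kind": "routine",
                         "text": "Ticket machine 2 out of service"}),
    apply_update(store, {"kind": "authoritative", "key": "smoking_policy",
                         "text": "Smoking is prohibited in all indoor areas"}),
]
```

Comparing only against the first report in each group keeps the sketch short; a production system would more plausibly compare embedding vectors against a group centroid.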
[0035] Through its technical design, this invention overcomes the core shortcomings of existing technologies: reliance on external resources for cross-modal association, insufficient semantic coverage in image selection, and low efficiency in knowledge fusion. Its core advantage lies in a multimodal collaborative framework that supports dynamic semantic alignment and efficient reasoning. The method abandons the coarse association approach of traditional solutions, which relies on hyperlinks or simple text matching, and instead employs a vision-language joint embedding model (such as a Transformer-based cross-modal encoder). Through a contrastive learning mechanism, it achieves deep semantic alignment of modal features such as images, text, and videos. Even when external resources are missing or descriptions are ambiguous, it can still capture implicit associations between entities through intra-modal and inter-modal attention mechanisms. In the image selection stage, this invention introduces a semantic diversity preservation strategy based on graph neural networks, jointly modeling image features with the entity attributes and relationship networks in the knowledge graph. Community detection algorithms automatically identify and retain high-value images covering different perspectives and scenarios, while adversarial training enhances the model's robustness to noisy data. Furthermore, to address semantic conflicts in multimodal knowledge fusion, a dynamic weight allocation mechanism automatically adjusts the contribution of text, images, and videos based on the query scenario (e.g., for a contingency plan generated when an elderly person falls on an escalator, the synergistic effect of image and text is emphasized; for a fight in a waiting area, the correlation analysis between video and similar case reports is prioritized). This mechanism is combined with an incremental graph neural network to achieve online updates and version control of the knowledge base, balancing reasoning response speed against knowledge timeliness. Finally, by constructing a dynamic knowledge graph containing heterogeneous nodes (text paragraphs, image regions, operation frames, etc.) and multiple types of edges (co-occurrence, temporal sequence, causal relationships), this invention achieves end-to-end optimization from data access and modality alignment to reasoning applications, providing a more accurate, interpretable, and adaptive solution for knowledge management and decision-making in complex scenarios.
Claims
1. A multimodal collaborative operation and maintenance method for public transportation hubs, characterized in that it includes the following steps: S1. Collect multiple data modalities from the operation and maintenance of public transportation hubs as multi-source heterogeneous data, and perform data preprocessing to obtain recommended solutions for violations or dangerous behaviors; S2. Construct a multimodal contrastive learning framework for joint training, and perform deep semantic matching on the multi-source heterogeneous data and recommended solutions; S3. Construct a self-attention network and dynamically allocate the weights of each modality in the multi-source heterogeneous data; S4. Integrate the multi-source heterogeneous data and solutions to build a structured knowledge base and its heterogeneous graph; S5. When an operation and maintenance query is input, construct a cross-modal reasoning link and output the final solution; S6. Update the heterogeneous graph links based on new operation and maintenance data.
2. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that the multi-source heterogeneous data in S1 includes operation and maintenance data in three modalities: text, image, and video. Data preprocessing includes structured and unstructured processing. Structured processing includes parsing the operating parameters of public transportation hub equipment and the text data of the rule manual. Unstructured processing includes using object detection models or multimodal large models to identify and analyze violations or dangerous passenger behaviors captured in image and surveillance video data during the operation of public transportation hubs, analyzing historical solutions, and extracting key entities of violations or dangerous passenger behaviors using natural language processing models. The location and time of the event are extracted in combination with the information collected from the equipment. During scene analysis, recommended solutions are also generated for violations or dangerous passenger behaviors within the public transportation hub.
3. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that the multimodal contrastive learning framework built in S2 includes an input layer, a multimodal encoder, and an output layer. The input layer simultaneously receives three types of heterogeneous data: text, images, and videos. The multimodal encoder uses a dual-tower structure comprising an image encoder (Vision Transformer) and a text encoder (Transformer). The output layer generates a 128-dimensional joint embedding vector, which is used for cross-modal retrieval through cosine similarity calculation to quickly match near-time cases and historical solution records in the historical database.
4. The multimodal collaborative public transportation hub operation and maintenance method according to claim 3, characterized in that the image encoder of the multimodal contrastive learning framework segments the input image and embeds positional encodings, and uses a multi-layer self-attention mechanism to extract the spatial distribution features of high-temperature areas; the text encoder performs word segmentation and word embedding on the alarm text of violations and dangerous behaviors to capture the semantic relationships between keywords.
5. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1 or 3, characterized in that, when the multimodal contrastive learning framework in S2 is jointly trained, a contrastive learning method uses the InfoNCE loss function to reduce the distance between text-image-video embedding vectors under the same device failure event, while expanding the modal distance between different violations and dangerous behaviors.
6. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that, in S3, the self-attention network takes the semantic features of the query scenario as input, calculates the association strength between each modality and the query intent through a multi-head attention layer, and generates a comprehensive embedding vector containing dynamic temporal and static spatial features through weighted fusion.
7. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that, in S4, the node types of the heterogeneous graph include text nodes, image nodes, and video nodes; nodes achieve cross-modal associations through dynamic edge relationships, which specifically cover co-occurrence relationships and causal relationships.
8. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that, when a user inputs an operation and maintenance query in S5, cross-modal retrieval is performed based on the cosine similarity of the comprehensive embedding vector output by the self-attention network. Following the modal transformation order of image node → video node → text node, a hidden association path is searched for in the heterogeneous graph. The multimodal data and recommended solutions on the association path are then verified in a closed loop, and event occurrence scenarios and times are automatically integrated to generate a final solution.
9. The multimodal collaborative public transportation hub operation and maintenance method according to claim 1, characterized in that S6 categorizes the new operation and maintenance data, updates and replaces new rule-based text data that is timely and authoritative, defines new violations and dangerous behaviors, performs similarity calculation and comparison analysis on routine operation and maintenance data, and automatically integrates and classifies highly similar content.
10. A multimodal collaborative public transportation hub operation and maintenance system, using the multimodal collaborative public transportation hub operation and maintenance method according to any one of claims 1-9, characterized in that the system includes a data acquisition and analysis module, which collects multi-source heterogeneous data and generates recommended solutions for violations or dangerous behaviors; a joint training module, which performs deep semantic matching on the multi-source heterogeneous data and recommended solutions collected by the data acquisition and analysis module; a modality weight allocation module, which dynamically allocates weights to each modality of the multi-source heterogeneous data; a heterogeneous graph construction module, which constructs a knowledge base and its heterogeneous graph based on the multi-source heterogeneous data and recommended solutions after deep semantic matching; and a collaborative update module, which collects new operation and maintenance data in real time to update the heterogeneous graph.
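As an illustration of the InfoNCE objective named in claim 5, the following minimal sketch computes the loss for a single anchor embedding. The temperature of 0.07, the 2-dimensional toy vectors, and the function name `info_nce` are assumptions for illustration; the claims do not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor.

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) )
    where s is cosine similarity and t is the temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings (assumed): a text anchor, an image of the same event,
# and embeddings of two unrelated events.
text_anchor = [1.0, 0.0]
same_event_image = [0.9, 0.1]
other_events = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = info_nce(text_anchor, same_event_image, other_events)
misaligned_loss = info_nce(text_anchor, other_events[0],
                          [same_event_image, other_events[1]])
```

The loss is near zero when the embedding of the same event is the closest candidate and grows when an unrelated event is treated as the positive, which is the pull-together and push-apart behavior the claim describes.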
Citation Information
Patent Citations
Urban rail transit intelligent operation and maintenance system based on knowledge graph
CN119722421A