Collaborative perception spatial alignment method independent of external positioning and global clock technology

By using deep convolutional neural networks and the BEVGlue method, cooperative sensing spatial alignment that does not rely on external positioning and global clocks is achieved, solving the problems of high cost and weak anti-interference ability in existing technologies, and realizing efficient and secure cooperative sensing.

CN118505752B · Active · Publication Date: 2026-06-26 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2024-06-05
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing collaborative sensing methods rely on high-end positioning equipment and have weak anti-interference capabilities, resulting in high costs and decreased accuracy during malicious attacks, making normal collaborative sensing impossible.

Method used

A deep convolutional neural network is used to encode the observed data into a bird's-eye view feature map. The BEVGlue method is used to calculate the relative pose for spatial alignment. Spatial alignment between agents is achieved through graph modeling and maximum common subgraph search, avoiding dependence on external localization and global clock.

Benefits of technology

It achieves efficient collaborative awareness in noisy and malicious attack environments, reduces system costs, improves robustness and universality, and ensures collaborative security.

✦ Generated by Eureka AI based on patent content.

Figure CN118505752B_ABST

Abstract

The application provides a cooperative perception spatial alignment method independent of external positioning and global clock technology, comprising the following steps: S11, each agent encodes its own observed data, converts the data into a bird's-eye view feature map using a deep convolutional neural network, and completes target detection and tracking to obtain object detection boxes and tracking results; each agent sends this communication information to all of its collaborators; S12, each agent calculates a relative pose using the BEVGlue method from the received communication information, completing spatial alignment; S13, each agent transforms the spatially aligned bird's-eye view feature maps and detection boxes into its own coordinate system; and S14, each agent aggregates its own bird's-eye view feature map with the coordinate-transformed bird's-eye view feature maps and detects bounding boxes from the aggregated feature map. Based on the BEVGlue method, the application enables cooperative perception under spatial error and effectively alleviates the influence of spatial noise on cooperative perception.

Description

Technical Field

[0001] This invention relates to the field of collaborative sensing technology, and more specifically, to a collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology.

Background Technology

[0002] Collaborative perception can greatly enhance the perception capabilities of multiple agents by sharing information among them. However, to achieve collaborative perception in a consistent spatial coordinate system, most existing methods rely on high-end positioning equipment to provide precise location, making these methods typically expensive and difficult to scale. Furthermore, their accuracy degrades or they become unusable when affected by noise or malicious attacks. These drawbacks fundamentally limit the practical application of collaborative perception.

[0003] Current spatial alignment methods and systems have the following drawbacks. Regarding positioning errors, existing technologies mainly improve perception performance by correcting the inaccurate collaborative information caused by insufficient precision of the positioning hardware. However, these technologies can only resolve minor positional shifts and still rely on relatively accurate external positioning data for coarse positioning. Regarding resistance to malicious attacks, existing technologies depend heavily on external positioning data, which gives false data a point of entry: attacks on the positioning data lead to spatial alignment errors and prevent normal collaborative sensing.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, the purpose of this invention is to provide a collaborative sensing spatial alignment method that does not rely on external positioning and global clock technologies.

[0005] According to one aspect of the present invention, a cooperative sensing spatial alignment method that does not rely on external positioning and global clock technology is provided, comprising:

[0006] S11, each agent encodes the data observed by itself, uses a deep convolutional neural network to convert the encoded observation data into a bird's-eye view feature map and completes target detection and tracking, and obtains object detection boxes and tracking results as communication information; each agent sends the communication information to all collaborating objects;

[0007] S12, each intelligent agent calculates the relative pose using the BEVGlue method based on the received communication information to complete spatial alignment;

[0008] S13, each agent transforms the spatially aligned bird's-eye view features and detection boxes to their respective coordinate systems;

[0009] S14, the agent aggregates its own bird's-eye view feature map and the bird's-eye view feature map after coordinate transformation, and performs bounding box detection based on the aggregated feature map.

[0010] Preferably, in step S11, each agent encodes the data observed by itself, uses a deep convolutional neural network to convert the encoded observation data into a bird's-eye view feature map, and completes target detection and tracking to obtain object detection boxes and tracking results, specifically as follows:

[0011] $(F_i^t, B_i^t) = f_{\mathrm{detection\&tracking}}(X_i^t)$

[0012] where $X_i^t$ represents the raw data collected by agent i at time t; $F_i^t$ represents the bird's-eye view feature map corresponding to the raw data; $B_i^t$ represents the object detection boxes and tracking results corresponding to the raw data; and $f_{\mathrm{detection\&tracking}}$ represents the single-agent detection and tracking module.

[0013] Preferably, in step S12, the relative pose is calculated using the BEVGlue method to complete spatial alignment, specifically as follows:

[0014] $\hat{\xi}_{j \to i}^{t} = f_{\mathrm{BEVGlue}}(B_i^t, B_j^t)$

[0015] where $B_i^t$ represents the detection boxes and tracking results generated by agent i at time t; $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j at the exact time t; and $f_{\mathrm{BEVGlue}}$ represents the BEVGlue spatial alignment method.

[0016] Preferably, the BEVGlue method includes:

[0017] S12.1: The agent extracts the shape of each detection box, the tracking results and the relative position information, and obtains the detected-target feature graph through graph modeling;

[0018] S12.2: Perform a maximum common subgraph search on the detected-target feature graphs to obtain the maximum common subgraph and its score;

[0019] S12.3: Based on the maximum common subgraph and score in S12.2, calculate the relative transformation to obtain the relative pose differences between agents.

[0020] Preferably, in S12.1, obtaining the detected-target feature graph through graph modeling is specifically:

[0021] $G_i^t = (\{v_m\}, \{e_{mn}\})$, with $v_m = (l_m, w_m, \tau_m)$ and $e_{mn} = (d_{mn}, \theta_{mn}, \Delta\phi_{mn})$

[0022] where each target detected by agent i at time t constitutes a node; $v_m$ represents the features of node m; $e_{mn}$ represents the features of the edge connecting nodes m and n; $(l_m, w_m)$ represents the length and width of the detected target; $\tau_m$ represents the tracking result of the detected target; $d_{mn}$ represents the distance between the two detected targets; $\theta_{mn}$ represents the angle between the two detected targets; and $\Delta\phi_{mn}$ represents the difference between the orientation angles of the two detected targets;

[0023] m and n are loop variables; iterating them over all nodes forms the detected-target feature graph $G_i^t$ of agent i at time t.

[0024] Preferably, in S12.2, the maximum common subgraph search is performed on the detected-target feature graphs, specifically as follows:

[0025] $M_{ij}^{t} = f_{\mathrm{MCS}}(G_i^t, G_j^t, M_{ij}^{t-1})$

[0026] where $M_{ij}^{t}$ represents the maximum common subgraph between agents i and j at time t; $M_{ij}^{t-1}$ represents the maximum common subgraph between agents i and j at time t-1; $G_i^t$ and $G_j^t$ represent the detected-target feature graphs of agents i and j; and $f_{\mathrm{MCS}}$ represents the maximum common subgraph search model.

[0027] Preferably, the implementation process of the maximum common subgraph search model $f_{\mathrm{MCS}}$ includes:

[0028] First, nodes with matching potential are selected from the two detected-target feature graphs to form an initial matching list. At the initial moment, the similarity between nodes is calculated, and two nodes whose similarity exceeds a set threshold are considered a potential matching pair; at non-initial moments, the initial matching list is obtained from the maximum common subgraph and the target tracking results of the previous time step.

[0029] Next, for each initial matching pair in the initial matching list, the edge similarity between that pair and every other potential matching pair is computed; if the edge similarity is greater than a set threshold, the other pair is added to the matching list;

[0030] Finally, duplicate common subgraphs and subgraphs with no more than a threshold number of nodes are removed, and a confidence score is calculated for each remaining potential match, where C represents the size of a subgraph; among all subgraphs with the largest C, the one with the highest confidence score is the maximum common subgraph.

[0031] Preferably, in step S12.3, based on the maximum common subgraph and score in S12.2, a relative transformation is calculated to obtain the relative pose differences between agents, including:

[0032] $(R^*, t^*) = \arg\min_{R,\,t} \sum_{i=1}^{C} \lVert R\,p_i + t - q_i \rVert_2^2$

[0033] where R is a two-dimensional rotation matrix, t is a two-dimensional translation vector, $p_i$ and $q_i$ are the matched point pairs, C represents the size (number of nodes) of the maximum common subgraph, and $R^*$ and $t^*$ constitute the relative pose estimate.

[0034] Preferably, in step S13, each agent transforms the spatially aligned bird's-eye view features and detection boxes to its own coordinate system, specifically as follows:

[0035] $F_{j \to i}^{t} = f_{\mathrm{transform}}(F_j^t, \hat{\xi}_{j \to i}^{t})$

[0036] where $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j; $F_j^t$ represents the bird's-eye view feature of agent j at the exact time t; $F_{j \to i}^{t}$ represents the bird's-eye view feature of agent j expressed in the spatial reference frame of agent i under that relative pose; and $f_{\mathrm{transform}}$ represents the coordinate transformation.

[0037] Preferably, in step S14, the agent aggregates its own bird's-eye view feature map and the bird's-eye view feature map after coordinate transformation, and performs bounding box detection based on the aggregated feature map, specifically:

[0038] $\hat{B}_i^{t_i} = f_{\mathrm{decoder}}(H_i^{t_i})$

[0039] where $\hat{B}_i^{t_i}$ represents the bounding box detection result of agent i at time $t_i$ of its own clock; $f_{\mathrm{decoder}}$ represents the decoding process, performed by a neural network, from the feature map to the detection result; and $H_i^{t_i}$ represents the aggregated features.

[0040] Compared with the prior art, the embodiments of the present invention have at least one of the following beneficial effects:

[0041] The cooperative sensing spatial alignment method in this embodiment of the invention, which does not rely on external positioning and global clock technology, is based on the BEVGlue method, which promotes cooperative sensing under spatial error conditions and effectively mitigates the impact of spatial noise on cooperative sensing.

[0042] The collaborative sensing spatial alignment method in this embodiment of the invention, which does not rely on external positioning and global clock technology, focuses on the pose and geometric relationships of objects during maximum common subgraph matching. Compared with traditional methods such as ICP and TEASER, it requires less communication and is more robust.

[0043] The collaborative sensing spatial alignment method in this embodiment of the invention, which does not rely on external positioning and global clock technology, utilizes relative position and shape during the maximum subgraph matching process. It is unaffected by the observation angle, has high universality, and is low in cost.

[0044] The collaborative sensing spatial alignment method in this embodiment of the invention, which does not rely on external positioning and global clock technology, can automatically determine whether the current collaboration between multiple agents is safe, promptly cut off high-risk collaboration, and ensure the security of collaboration.

Attached Figure Description

[0045] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0046] Figure 1 is a flowchart of a collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology, according to an embodiment of the present invention;

[0047] Figure 2 is a schematic block diagram of a preferred embodiment of the collaborative sensing spatial alignment method of the present invention that does not rely on external positioning and global clock technology.

Detailed Implementation

[0048] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention. These all fall within the scope of protection of the present invention.

[0049] Figure 1 and Figure 2 show a collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology, according to an embodiment of the present invention. The main process is as follows:

[0050] S11, each agent encodes the data observed by itself, uses a deep convolutional neural network to convert the encoded observation data into a bird's-eye view feature map and completes target detection and tracking, and obtains object detection boxes and tracking results as communication information; each agent sends the communication information to all collaborating objects;

[0051] S12, each agent calculates its relative pose using the BEVGlue method based on the received communication information, and completes spatial alignment;

[0052] S13, each agent transforms the spatially aligned bird's-eye view features and detection boxes to their respective coordinate systems;

[0053] S14, the agent aggregates its own bird's-eye view feature map and the bird's-eye view feature map after coordinate transformation, and performs bounding box detection based on the aggregated feature map.

[0054] In the above embodiment of the present invention, each agent encodes its own observation data, transforms it into a bird's-eye view feature map using a deep convolutional neural network, and completes target detection. Once the object detection boxes are obtained, the agents exchange them, perform subgraph matching on the graphs built from the detected targets, and estimate the spatial difference between agents to achieve spatial synchronization.

[0055] In a preferred embodiment of the present invention, in step S11, each agent encodes its observed data, converts it into a bird's-eye view feature map using a deep convolutional neural network, and performs object detection to obtain object detection boxes. The agents then send this communication information to all collaborating objects, and multi-agent collaboration begins. Specifically, this can be described as follows:

[0056] $(F_i^t, B_i^t) = f_{\mathrm{detection\&tracking}}(X_i^t)$

[0057] where $X_i^t$ represents the raw data collected by agent i at time t; $F_i^t$ represents the feature map corresponding to the raw data; and $B_i^t$ represents the detection boxes corresponding to the raw data.

[0058] In a preferred embodiment of the present invention, in step S12, each agent calculates its relative pose using the BEVGlue method based on the received communication information, thus completing spatial alignment. This can be expressed by the following expression:

[0059] $\hat{\xi}_{j \to i}^{t} = f_{\mathrm{BEVGlue}}(B_i^t, B_j^t)$

[0060] where $B_i^t$ represents the detection boxes and tracking results generated by agent i at time t; $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j at the exact time t; and $f_{\mathrm{BEVGlue}}$ represents the BEVGlue spatial alignment method. The BEVGlue spatial alignment method uses visual features observed by the agent, including relative distances, angles, and high-dimensional features of objects, and models these features as a graph structure. Information is then integrated through graph signal processing or graph neural networks to improve representation capability. The BEVGlue spatial alignment method includes the following steps:

[0061] S12.1: The agent extracts the shape of each detection box, the tracking results and the relative position information, and obtains the detected-target feature graph through graph modeling.

[0062] S12.2: Perform a maximum common subgraph search on the detected-target feature graphs from S12.1: obtain candidate matching node pairs from the tracking results and shape attributes, then obtain the maximum common subgraph and its score from the relative positional relationships between the detection boxes.

[0063] S12.3: Based on the maximum common subgraph and score in S12.2, calculate the relative transformation to obtain the relative pose differences between agents.

[0064] In a preferred embodiment, in implementing S12.1, the target graph modeling operation is performed using the following expression:

[0065] $G_i^t = (\{v_m\}, \{e_{mn}\})$, with $v_m = (l_m, w_m, \tau_m)$ and $e_{mn} = (d_{mn}, \theta_{mn}, \Delta\phi_{mn})$

[0066] where each target detected by agent i at time t constitutes a node; $v_m$ represents the features of node m; $e_{mn}$ represents the features of the edge connecting nodes m and n; $(l_m, w_m)$ represents the length and width of the detected target; $\tau_m$ represents the tracking result of the detected target; $d_{mn}$ represents the distance between the two detected targets; $\theta_{mn}$ represents the angle between the two detected targets; and $\Delta\phi_{mn}$ represents the difference between the orientation angles of the two detected targets.
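
The graph modeling of S12.1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Box` fields, the use of a track identifier as the tracking result, and the choice to measure the angle between two targets relative to the first target's heading (so that edge features do not depend on the observing agent's frame) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    x: float         # box center in the observing agent's own frame
    y: float
    length: float
    width: float
    yaw: float       # orientation angle in radians
    track_id: int    # hypothetical stand-in for the tracking result


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def build_graph(boxes):
    """Build node and edge features from one agent's detections.

    Node m    -> (length, width, track_id)
    Edge m->n -> (distance, bearing of n seen from m's heading, yaw difference)
    """
    nodes = {m: (b.length, b.width, b.track_id) for m, b in enumerate(boxes)}
    edges = {}
    for m, bm in enumerate(boxes):
        for n, bn in enumerate(boxes):
            if m == n:
                continue
            dx, dy = bn.x - bm.x, bn.y - bm.y
            distance = math.hypot(dx, dy)
            bearing = wrap(math.atan2(dy, dx) - bm.yaw)
            yaw_diff = wrap(bn.yaw - bm.yaw)
            edges[(m, n)] = (distance, bearing, yaw_diff)
    return nodes, edges
```

Because every edge feature is expressed relative to the targets themselves, two agents observing the same pair of targets from different poses obtain the same edge features, which is what makes the later subgraph matching possible without a shared coordinate system.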

[0067] In a preferred embodiment, in implementing S12.2, the maximum common subgraph search is performed using the following expression:

[0068] $M_{ij}^{t} = f_{\mathrm{MCS}}(G_i^t, G_j^t, M_{ij}^{t-1})$

[0069] where $M_{ij}^{t}$ represents the maximum common subgraph between agents i and j at time t, $M_{ij}^{t-1}$ represents the maximum common subgraph at the previous time step, and $G_i^t$ and $G_j^t$ represent the target detection graphs of agents i and j.

[0070] Furthermore, the maximum common subgraph search model can be implemented in the following way:

[0071] Initialization: Nodes with matching potential are selected from the two target detection graphs to form an initial matching list. At the initial time step, the similarity between nodes is calculated, and two nodes whose similarity exceeds a certain threshold are considered a potential matching pair. At non-initial time steps, the initial matching list is obtained from the maximum common subgraph and the target tracking results of the previous time step.

[0072] Initial Match List Expansion: For each initial matching pair, the edge similarity between that pair and every other potential matching pair is computed. If the edge similarity is greater than a certain threshold, the other pair is added to the matching list.

[0073] Determining the Optimal Match: Due to the irregularities and uncertainties of real-world scenarios, the random initialization assumption may not be satisfied. Therefore, to ensure robust results, the optimal match is determined over all generated potential initial matches as follows. After duplicate common subgraphs and subgraphs with no more than a threshold number of nodes are removed, a confidence score is calculated for each remaining potential match, where C represents the size of a subgraph. Among all subgraphs with the largest C, the one with the highest confidence score is the maximum common subgraph.
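
The three stages above (initialization, expansion, optimal-match selection) can be sketched for the initial time step as follows. This is a simplified illustration under several assumptions: nodes carry only (length, width), each expansion checks consistency with the seed pair only, and the confidence score is taken to be the negative mean edge discrepancy because the original score formula is not recoverable from the text. The function names, tolerances, and the `make_graph` helper are hypothetical.

```python
import math


def make_graph(objs):
    """objs: list of (x, y, yaw, length, width) in one agent's own frame."""
    nodes = {m: (o[3], o[4]) for m, o in enumerate(objs)}
    edges = {}
    for m, (xm, ym, hm, _, _) in enumerate(objs):
        for n, (xn, yn, hn, _, _) in enumerate(objs):
            if m != n:
                edges[(m, n)] = (math.hypot(xn - xm, yn - ym),
                                 math.atan2(yn - ym, xn - xm) - hm,
                                 hn - hm)
    return nodes, edges


def node_similar(u, v, tol=0.5):
    """Two nodes are candidates when length and width both differ by < tol."""
    return abs(u[0] - v[0]) < tol and abs(u[1] - v[1]) < tol


def angle_gap(a, b):
    """Absolute difference between two angles, wrapped to [0, pi]."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)


def edge_error(e, f):
    """Discrepancy between two edges (distance, bearing, yaw difference)."""
    return abs(e[0] - f[0]) + angle_gap(e[1], f[1]) + angle_gap(e[2], f[2])


def max_common_subgraph(nodes_i, edges_i, nodes_j, edges_j,
                        edge_tol=0.5, node_threshold=2):
    # Stage 1: initial matching list from node similarity.
    seeds = [(a, b) for a in nodes_i for b in nodes_j
             if node_similar(nodes_i[a], nodes_j[b])]
    best, best_key = [], None
    for a, b in seeds:
        match, errors = [(a, b)], []
        # Stage 2: add pairs whose edge to the seed agrees in both graphs.
        for c, d in seeds:
            if any(c == m[0] or d == m[1] for m in match):
                continue
            err = edge_error(edges_i[(a, c)], edges_j[(b, d)])
            if err < edge_tol:
                match.append((c, d))
                errors.append(err)
        # Stage 3: drop small subgraphs, then prefer size C, then the score.
        if len(match) <= node_threshold:
            continue
        key = (len(match), -sum(errors) / max(len(errors), 1))
        if best_key is None or key > best_key:
            best, best_key = sorted(match), key
    return best
```

Sorting the result also makes duplicate subgraphs reached from different seeds compare equal. At non-initial time steps the seed list would instead come from the previous maximum common subgraph and the tracking results, which this sketch does not cover.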

[0074] The above embodiments focus on the pose and geometric relationships of objects during maximum common subgraph matching. Compared with traditional methods such as ICP and TEASER, they require less communication and are more robust. Because relative position and shape are used during maximum common subgraph matching, the result is unaffected by the observation angle, giving high universality at low cost.

[0075] In a preferred embodiment, S12.3 is performed, and the relative position transformation is calculated by solving the following optimization problem:

[0076] $(R^*, t^*) = \arg\min_{R,\,t} \sum_{i=1}^{C} \lVert R\,p_i + t - q_i \rVert_2^2$

[0077] where R is a two-dimensional rotation matrix, t is a two-dimensional translation vector, and $p_i$ and $q_i$ are the matched point pairs. The problem is solved with a simple and efficient singular value decomposition, without requiring robust algorithms for outlier removal. C represents the size (number of nodes) of the maximum common subgraph, and $R^*$ and $t^*$ constitute the relative pose estimate.
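
Paragraph [0077] states that the relative transformation is solved with singular value decomposition. For two-dimensional point pairs the same least-squares optimum has a closed form, which the sketch below uses in place of an SVD routine so that it needs no linear-algebra dependency. It assumes the objective is the usual registration error, the sum over matched pairs of the squared norm of R p_i + t - q_i; the function name `estimate_pose_2d` is hypothetical.

```python
import math


def estimate_pose_2d(p, q):
    """Least-squares rigid transform with R @ p_k + t close to q_k.

    p, q: equal-length lists of matched (x, y) points.
    Returns (theta, tx, ty), where theta is the rotation angle of R.
    """
    count = len(p)
    pcx = sum(x for x, _ in p) / count
    pcy = sum(y for _, y in p) / count
    qcx = sum(x for x, _ in q) / count
    qcy = sum(y for _, y in q) / count

    # Accumulate dot and cross products of the centered point pairs.
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(p, q):
        ax, ay = px - pcx, py - pcy
        bx, by = qx - qcx, qy - qcy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx

    # The angle maximizing cos(theta) * s_dot + sin(theta) * s_cross.
    theta = math.atan2(s_cross, s_dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # The translation maps the rotated centroid of p onto the centroid of q.
    tx = qcx - (cos_t * pcx - sin_t * pcy)
    ty = qcy - (sin_t * pcx + cos_t * pcy)
    return theta, tx, ty
```

Here the matched points would be the centers of the detection boxes paired by the maximum common subgraph, with p taken from one agent and q from the other.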

[0078] In a preferred embodiment of the present invention, in step S13, the coordinate transformation of the spatially and temporally aligned bird's-eye view feature map and detection boxes is performed by the following expression:

[0079] $F_{j \to i}^{t} = f_{\mathrm{transform}}(F_j^t, \hat{\xi}_{j \to i}^{t})$

[0080] where $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j; $F_j^t$ represents the bird's-eye view feature of agent j at the exact time t; and $F_{j \to i}^{t}$ represents the bird's-eye view feature of agent j expressed in the spatial reference frame of agent i under that relative pose.
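
For the detection boxes, the coordinate transformation of S13 amounts to applying the estimated rotation and translation to each box center and adding the rotation angle to each orientation. The sketch below illustrates this under assumed conventions: boxes are (x, y, yaw, length, width) tuples, the pose is given as (theta, tx, ty) mapping agent j's frame into agent i's frame, and the function name is hypothetical. Warping the bird's-eye view feature map would apply the same transform to the grid coordinates and resample, which is not shown.

```python
import math


def transform_boxes(boxes_j, theta, tx, ty):
    """Express agent j's boxes in agent i's frame.

    boxes_j: list of (x, y, yaw, length, width) in agent j's frame.
    (theta, tx, ty): assumed pose of frame j relative to frame i.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = []
    for x, y, yaw, length, width in boxes_j:
        new_x = cos_t * x - sin_t * y + tx
        new_y = sin_t * x + cos_t * y + ty
        # Keep the orientation within [-pi, pi).
        new_yaw = (yaw + theta + math.pi) % (2 * math.pi) - math.pi
        out.append((new_x, new_y, new_yaw, length, width))
    return out
```

Length and width are unchanged because the transform is rigid.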

[0081] In a preferred embodiment of the present invention, in step S14, the agent aggregates its own bird's-eye view feature map and the coordinate-transformed bird's-eye view feature maps, updates its bird's-eye view feature map with the aggregated result, and performs bounding box detection to obtain the detection results, where $f_{\mathrm{decoder}}$ represents the decoding process, performed by a neural network, from the feature map to the detection result. The agent can thereby automatically determine whether the current collaboration among multiple agents is safe, promptly cutting off high-risk collaborations and ensuring the security of the collaboration.
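
The text does not state how the safety decision is made or which aggregation operator is used. One plausible reading, assumed here purely for illustration, is that a collaborator is trusted only when the maximum common subgraph found with it is large enough, and that trusted feature maps are fused with the agent's own by an element-wise maximum. The function name, the `min_match` threshold, and the max fusion are all hypothetical.

```python
def aggregate(own, aligned_others, match_sizes, min_match=3):
    """Fuse an agent's own feature map with those of trusted collaborators.

    own:            2D list of floats (the agent's own feature map)
    aligned_others: list of 2D lists, already in the agent's own frame
    match_sizes:    size C of the common subgraph found with each collaborator
    Returns (fused_map, indices_of_collaborators_used).
    """
    fused = [row[:] for row in own]
    used = []
    for idx, (feat, size) in enumerate(zip(aligned_others, match_sizes)):
        if size < min_match:
            # Too little geometric agreement: treat as high risk and skip.
            continue
        used.append(idx)
        for r, row in enumerate(feat):
            for c, value in enumerate(row):
                fused[r][c] = max(fused[r][c], value)
    return fused, used
```

When no collaborator passes the gate, the function returns the agent's own feature map unchanged, so detection falls back to single-agent perception.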

[0082] The above embodiments use the lightweight bounding boxes and target tracking data in the collaborative information to construct a target detection graph, and perform maximum common subgraph matching on the node and edge features of that graph to achieve spatial alignment. This process is accomplished through target detection graph modeling and a maximum common subgraph search method, including: feature extraction and graph modeling from each agent's own target detection and tracking results; maximum common subgraph search for graph matching on the extracted bounding boxes and tracking information; and transformation calculation on the graph matching results to obtain the temporal and pose differences between agents. Therefore, this invention effectively mitigates the impact of spatial noise on collaborative perception.

[0083] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the essence of the present invention. The above preferred features can be used in any combination without conflict.

Claims

1. A collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology, characterized in that it includes: S11, each agent encodes the data observed by itself, uses a deep convolutional neural network to convert the encoded observation data into a bird's-eye view feature map and completes target detection and tracking, and obtains object detection boxes and tracking results as communication information; each agent sends the communication information to all collaborating objects; S12, each agent calculates the relative pose using the BEVGlue method based on the received communication information to complete spatial alignment; S13, each agent transforms the spatially aligned bird's-eye view features and detection boxes to its own coordinate system; S14, the agent aggregates its own bird's-eye view feature map and the bird's-eye view feature map after coordinate transformation, and performs bounding box detection based on the aggregated feature map; the BEVGlue method includes: S12.1: the agent extracts the shape of each detection box, the tracking results and the relative position information, and obtains the detected-target feature graph through graph modeling; S12.2: perform a maximum common subgraph search on the detected-target feature graphs to obtain the maximum common subgraph and its score; S12.3: based on the maximum common subgraph and score in S12.2, calculate the relative transformation to obtain the relative pose differences between agents; in S12.1, obtaining the detected-target feature graph through graph modeling is specifically: $G_i^t = (\{v_m\}, \{e_{mn}\})$, with $v_m = (l_m, w_m, \tau_m)$ and $e_{mn} = (d_{mn}, \theta_{mn}, \Delta\phi_{mn})$; where each target detected by agent i at time t constitutes a node; $v_m$ represents the features of node m; $e_{mn}$ represents the features of the edge connecting nodes m and n; $(l_m, w_m)$ represents the length and width of the detected target; $\tau_m$ represents the tracking result of the detected target; $d_{mn}$ represents the distance between the two detected targets; $\theta_{mn}$ represents the angle between the two detected targets; $\Delta\phi_{mn}$ represents the difference between the orientation angles of the two detected targets; m and n are loop variables, and all nodes are traversed to form the detected-target feature graph $G_i^t$ of agent i at time t; in S12.2, the maximum common subgraph search is performed on the detected-target feature graphs, specifically as follows: $M_{ij}^{t} = f_{\mathrm{MCS}}(G_i^t, G_j^t, M_{ij}^{t-1})$; where $M_{ij}^{t}$ represents the maximum common subgraph between agents i and j at time t; $M_{ij}^{t-1}$ represents the maximum common subgraph between agents i and j at time t-1; $G_i^t$ and $G_j^t$ represent the detected-target feature graphs of agents i and j; and $f_{\mathrm{MCS}}$ represents the maximum common subgraph search model.

2. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 1, characterized in that, in step S11, each agent encodes the data it observes, uses a deep convolutional neural network to convert the encoded observation data into a bird's-eye view feature map, and completes target detection and tracking to obtain object detection boxes and tracking results, specifically: $(F_i^t, B_i^t) = f_{\mathrm{detection\&tracking}}(X_i^t)$; where $X_i^t$ represents the raw data collected by agent i at time t; $F_i^t$ represents the bird's-eye view feature map corresponding to the raw data; $B_i^t$ represents the object detection boxes and tracking results corresponding to the raw data; and $f_{\mathrm{detection\&tracking}}$ represents the single-agent detection and tracking module.

3. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 1, characterized in that, in step S12, the relative pose is calculated using the BEVGlue method to complete spatial alignment, specifically as follows: $\hat{\xi}_{j \to i}^{t} = f_{\mathrm{BEVGlue}}(B_i^t, B_j^t)$; where $B_i^t$ represents the detection boxes and tracking results generated by agent i at time t; $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j at the exact time t; and $f_{\mathrm{BEVGlue}}$ represents the BEVGlue spatial alignment method.

4. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 1, characterized in that the implementation process of the maximum common subgraph search model $f_{\mathrm{MCS}}$ includes: first, nodes with matching potential are selected from the two detected-target feature graphs to form an initial matching list; at the initial moment, the similarity between nodes is calculated, and two nodes whose similarity exceeds a set threshold are considered a potential matching pair; at non-initial time steps, the initial matching list is obtained from the maximum common subgraph and the target tracking results of the previous time step; next, for each initial matching pair in the initial matching list, the edge similarity between that pair and every other potential matching pair is computed, and if the edge similarity is greater than a set threshold, the other pair is added to the matching list; finally, duplicate common subgraphs and subgraphs with no more than a threshold number of nodes are removed, and a confidence score is calculated for each remaining potential match, where C represents the size of a subgraph; among all subgraphs with the largest C, the one with the highest confidence score is the maximum common subgraph.

5. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 4, characterized in that S12.3, which calculates the relative transformation based on the maximum common subgraph and score in S12.2 to obtain the relative pose differences between agents, includes: $(R^*, t^*) = \arg\min_{R,\,t} \sum_{i=1}^{C} \lVert R\,p_i + t - q_i \rVert_2^2$; where R is a two-dimensional rotation matrix, t is a two-dimensional translation vector, $p_i$ and $q_i$ are the matched point pairs, C represents the number of nodes of the maximum common subgraph, and $R^*$ and $t^*$ constitute the relative pose estimate.

6. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 1, characterized in that, in S13, each agent transforms the spatially aligned bird's-eye view features and detection boxes to its own coordinate system, specifically: $F_{j \to i}^{t} = f_{\mathrm{transform}}(F_j^t, \hat{\xi}_{j \to i}^{t})$; where $\hat{\xi}_{j \to i}^{t}$ represents the estimated relative pose between agents i and j; $F_j^t$ represents the bird's-eye view feature of agent j at the exact time t; $F_{j \to i}^{t}$ represents the bird's-eye view feature of agent j expressed in the spatial reference frame of agent i under that relative pose; and $f_{\mathrm{transform}}$ represents the coordinate transformation.

7. The collaborative sensing spatial alignment method that does not rely on external positioning and global clock technology according to claim 1, characterized in that, in S14, the agent aggregates its own bird's-eye view feature map and the bird's-eye view feature map after coordinate transformation, and performs bounding box detection based on the aggregated feature map, specifically: $\hat{B}_i^{t_i} = f_{\mathrm{decoder}}(H_i^{t_i})$; where $\hat{B}_i^{t_i}$ represents the bounding box detection result of agent i at time $t_i$ of its own clock; $f_{\mathrm{decoder}}$ represents the decoding process, performed by a neural network, from the feature map to the detection result; and $H_i^{t_i}$ represents the aggregated features.