A highly interactive driving scene encoding and unified retrieval method

By adopting a highly interactive driving scene encoding method based on traffic rule priors and data-driven attention, the invention addresses the insufficient coverage of highly interactive scenarios and the redundancy of similar-scene retrieval in autonomous driving, and achieves efficient differentiation of interaction topology types and comprehensive coverage of scenario types.

CN122019846B · Active · Publication Date: 2026-06-19 · TONGJI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TONGJI UNIV
Filing Date
2026-04-14
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively cover highly interactive scenarios in autonomous driving, suffer from insufficient generalization, redundant similar retrieval, low efficiency in finding difficult examples, and difficulty in fairly comparing and switching between different retrieval frameworks under the same objective function.

Method used

We adopt a highly interactive driving scene encoding method based on traffic rule priors and data-driven attention. By constructing frame-by-frame dual features, multi-head temporal attention encoding, and fusing temporal patterns with traffic rule priors, combined with clustering and greedy retrieval, we achieve interpretable encoding and unified retrieval of scene embedding vectors.

Benefits of technology

It improves the ability to distinguish interaction topology types, reduces near-duplicate search results, and increases scene type coverage, making it suitable for training data completion and hard example discovery.

✦ Generated by Eureka AI based on patent content.

Figure CN122019846B_ABST

Abstract

This invention relates to the fields of intelligent transportation, autonomous driving data processing, and scene retrieval technology, and particularly to a highly interactive driving scene encoding and unified retrieval method. The invention first constructs frame-by-frame dual feature vectors, and then uses multi-head temporal attention to perform weighted pooling on these frame-by-frame features to extract temporal pattern features representing the interaction evolution process. Traffic rule prior vectors are introduced for encoding, and after block-wise normalization and weighted concatenation, a scene embedding vector for similarity calculation is obtained. Based on a pre-built embedding index library and scene clustering labels, a unified submodular retrieval objective function is proposed, which achieves similar-scene retrieval and diversified coverage retrieval within the same greedy framework and outputs a set of scenes that meets relevance and coverage requirements. This invention can be used for hard example discovery, training data completion, and curriculum construction, reducing redundant retrieval and improving scene type coverage. This invention has good versatility, scalability, and interpretability.

Description

Technical Field

[0001] This invention relates to the fields of intelligent transportation, autonomous driving data processing and scene retrieval technology, and in particular to a highly interactive driving scene encoding and unified retrieval method.

Background Technology

[0002] In tasks such as motion prediction, behavior decision-making, and safety assessment in autonomous driving, training data needs to cover various road topologies and multi-agent interaction behaviors. However, strong interaction scenarios (such as yielding to oncoming traffic, cross-traffic conflicts, and lane-changing competition) account for a relatively small proportion of natural driving data, and random sampling often fails to cover these rare but crucial interaction types.

[0003] Existing technologies typically suffer from the following shortcomings: insufficient generalization, as key conflict segments of focus interaction pairs are easily overlooked when searching using only average features of the entire scene or local distance metrics; low efficiency in hard example discovery, as screening based on single indicators such as distance or collision time cannot reliably reflect the scene's predictive difficulty for downstream models; redundant search results, as sorting solely by cosine similarity easily returns a large number of nearly identical scenes, resulting in insufficient coverage when completing the training set; and fragmented search frameworks, as similarity retrieval and diversified retrieval often use different algorithms, making it difficult to conduct fair comparisons and switch on demand under the same objective function.

[0004] Therefore, there is an urgent need for a unified retrieval method that can achieve interpretable encoding for highly interactive scenarios and simultaneously take into account relevance and coverage within the same framework.

Summary of the Invention

[0005] The purpose of this invention is to provide a method for encoding and retrieving highly interactive driving scenarios based on prior knowledge of traffic rules and data-driven attention, aiming to solve problems such as insufficient coverage caused by the rarity of highly interactive scenarios, redundancy in similar retrieval, and inconsistency between heuristic filtering and model difficulty.

[0006] The objective of this invention is achieved through the following technical solution:

[0007] A highly interactive driving scenario encoding and unified retrieval method includes the following steps:

[0008] S1: Acquire data for a highly interactive driving scenario: collect or read the state sequences of traffic participants across multiple time frames, determine the focus interaction subject pair, and acquire the scene metadata related to that pair;

[0009] S2: Construct frame-by-frame dual features: for the focus interaction subject pair, determine the set of co-visible frames among the multiple time frames, and calculate a normalized dual feature vector for each co-visible frame, forming a dual feature vector sequence;

[0010] S3: Apply data-driven multi-head temporal attention encoding: construct multi-head temporal attention weights based on the dual feature vector sequence, perform weighted pooling on the sequence with each head, and concatenate the results to obtain the attention feature block;

[0011] S4: Fuse temporal patterns with traffic rule priors: extract a temporal pattern feature block from the dual feature vector sequence, and construct a traffic rule prior feature block based on the scene metadata; apply block-wise L2 normalization to the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block, then concatenate them with weights to obtain the scene embedding vector;

[0012] S5: Build the offline index and cluster: execute steps S1 to S4 for the multiple highly interactive driving scenarios in the scenario library to obtain the corresponding set of scene embedding vectors; cluster this set to obtain cluster labels, and calculate the uniqueness score of each scenario;

[0013] S6: Perform unified submodular greedy retrieval: upon receiving a query scenario, compute the similarity between its scene embedding vector and the embedding vectors in the index, and construct a marginal gain function that combines this similarity with the cluster coverage gain, the uniqueness score, and a redundancy penalty term; then greedily and iteratively select a preset number of scenarios from the candidate scenarios to form the retrieval result set S.

[0014] Beneficial effects

[0015] Compared with the prior art, the present invention has the following advantages:

[0016] (1) Construct interpretable scene embeddings oriented towards the focus of interaction subjects. By using frame-by-frame dual features and data-driven multi-head attention aggregation, key conflict frames can be highlighted and irrelevant segments can be suppressed, thus achieving an effective characterization of the interaction evolution process.

[0017] (2) Introducing prior traffic rules to enhance topological consistency. By encoding rule information such as priority, speed limit, turning and path relationship into prior feature blocks, the ability to distinguish interactive topological types can be improved without relying on supervised training of the encoder.

[0018] (3) The unified submodular retrieval framework takes into account both relevance and coverage. By integrating similarity, coverage gain, uniqueness reward, and redundancy penalty in the same marginal gain function, it reduces near-duplicate retrieval results and improves scene type coverage, making it suitable for applications such as training data completion, curriculum learning, and hard example discovery.

Attached Figure Description

[0019] Figure 1 is a flowchart of the method of the present invention;

[0020] Figure 2 is a schematic diagram of the overall system architecture of the present invention;

[0021] Figure 3 is a schematic diagram of the core scene encoder structure of the present invention;

[0022] Figure 4 is a schematic diagram of the offline construction and online retrieval process of the present invention.

Detailed Implementation

[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort should fall within the scope of protection of the present invention.

[0024] A highly interactive driving scenario encoding and unified retrieval method, the overall process of which is shown in Figure 1, includes the following steps:

[0025] S1: Acquire data for a highly interactive driving scenario: collect or read the state sequences of at least two traffic participants across multiple time frames, determine the focus interaction subject pair, and acquire the scene metadata related to that pair;

[0026] S2: Construct frame-by-frame dual features: for the focus interaction subject pair, determine the set of co-visible frames among the multiple time frames, and calculate a normalized dual feature vector for each co-visible frame, forming a dual feature vector sequence;

[0027] S3: Apply data-driven multi-head temporal attention encoding: construct multi-head temporal attention weights based on the dual feature vector sequence, perform weighted pooling on the sequence with each head, and concatenate the results to obtain the attention feature block;

[0028] S4: Fuse temporal patterns with traffic rule priors: extract a temporal pattern feature block from the dual feature vector sequence, and construct a traffic rule prior feature block based on the scene metadata; apply block-wise L2 normalization to the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block, then concatenate them with weights to obtain the scene embedding vector;

[0029] S5: Build the offline index and cluster: execute steps S1 to S4 for the multiple highly interactive driving scenarios in the scenario library to obtain the corresponding set of scene embedding vectors; cluster this set to obtain cluster labels, and calculate the uniqueness score of each scenario;

[0030] S6: Perform unified submodular greedy retrieval: upon receiving a query scenario, compute the similarity between its scene embedding vector and the embedding vectors in the index, and construct a marginal gain function that combines this similarity with the cluster coverage gain, the uniqueness score, and a redundancy penalty term; then greedily and iteratively select a preset number of scenarios from the candidate scenarios to form the retrieval result set S.

[0031] Specifically, in step S1,

[0032] The highly interactive driving scenario is organized around the focus interaction subject pair $(i, j)$, where $i$ denotes the first subject in the pair and $j$ denotes the second subject. The focus interaction subject pair is extracted from the multi-subject interaction relationships within the scene, or it can be specified externally. In a multi-subject scene, the subject pair with the highest interaction strength can be selected as the focus interaction subject pair $(i, j)$ according to an interaction strength index.

[0033] It should be noted that the scene encoding and unified retrieval of this invention take already obtained highly interactive scene fragments as input. The determination and filtering of strong interaction can be completed offline by an interaction event extraction module, which identifies the focus interaction subject pair and extracts the corresponding highly interactive scene fragment. The present invention itself does not limit the specific judgment rules. In another optional implementation, whether a fragment is a highly interactive scene fragment can also be determined online in step S1 based on indicators such as a PET threshold, interaction intensity, or minimum TTC.

[0034] In an optional implementation, the highly interactive scene fragments used in step S1 are pre-extracted from the dataset by the interaction event extraction module and are located using the alignment information in the extraction results: the original scene number, the interactive subject identifier, the start frame, the end frame, and the metadata. The interactive subject identifier field records the identifiers of the two interacting subjects and can be parsed to obtain the focus interaction subject pair. The state of each subject in a time frame includes at least a position vector, a velocity vector, and a heading angle; in addition, it may include an intent tag, a visibility marker, and meta-information such as the post-encroachment time (PET).

[0035] The scene metadata includes at least one of the following: post-encroachment time (PET), Path Relation, PathCategory, TurnLabel, PriorityLabel, and SpeedLimit.

[0036] In step S2, the set of co-visible frames refers to the set of time frames, among the multiple time frames, in which the first subject $i$ and the second subject $j$ of the focus interaction pair are both effectively observed. It is denoted as $\mathcal{T}$, where $T = |\mathcal{T}|$ is the total number of co-visible frames.

[0037] The dual feature vector sequence $\{\mathbf{x}_t\}_{t=1}^{T}$ is the sequence formed by arranging the dual feature vectors $\mathbf{x}_t$ of the co-visible frames in temporal order.

[0038] The specific calculation process is as follows:

[0039] For the interaction subject pair $(i, j)$, the relative displacement, relative velocity, and distance are calculated for each frame $t$ in the set of co-visible frames:

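[0040] The relative displacement vector $\Delta\mathbf{p}_t$ can be expressed as:

[0041] $$\Delta\mathbf{p}_t = \mathbf{p}_t^{j} - \mathbf{p}_t^{i}$$

[0042] where $\mathbf{p}_t^{i}$ and $\mathbf{p}_t^{j}$ are the position vectors of subject $i$ and subject $j$ at time $t$, and $\Delta\mathbf{p}_t$ is the displacement vector of subject $j$ relative to subject $i$.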

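[0043] The inter-subject distance $d_t$ can be expressed as:

[0044] $$d_t = \left\lVert \mathbf{p}_t^{j} - \mathbf{p}_t^{i} \right\rVert_2$$

[0045] where $\lVert\cdot\rVert_2$ denotes the L2 norm, $\mathbf{p}_t^{i}$ and $\mathbf{p}_t^{j}$ are the position vectors of subject $i$ and subject $j$ at time $t$, and $d_t$ is the Euclidean distance between subject $i$ and subject $j$ at time $t$.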

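[0046] The relative velocity vector between the subjects, $\Delta\mathbf{v}_t$, can be expressed as:

[0047] $$\Delta\mathbf{v}_t = \mathbf{v}_t^{j} - \mathbf{v}_t^{i}$$

[0048] where $\mathbf{v}_t^{i}$ and $\mathbf{v}_t^{j}$ are the velocity vectors of subject $i$ and subject $j$ at time $t$, and $\Delta\mathbf{v}_t$ is the relative velocity vector between the subjects.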

[0049] The relative displacement, relative velocity, and distance are used to characterize the instantaneous geometric relationship and motion differences between the two subjects, and serve as the basic input for subsequent conflict quantity calculations.

[0050] The approach speed is further defined to characterize how quickly the two subjects are closing on each other at time $t$:

[0051]

[0052] where the result is the approach speed at time $t$; when the two subjects move away from each other, the approach speed is taken as 0; $\cdot$ denotes the vector dot product; and a very small constant is used to keep the denominator from being 0.

[0053] Further calculations are performed to determine the effective clearance after considering the vehicle's geometry:

[0054]

[0055] where the result is the effective gap, obtained by applying a vehicle geometry correction that is generally set to a preset constant; the correction compensates for the influence of vehicle size on the clearance between the subjects.

[0056] Based on the approach speed and the effective gap, the geometry-corrected time to collision (TTC) is calculated:

[0057]

[0058] where the result is the geometry-corrected TTC at time $t$, which is capped at a preset upper limit; in an optional implementation, when the effective gap is not positive and the approach speed is greater than 0, the TTC is set to 0.

[0059] The deceleration required to avoid a collision is further calculated and normalized:

[0060]

[0061] where the result is the deceleration required to avoid a collision in the current approach state.

[0062]

[0063] where the normalization uses a preset critical deceleration, for which a value of 3.4 is chosen in one optional implementation, and the result is the normalized required deceleration.
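The per-frame conflict quantities described in paragraphs [0039] to [0063] can be sketched as follows. This is only an illustration under stated assumptions, not the patented formulas, which are not reproduced in this text: the function name `conflict_quantities`, the subtraction of a geometry correction `r_geo` (2.0 here), the TTC cap `ttc_max` (10.0 here), the stopping-distance form of the required deceleration, and the small constant `eps` are all assumed. Only the critical deceleration of 3.4 is taken from the description.

```python
import math

def conflict_quantities(p_i, p_j, v_i, v_j,
                        r_geo=2.0, ttc_max=10.0, a_crit=3.4, eps=1e-6):
    """Conflict quantities for one co-visible frame (2-D vectors as tuples).

    r_geo, ttc_max and eps are illustrative assumptions.
    """
    # Relative displacement and velocity of subject j with respect to subject i.
    dp = (p_j[0] - p_i[0], p_j[1] - p_i[1])
    dv = (v_j[0] - v_i[0], v_j[1] - v_i[1])
    d = math.hypot(dp[0], dp[1])

    # Approach speed: positive while the subjects close in, 0 when they separate.
    v_close = max(0.0, -(dp[0] * dv[0] + dp[1] * dv[1]) / (d + eps))

    # Effective gap after the (assumed subtractive) geometry correction.
    gap = max(d - r_geo, 0.0)

    # Geometry-corrected TTC, capped at ttc_max; 0 when the gap is used up.
    ttc = min(gap / v_close, ttc_max) if v_close > 0.0 else ttc_max

    # Required deceleration (assumed v^2 / (2 * gap)), normalised and clipped.
    a_req = v_close ** 2 / (2.0 * gap + eps)
    a_norm = min(a_req / a_crit, 1.0)

    return {"d": d, "v_close": v_close, "gap": gap,
            "ttc": ttc, "a_req": a_req, "a_norm": a_norm}
```

For example, with subject $i$ at the origin moving at 5 along the x-axis toward a stationary subject $j$ at distance 12, the sketch yields an approach speed of about 5, an effective gap of 10, and a TTC of about 2.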

[0064] In an optional implementation, the dual feature vector of step S2 is constructed as shown in Table 1, and each feature is clipped to the [0,1] interval. The 10 features correspond, in order, to the geometry-corrected TTC, the inter-subject distance, the approach speed, the deceleration required to avoid a collision, the intention conflict score, the speed of the first subject, the speed of the second subject, the lateral bearing, the lateral offset, and the heading consistency. In Table 1, the minimum operation is used, and the normalization constants are the maximum approach speed, the maximum geometry-corrected TTC, and the maximum inter-subject distance.

[0065] Among them, the intention conflict score represents the potential conflict intensity predicted from short-horizon trajectories based on the intentions of the two subjects; the lateral bearing uses the lateral component of the relative bearing between the two subjects and is normalized; the lateral offset characterizes the lateral distance between the two subjects and is normalized. The dual feature vector is thus a 10-dimensional two-subject feature representation obtained by normalizing, clipping, or risk-mapping the aforementioned relative displacement, relative velocity, distance, approach speed, effective gap, geometry-corrected TTC, deceleration required to avoid a collision, subject speeds, and heading relationship according to preset mapping relationships.

[0066]

[0067] In step S3, $H$ attention heads are constructed over the dual feature vector sequence. Each attention head generates a per-frame query signal $q_{h,t}$ from a different conflict signal, and the attention weights $\alpha_{h,t}$ are obtained through the softmax function:

[0068] $$\alpha_{h,t} = \frac{\exp\left(q_{h,t}/\tau\right)}{\sum_{t'=1}^{T}\exp\left(q_{h,t'}/\tau\right)}, \quad h = 1, \dots, H$$

[0069] where $q_{h,t}$ is the query signal of the $h$-th attention head at time $t$, $\alpha_{h,t}$ is the corresponding attention weight, $\tau$ is a temperature parameter used to control the sharpness of the attention distribution, $T$ is the total number of co-visible frames, and $H$ is the total number of attention heads; in this embodiment, $H$ is 5.

[0070] In this embodiment, five attention heads are set: the TTC attention head, the distance attention head, the intention conflict attention head, the required deceleration attention head, and the TTC decline rate attention head. Their query signals are calculated as follows:

[0071]

[0072]

[0073]

[0074]

[0075]

[0076] where the query signals use the normalized TTC, the normalized distance, and the intention conflict score at time $t$;

[0077] The TTC head emphasizes frames with smaller TTC; the distance head emphasizes frames in which the subjects are close together; the intention conflict head emphasizes frames with significant intention conflict; the required deceleration head emphasizes frames with high braking demand; and the fifth head introduces a cross-frame difference term to explicitly emphasize moments when the level of danger escalates rapidly, with a boundary value specified for the first frame.

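[0078] After the weights of each attention head are obtained, weighted pooling is performed on the dual feature vector sequence to obtain the aggregate vector $\mathbf{a}_h$ of each attention head:

[0079] $$\mathbf{a}_h = \sum_{t=1}^{T} \alpha_{h,t}\,\mathbf{x}_t$$

[0080] where $\mathbf{a}_h$ is the aggregate vector corresponding to the $h$-th attention head, $\alpha_{h,t}$ is the attention weight of that head at frame $t$, $\mathbf{x}_t$ is the dual feature vector of frame $t$, and $T$ is the total number of co-visible frames.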

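[0081] The aggregate vectors of all attention heads are concatenated to obtain the attention feature block $\mathbf{f}_{att}$:

[0082] $$\mathbf{f}_{att} = \left[\mathbf{a}_1;\ \mathbf{a}_2;\ \dots;\ \mathbf{a}_H\right]$$

[0083] where $\mathbf{a}_1, \dots, \mathbf{a}_H$ are the aggregate vectors obtained from each attention head;

[0084] The frame-by-frame dual feature vector has dimension $D$. In this embodiment, $D = 10$ and $H = 5$; therefore $\mathbf{f}_{att}$ is a 50-dimensional vector.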
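The softmax weighting, weighted pooling, and concatenation of step S3 can be sketched as below. This is an illustration rather than the patented implementation: the function names and the default temperature `tau=0.1` are assumptions, and the per-head query signals are passed in as plain lists because their exact formulas are not reproduced in this text.

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax over one head's per-frame query signals."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_block(features, queries, tau=0.1):
    """Pool a T x D feature sequence with H heads into one H*D vector.

    features: list of T frames, each a list of D values.
    queries:  list of H heads, each a list of T query signals.
    tau:      temperature; the default value is an assumption.
    """
    dim = len(features[0])
    block = []
    for head_queries in queries:
        weights = softmax(head_queries, tau)
        # Weighted sum over frames for each feature dimension.
        pooled = [sum(w * frame[k] for w, frame in zip(weights, features))
                  for k in range(dim)]
        block.extend(pooled)  # concatenate the heads in order
    return block
```

With $D = 10$ features per frame and $H = 5$ heads, `attention_block` returns a 50-element list, matching the dimension stated in this embodiment. A smaller `tau` concentrates each head on its highest-scoring frames.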

[0085] In step S4, a temporal pattern feature block is extracted from the dual feature vector sequence; it is used to characterize the evolutionary trend of the interaction process.

[0086] The temporal pattern feature block is a 12-dimensional vector comprising: the minimum TTC, the time fraction at which the conflict peak occurs, the TTC trend, the proportion of frames in which danger persists, the peak value of a conflict quantity, the peak intention conflict, the mean intention conflict, the PET score, the path crossing probability, the crossing path category indicator, the merging path category indicator, and the co-visible frame coverage.

[0087] Among these, the minimum TTC is taken over the TTC sequence; the time fraction of the conflict peak is obtained from the position of the peak frame within the sequence; the TTC trend is obtained by mapping the slope of a linear fit to the TTC sequence; the danger persistence proportion is the fraction of frames that satisfy a preset danger condition; the conflict-quantity peak, the peak intention conflict, and the mean intention conflict are obtained by taking the maximum or the mean of the corresponding sequences; the PET score is obtained from the PET by inverse compression and clipped normalization; the path crossing probability is extrapolated from the short-term future motion of the two subjects near the peak conflict frame; the crossing path category indicator and the merging path category indicator are indicator variables that equal 1 when the path category belongs to the crossing type and the merging type, respectively; and the co-visible frame coverage is obtained from the share of frames that are co-visible.

[0088] Here, the peak frame is the frame index at which the TTC reaches its minimum value; the peak time fraction indicates when, within the interaction, the conflict peak occurs; the TTC trend score characterizes the direction and intensity of change in interaction risk over time, using the slope coefficient obtained by linear fitting of the TTC sequence; and the PET score is the normalized score of the post-encroachment time. The calculation formulas are as follows:

[0089]

[0090]

[0091]

[0092]

[0093] In step S4, the traffic rule prior feature block $\mathbf{f}_{rule}$ is constructed. It comprises the continuous rule feature vector $\mathbf{B}$ and one-hot encodings of the categorical rule features, and participates in constructing the scene embedding vector together with the attention feature block and the temporal pattern feature block.

[0094] $$\mathbf{f}_{rule} = \left[\mathbf{B};\ \mathbf{o}_{path};\ \mathbf{o}_{turn}\right]$$

[0095] where $\mathbf{B}$ is the continuous rule feature sub-vector, which encodes traffic rule information including priority, yielding, speed limit, turning, and path risk; the categorical rule features use one-hot encoding to encode the path relationship category and the symmetric turning combination category, where $\mathbf{o}_{path}$ is the one-hot encoded sub-vector for the path relationship category and $\mathbf{o}_{turn}$ is the one-hot encoded sub-vector for the symmetric turning combination category. The contribution of the one-hot sub-vectors to the similarity calculation can be strengthened by preset scaling factors.

[0096] In this embodiment, $\mathbf{B}$ is 8-dimensional, $\mathbf{o}_{path}$ is 8-dimensional, and $\mathbf{o}_{turn}$ is 6-dimensional; therefore $\mathbf{f}_{rule}$ is $8 + 8 + 6 = 22$-dimensional.

[0097] The one-hot encoded sub-vector corresponding to the path relationship category is represented as follows:

[0098] $$\mathbf{o}_{path} = s_{path}\,\mathbf{e}_{k_{path}}$$

[0099] where $s_{path}$ is the scaling factor for the one-hot encoding of the path relationship category, taken as 3.0 in this embodiment; $k_{path}$ is the index (counted from 0) of the path relationship category in a preset vocabulary. The preset vocabulary is the set of categories formed by arranging the path relationship categories in a preset order; for example, it can be set as {parallel-in, cross-cross, parallel-parallel, oblique-cross, parallel-cross, cross-in, cross-parallel, cross-oblique}. $\mathbf{e}_{k_{path}}$ is the standard basis vector corresponding to the path relationship category, that is, the one-hot vector whose $k_{path}$-th dimension is 1 and whose other dimensions are 0. For example, when the path relationship category is cross-cross and its index is 1, $\mathbf{e}_{1} = [0, 1, 0, 0, 0, 0, 0, 0]^\top$.

[0100] The one-hot encoded sub-vector corresponding to the symmetric turning combination category is represented as follows:

[0101] $$\mathbf{o}_{turn} = s_{turn}\,\mathbf{e}_{k_{turn}}$$

[0102] where $s_{turn}$ is the scaling factor for the one-hot encoding of the symmetric turning combination category, taken as 2.0 in this embodiment; $\mathbf{e}_{k_{turn}}$ is the standard basis vector corresponding to the symmetric turning combination category index; and $k_{turn}$ is the category index corresponding to the turn label combination of the two subjects. By symmetrizing the turning combinations, the encoding remains unchanged when the subject order is swapped. In an optional implementation, with going straight denoted S, a left turn denoted L, and a right turn denoted R, the symmetric combination category vocabulary can be set as {(S,S), (L,L), (R,R), (L,S), (R,S), (L,R)}. For example, when the turn label of one subject is L and that of the other subject is S, the combination after symmetrization is denoted (L,S) regardless of subject order; its category index is 3, so the corresponding standard basis vector is $\mathbf{e}_{3} = [0, 0, 0, 1, 0, 0]^\top$.
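The two scaled one-hot encodings can be sketched as follows, using the example vocabularies and the scaling factors 3.0 and 2.0 given in this embodiment. The function names are hypothetical, and indices are counted from 0, consistent with the examples (cross-cross at index 1, (L,S) at index 3).

```python
PATH_VOCAB = ["parallel-in", "cross-cross", "parallel-parallel", "oblique-cross",
              "parallel-cross", "cross-in", "cross-parallel", "cross-oblique"]
TURN_VOCAB = [("S", "S"), ("L", "L"), ("R", "R"),
              ("L", "S"), ("R", "S"), ("L", "R")]

def scaled_one_hot(index, size, scale):
    """Standard basis vector of the given size, multiplied by a scaling factor."""
    vec = [0.0] * size
    vec[index] = scale
    return vec

def encode_path_relation(category, scale=3.0):
    """Scaled one-hot sub-vector for a path relationship category."""
    return scaled_one_hot(PATH_VOCAB.index(category), len(PATH_VOCAB), scale)

def encode_turn_pair(turn_a, turn_b, scale=2.0):
    """Scaled one-hot sub-vector for a turn label pair, invariant to subject order."""
    pair = (turn_a, turn_b)
    if pair not in TURN_VOCAB:
        pair = (turn_b, turn_a)  # symmetrize: try the swapped order
    return scaled_one_hot(TURN_VOCAB.index(pair), len(TURN_VOCAB), scale)
```

Swapping the two subjects leaves the turn encoding unchanged: `encode_turn_pair("S", "L")` and `encode_turn_pair("L", "S")` both place the value 2.0 at index 3.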

[0103] In an optional implementation, the continuous rule feature vector $\mathbf{B}$ is an 8-dimensional vector, as shown in Table 2. Table 2 uses a clipping operation, an averaging operation, and the speed limit.

[0104]

[0105] Among them, the gap acceptance is calculated as:

[0106]

[0107] where the threshold is the critical gap threshold; a smaller gap indicates a willingness to accept a more urgent interaction.

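[0108] Furthermore, the attention feature block $\mathbf{f}_{att}$, the temporal pattern feature block $\mathbf{f}_{tmp}$, and the traffic rule prior feature block $\mathbf{f}_{rule}$ are each L2-normalized block by block, concatenated with weights, and then globally normalized to obtain the final scene embedding vector $\mathbf{z}$:

[0109] $$\mathbf{z} = \mathrm{Norm}\left(\left[\,w_{att}\,\hat{\mathbf{f}}_{att};\ w_{tmp}\,\hat{\mathbf{f}}_{tmp};\ w_{rule}\,\hat{\mathbf{f}}_{rule}\,\right]\right)$$

[0110] where $\hat{\mathbf{f}}_{att}$, $\hat{\mathbf{f}}_{tmp}$, and $\hat{\mathbf{f}}_{rule}$ denote the block-wise L2-normalized results of $\mathbf{f}_{att}$, $\mathbf{f}_{tmp}$, and $\mathbf{f}_{rule}$, and $\mathrm{Norm}(\cdot)$ denotes a global L2 normalization operation on the concatenated result.

[0111] $w_{att}$, $w_{tmp}$, and $w_{rule}$ are the weight coefficients of the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block, respectively.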

[0112]

[0113] In this embodiment, the scene embedding vector dimension $d$ is:

[0114] $$d = H \times 10 + 12 + 22 = 84$$

[0115] where $H = 5$ is the number of attention heads, 10 is the dimension of the frame-by-frame dual feature vector, 12 is the dimension of the temporal pattern feature block, 22 is the dimension of the traffic rule prior feature block, and $d$ is the dimension of the scene embedding vector; in this embodiment, $d$ is 84.

[0116] Here, each feature block enters the concatenation in its block-wise L2-normalized form, and the concatenated result then undergoes the global L2 normalization operation.
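The block-wise normalization, weighted concatenation, and global normalization can be sketched as follows. The function names, the equal default weights, and the small constant `eps` guarding against an all-zero block are assumptions made for illustration; the description does not state the weight values used in the embodiment.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against an all-zero block."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def fuse_blocks(f_att, f_tmp, f_rule, weights=(1.0, 1.0, 1.0)):
    """Build a scene embedding from the three feature blocks.

    Each block is L2-normalized on its own, scaled by its weight,
    concatenated, and the result is L2-normalized again.
    The default weights are assumed, not taken from the description.
    """
    parts = []
    for weight, block in zip(weights, (f_att, f_tmp, f_rule)):
        parts.extend(weight * x for x in l2_normalize(block))
    return l2_normalize(parts)
```

With a 50-dimensional attention block, a 12-dimensional temporal pattern block, and a 22-dimensional rule prior block, `fuse_blocks` returns an 84-dimensional unit vector. Normalizing each block first keeps the larger attention block from dominating the similarity purely because it has more dimensions.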

[0117] As shown in Figure 4, this invention combines offline construction with online retrieval: in the offline stage, scene embedding vectors are calculated in batches for the scene library, and the index and clustering information are established; in the online stage, the query embedding vector is calculated for the query scene, and unified submodular greedy retrieval is performed, outputting the retrieval result set.

[0118] Steps S5 and S6 correspond to the offline construction stage and the online retrieval stage, respectively.

[0119] Step S5, the offline stage, constructs a retrieval index from the scene embedding vector of each scene in the scene library. The scene library consists of multiple highly interactive scene fragments extracted offline from the traffic scene dataset by the interaction event extraction module; these fragments are organized and aligned by scene identifier, interactive subject identifier, and starting frame. In an optional implementation, the scene similarity matrix S and the distance matrix D are pre-computed as follows:

[0120] $$S_{ab} = \frac{\mathbf{z}_a^\top \mathbf{z}_b}{\lVert\mathbf{z}_a\rVert_2\,\lVert\mathbf{z}_b\rVert_2}$$

[0121] $$D_{ab} = 1 - S_{ab}$$

[0122] where $a$ and $b$ are scene indices, $\mathbf{z}_a$ and $\mathbf{z}_b$ are the scene embedding vectors of scene $a$ and scene $b$, $S_{ab}$ is the cosine similarity between scene $a$ and scene $b$, and $D_{ab}$ is the corresponding cosine distance.

[0123] Furthermore, the scene embedding vector set is clustered to obtain cluster labels, where the label of a scene is the number of the interaction-type cluster to which it belongs; two scenes with the same label belong to the same cluster in the embedding space, and a label less than 0 indicates an outlier. In an optional implementation, density clustering or K-means clustering is used, and the optimal number of clusters is selected within a preset range of K based on the silhouette coefficient.

[0124] The uniqueness score is obtained by combining the distance from the scene to its nearest neighbor scene and the average distance from the scene to all other scenes:

[0125]

[0126]

[0127] where the first quantity is the distance from a scene to its nearest neighbor scene, the second is the average distance from the scene to all other scenes, and $N$ is the size of the scene library.

[0128]

[0129] where $N$ is the size of the scene library; the maximum nearest neighbor distance and the maximum average distance over the entire scene library are used for normalization; and the outlier indicator function takes the value 1 if the scene is an outlier and 0 otherwise. An outlier reward is thus introduced when the scene is an outlier, to increase its exploration priority.
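One way the uniqueness score could be assembled from a precomputed distance matrix is sketched below. The description states which ingredients are combined but the combining formula is not reproduced in this text, so the linear combination, the weights `w_nn` and `w_mean`, and the `outlier_bonus` value are all assumptions, as is the function name.

```python
def uniqueness_scores(dist, labels, w_nn=0.5, w_mean=0.5, outlier_bonus=0.2):
    """Uniqueness score per scene from an N x N distance matrix (N >= 2).

    labels[i] < 0 marks scene i as an outlier.
    The weights and the bonus are illustrative assumptions.
    """
    n = len(dist)
    # Distance to the nearest other scene, and mean distance to all others.
    nearest = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    average = [sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
               for i in range(n)]
    # Normalise by the library-wide maxima (fall back to 1 if all are zero).
    nearest_max = max(nearest) or 1.0
    average_max = max(average) or 1.0
    return [w_nn * nearest[i] / nearest_max
            + w_mean * average[i] / average_max
            + (outlier_bonus if labels[i] < 0 else 0.0)
            for i in range(n)]
```

In this sketch a scene that is both the farthest from its nearest neighbor and the farthest on average, and is also labeled an outlier, receives the highest score and is therefore explored first.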

[0130] Without providing a query scenario, an exploration sequence covering the scene space can be generated through farthest point sampling (FPS):

[0131]

[0132]

[0133] where the selected index is the scene index chosen at each step of farthest point sampling; the formulas use the uniqueness score of each scene and the distance between a candidate scene and each already selected scene; the minimum of these distances is the shortest distance from the candidate scene to the selected set; and $\arg\max$ denotes the argument that maximizes the objective function. The sampling process prioritizes scenes farthest from the already selected set, gradually expanding coverage of the scene space.
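Farthest point sampling over a precomputed distance matrix can be sketched as follows. Starting from the scene with the highest uniqueness score is an assumption suggested by the appearance of the uniqueness score in the description; the function name is hypothetical.

```python
def farthest_point_order(dist, uniqueness, k):
    """Return up to k scene indices in farthest-point-sampling order.

    dist:       N x N distance matrix (list of lists).
    uniqueness: per-scene uniqueness scores, used to pick the first scene
                (an assumed seeding rule).
    """
    n = len(dist)
    first = max(range(n), key=lambda i: uniqueness[i])
    order = [first]
    # Shortest distance from every scene to the selected set so far.
    min_dist = list(dist[first])
    while len(order) < min(k, n):
        remaining = [i for i in range(n) if i not in order]
        # Pick the scene whose nearest selected scene is farthest away.
        nxt = max(remaining, key=lambda i: min_dist[i])
        order.append(nxt)
        min_dist = [min(min_dist[i], dist[nxt][i]) for i in range(n)]
    return order
```

Each step only updates the running minimum distances, so the sketch costs on the order of $N$ operations per selected scene once the distance matrix is available.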

[0134] In step S6, cosine similarity is calculated from the scene embedding vectors, and the marginal gain function for unified submodular retrieval is constructed.

[0135] The cosine similarity is as follows:

[0136] $$\mathrm{sim}(q, c) = \mathbf{z}_q^\top \mathbf{z}_c$$

[0137] where $\mathbf{z}_q$ is the scene embedding vector of the query scenario $q$ and $\mathbf{z}_c$ is the scene embedding vector of the candidate scenario $c$; since the embedding vectors have been L2-normalized, the cosine similarity reduces to their inner product.

[0138] Marginal gain function:

[0139] $$\Delta(c \mid S) = w_{rel}\,\mathrm{sim}(q, c) + w_{cov}\,\mathbb{1}_{cov}(c, S) + w_{uniq}\,u_c - w_{red}\,r(c, S)$$

[0140] $$r(c, S) = \max_{s \in S} \mathrm{sim}(c, s)$$

[0141] where $w_{rel}$, $w_{cov}$, $w_{uniq}$, and $w_{red}$ are the relevance weight, coverage weight, uniqueness reward weight, and redundancy penalty weight, respectively; $\mathbb{1}_{cov}(c, S)$ is the coverage gain indicator function, which takes the value 1 when the cluster label of the candidate scenario $c$ has not yet been covered by the selected set $S$ and 0 otherwise; $u_c$ is the uniqueness score of the candidate scenario $c$; and $r(c, S)$ is the similarity between the candidate scenario and the most similar scenario in the selected set, used to suppress redundancy.

[0142] In an optional implementation, by setting different weight combinations, similarity retrieval, maximal marginal relevance (MMR) retrieval, diversified coverage retrieval, balanced retrieval, and adaptive weight retrieval (auto) can all be achieved within the same retrieval framework. Table 3 provides an example weight configuration.

[0143]

[0144] in, This is an adjustable parameter with a value range of [0,1]; when When the value is larger, it tends to favor correlation. When the value is smaller, it tends to favor diversity. The adaptive mode can dynamically set weights based on the uniqueness score of the query scenario and the local density of its cluster, in order to balance stable relevance with necessary exploration.

[0145] A greedy iterative procedure is used: at each step, the candidate scene with the largest marginal gain is added to the selected set, until the required number of scenes has been selected. By setting different weight combinations, users can switch between similarity retrieval and diversified coverage retrieval, and the weights can be adjusted adaptively according to the uniqueness and local density of the query scene.
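The greedy selection loop can be sketched as below. This is an illustrative form only: the gain combines relevance, a cluster coverage indicator, a uniqueness reward, and a redundancy penalty as the description lists them, but the function name, parameter names, default weights, and the treatment of outlier labels (negative labels never count as new coverage) are assumptions of this sketch.

```python
import numpy as np

def greedy_retrieve(query, embeddings, labels, uniqueness, k,
                    w_rel=1.0, w_cov=0.0, w_uniq=0.0, w_red=0.0):
    """Greedily pick up to k scene indices by marginal gain.

    Assumes `query` and the rows of `embeddings` are L2-normalized, so
    dot products equal cosine similarities.
    """
    emb = np.asarray(embeddings, dtype=float)
    rel = emb @ np.asarray(query, dtype=float)
    uniq = np.asarray(uniqueness, dtype=float)
    # max_sim[i]: similarity of scene i to its most similar selected scene.
    max_sim = np.full(len(emb), -np.inf)
    covered, selected = set(), []
    remaining = set(range(len(emb)))
    while remaining and len(selected) < k:
        best, best_gain = None, -np.inf
        for i in sorted(remaining):
            new_cluster = labels[i] >= 0 and labels[i] not in covered
            redundancy = max_sim[i] if selected else 0.0
            gain = (w_rel * rel[i] + w_cov * float(new_cluster)
                    + w_uniq * uniq[i] - w_red * redundancy)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.discard(best)
        covered.add(labels[best])
        max_sim = np.maximum(max_sim, emb @ emb[best])
    return selected
```

With only the relevance weight set, the loop reduces to plain top-k similarity retrieval; raising the coverage and redundancy weights pushes the result toward scenes from clusters not yet represented.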

[0146] In an optional implementation, for a query scene that is not in the index, the query embedding vector is computed online according to steps S1 to S4. Its similarity to the embedding vectors in the index is then computed, and the unified submodular greedy retrieval is performed, which provides similarity and coverage retrieval for out-of-library queries.

[0147] As shown in Figure 2, the overall system architecture of this invention includes a multi-source data access and preprocessing module, an interactive event extraction module, a scene encoding and retrieval index construction module, and a scene retrieval module. Specifically, the multi-source data access and preprocessing module accesses publicly available driving data and self-collected data, and performs unified processing on trajectory data, road topology information, and environmental information; the interactive event extraction module identifies the focus interaction subject pair and extracts the corresponding highly interactive scene fragments; the scene encoding and retrieval index construction module combines the interaction trajectory sequence, temporal information, and local map topology to generate scene embedding vectors that serve as the searchable index; the scene retrieval module performs offline index building and clustering, and unified submodular greedy retrieval to obtain scene subsets.

[0148] As shown in Figure 3, the scene encoding module includes a preprocessing unit, a five-head attention encoding unit, a spatiotemporal feature extraction unit, a rule injection unit, and a weighted synthesis unit. The preprocessing unit extracts basic features from trajectory data and metadata; the five-head attention encoding unit encodes five attention heads, among them distance, intent conflict, and risk escalation; the spatiotemporal feature extraction unit extracts statistical features such as minimum and maximum values, trends, and persistence; the rule injection unit introduces rule features, topological features, and turning pair features; the weighted synthesis unit performs weighted concatenation and normalization on the feature blocks to obtain the scene embedding vector.

[0149] Example: Validation of search results

[0150] This embodiment uses strongly interacting subject pairs from the unified cache of the INTERACTION dataset as the sample source, and uses PathRelation consistency as the binary relevance criterion to evaluate the unified retrieval modes on the embeddings produced by the highly interactive scene encoder of this invention.

[0151] The evaluation indicators include relevance indicators (precision over the top-ranked results and mean average precision) and diversity-related indicators (cluster coverage, intra-set dissimilarity, and redundancy). A cutoff hyperparameter gives the number of results used for diversity assessment: after retrieval is completed for a query scene, the top-ranked results up to the cutoff form an evaluation set, and the coverage, intra-set dissimilarity, and redundancy indicators are calculated on this set. In this embodiment, the cutoff is set to 5. The precision indicator is the proportion of relevant results among the top-ranked results. The mean average precision is the mean of the average precision over all queries. The coverage indicator is the proportion of the total number of clusters that is covered by the first 5 search results. The intra-set dissimilarity indicator is the average of the pairwise dissimilarity scores within the first 5 search results. The redundancy indicator is the proportion of the first 5 search results that share a cluster label with an earlier selected result.

[0152] The calculation formulas are as follows:

[0153]

[0154]

[0155] where the relevance term indicates whether the result at a given ranking position is relevant, the query set is the set of all evaluation queries, and the average precision is calculated for each query in that set.
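The relevance indicators can be computed as in the sketch below, which takes a ranked list of binary relevance flags per query. The function names are hypothetical, and the average precision is normalized by the number of relevant results actually retrieved, which is one common convention and an assumption here rather than the formula of the source.

```python
def precision_at_k(relevant, k):
    """Proportion of relevant results among the first k ranked results."""
    if k <= 0:
        return 0.0
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of the precision values taken at each relevant ranking position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of the average precision over all queries in the query set."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a ranking with relevance flags `[1, 0, 1]` has precision 1/1 at the first hit and 2/3 at the second, so its average precision is their mean.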

[0156]

[0157]

[0158] where the evaluation set consists of the first 5 search results, the coverage is normalized by the total number of all valid clusters, and the dissimilarity is calculated between pairs of scenes in that set.

[0159]

[0160] where each term refers to the search result at a given ranking position; a result is counted as a redundant hit when it belongs to the same cluster as a result at an earlier position.
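The three diversity-related indicators can be sketched as follows for one query's top-ranked results. The function names are hypothetical, and two choices are assumptions of this sketch: pairwise dissimilarity is taken as one minus the cosine similarity of L2-normalized embeddings, and negative (outlier) labels are not counted as covered clusters.

```python
from itertools import combinations

import numpy as np

def coverage_at_k(labels_topk, n_clusters):
    """Share of all valid clusters that appear among the top-k results."""
    return len({l for l in labels_topk if l >= 0}) / n_clusters

def intra_set_dissimilarity(emb_topk):
    """Average pairwise (1 - cosine similarity) within the top-k results."""
    emb = np.asarray(emb_topk, dtype=float)
    pairs = list(combinations(range(len(emb)), 2))
    if not pairs:
        return 0.0
    return float(np.mean([1.0 - emb[i] @ emb[j] for i, j in pairs]))

def redundancy_at_k(labels_topk):
    """Share of top-k results whose cluster already appeared earlier."""
    seen, repeats = set(), 0
    for label in labels_topk:
        if label in seen:
            repeats += 1
        seen.add(label)
    return repeats / len(labels_topk) if labels_topk else 0.0
```

Coverage and redundancy depend only on cluster labels, while the dissimilarity indicator depends on the embeddings themselves, so the three can move independently when the retrieval weights change.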

[0161] Table 4 shows example retrieval metrics for the embeddings produced by the highly interactive scene encoder. The cutoff position for the diversity-related indicators is 5 in this embodiment.

[0162]

[0163] Table 4 shows that the unified submodular retrieval mechanism proposed in this invention is not limited to a single similarity retrieval. The balanced mode maintains high relevance metrics on both the full dataset and the subset, while significantly improving the diversity-related metrics and reducing the redundancy metric to near 0, indicating that the present invention achieves a stable balance between relevance and coverage. The diverse mode further increases the dissimilarity within the result set, indicating that the coverage term and the redundancy penalty term in the marginal gain function effectively expand the coverage of scene types.

[0164] The above description is merely a description of preferred embodiments of this application and is not intended to limit the scope of this application in any way. Any changes or modifications made by those skilled in the art based on the above-disclosed technical content should be considered as equivalent and valid embodiments and fall within the scope of protection of the technical solution of this application.

Claims

1. A highly interactive driving scene encoding and unified retrieval method, characterized in that it includes the following steps: S1: acquiring highly interactive driving scene data: collecting or reading the state sequences of traffic participants across multiple time frames, determining the focus interaction subject pair, and acquiring scene metadata related to the interaction subjects; S2: frame-by-frame dual feature construction: selecting, from the multiple time frames, the set of co-visible frames of the interaction subject pair, and calculating a normalized dual feature vector for each co-visible frame to form a dual feature vector sequence; S3: data-driven multi-head temporal attention encoding: constructing multi-head temporal attention weights from the dual feature vector sequence, performing weighted pooling on the dual feature vector sequence, and concatenating the results to obtain an attention feature block; S4: fusion of temporal patterns and traffic rule priors: extracting a temporal pattern feature block from the dual feature vector sequence, and constructing a traffic rule prior feature block from the scene metadata; applying block-wise normalization to the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block, and concatenating them with weights to obtain the scene embedding vector; S5: offline index construction and clustering: executing steps S1 to S4 for multiple highly interactive driving scenes in the scene library to obtain the corresponding set of scene embedding vectors, clustering this set to obtain cluster labels, and calculating the uniqueness score of each scene; S6: unified submodular greedy retrieval: upon receiving a query scene, calculating the similarity between the scene embedding vector of the query scene and the embedding vectors in the index, constructing a marginal gain function that combines this similarity with the cluster coverage gain, the uniqueness score, and a redundancy penalty term, and using a greedy iterative procedure to select the required number of scenes from the candidate scenes to form the retrieval result set S.

2. The method according to claim 1, characterized in that, in step S1, the highly interactive driving scene is organized around the focus interaction subject pair, which consists of a first subject and a second subject; the focus interaction subject pair is extracted from the multi-subject interaction relationships within the scene, or is specified externally; the scene metadata includes at least one of the following: post-encroachment time (PET), PathRelation, PathCategory, TurnLabel, PriorityLabel, and SpeedLimit.

3. The method according to claim 2, characterized in that the highly interactive scene fragments used in step S1 are pre-extracted from the dataset and are located by aligning on the original scene number, interaction subject identifier, start frame, end frame, and metadata in the extraction results; the interaction subject identifier field records the identifiers of the two interacting subjects, from which the focus interaction subject pair is obtained by parsing; the state of each subject at a time frame includes at least a position vector, a velocity vector, and a heading angle, and may additionally include an intent label, a visibility flag, and post-encroachment time meta-information.

4. The method according to claim 1, characterized in that, in step S2, the set of co-visible frames refers to the set of time frames, among the multiple time frames, in which the first subject and the second subject are both validly observed, the size of this set being the total number of co-visible frames; the dual feature vector sequence is the sequence formed by arranging the dual feature vectors in the temporal order of the co-visible frames.

5. The method according to claim 1, characterized in that, in step S2, the dual feature vector includes: a geometrically corrected measure, the distance between the subjects, the approach speed, the deceleration required to avoid a collision, the intent conflict score, the speed of the first subject, the speed of the second subject, the lateral bearing, the lateral offset, and the heading consistency.

6. The method according to claim 1, characterized in that, in step S3, multiple attention heads are constructed for the dual feature vector sequence; each attention head generates a query signal from a different conflict signal and obtains attention weights through a normalization function, where the terms are the query signal of a given attention head at a given time, the corresponding attention weight, a temperature parameter used to control the sharpness of the attention distribution, and the total number of attention heads; after the weights of each attention head are obtained, weighted pooling is performed on the dual feature vector sequence to obtain the aggregate vector of each attention head, where the terms are the aggregate vector corresponding to a given attention head and the total number of co-visible frames; the aggregate vectors of all attention heads are concatenated to obtain the attention feature block.

7. The method according to claim 6, characterized in that five attention heads are set, including a distance attention head, an intent conflict attention head, and a decline-rate attention head; the query signals are calculated from the normalized values of the corresponding quantities, including the normalized distance and the intent conflict score at each time.

8. The method according to claim 6, characterized in that, in step S4, a temporal pattern feature block is extracted from the dual feature vector sequence to characterize the evolution trend of the interaction process; the temporal pattern feature block is a 12-dimensional vector, including: a minimum-value feature, the proportion of time elapsed when the conflict peak occurs, a trend feature, danger persistence, a peak-value feature, the peak intent conflict, the mean intent conflict, a score feature, the path intersection probability, a crossing path category indicator, a merging path category indicator, and the co-visible frame coverage; the traffic rule prior feature block includes a continuous rule feature sub-vector and one-hot encodings of categorical rule features, and participates in the construction of the scene embedding vector together with the temporal pattern feature block; the continuous rule feature sub-vector encodes traffic rule information, including priority, yielding, speed limit, turning, and path risk; the categorical rule features use one-hot encoding to encode the path relationship category and the symmetric turning combination category, and each one-hot sub-vector is multiplied by a preset scaling factor to strengthen its contribution to the similarity calculation; the one-hot sub-vector corresponding to the path relationship category is the product of its scaling factor and the standard basis vector at the index of the path relationship category in a preset vocabulary; the one-hot sub-vector corresponding to the symmetric turning combination category is the product of its scaling factor and the standard basis vector at the category index of the turn label combination; the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block are each block-wise L2-normalized, concatenated with weights, and then globally L2-normalized to obtain the final scene embedding vector, where the three weight coefficients correspond to the attention feature block, the temporal pattern feature block, and the traffic rule prior feature block, respectively.

9. The method according to claim 1, characterized in that step S5, in the offline stage, builds a retrieval index from the scene embedding vector of each scene in the scene library; the scene library consists of multiple highly interactive scene fragments extracted offline from traffic scene datasets by the interactive event extraction module; the highly interactive scene fragments are organized and aligned by scene identifier, interaction subject identifier, and start frame; a scene similarity matrix S and a distance matrix D are precomputed, whose entries are the cosine similarity and the cosine distance between each pair of scenes; the set of scene embedding vectors is clustered to obtain cluster labels, where each label is the cluster number of the interaction type to which a scene belongs; two scenes with the same label belong to the same cluster in the embedding space, and a label less than 0 indicates an outlier; the uniqueness score is obtained by combining the distance from a scene to its nearest neighbor scene with the average distance from the scene to all other scenes, where N is the size of the scene library, the maximum nearest-neighbor distance and the maximum average distance over the whole library are used for normalization, and the outlier indicator function takes the value 1 when the scene is an outlier and 0 otherwise; when a scene is an outlier, an outlier reward is introduced to raise its exploration priority; when no query scene is provided, an exploration sequence covering the scene space is generated by farthest point sampling (FPS), where the terms denote the index of the scene selected at each step of farthest point sampling, the uniqueness score of a scene, the distance between a candidate scene and an already selected scene, and the shortest distance from a candidate scene to the selected set; the arg max operator denotes the argument that maximizes the objective function; the sampling process prioritizes the scene farthest from the already selected set, so that coverage of the scene space expands step by step.

10. The method according to claim 1, characterized in that, in step S6, the cosine similarity is calculated from the scene embedding vectors, and a marginal gain function for unified submodular retrieval is constructed; in the cosine similarity, the two vectors are the scene embedding vector of the query scene and the scene embedding vector of the candidate scene, and since the embedding vectors have been L2-normalized, the cosine similarity reduces to their inner product; in the marginal gain function, the four weights are the relevance weight, the coverage weight, the uniqueness reward weight, and the redundancy penalty weight, respectively; the coverage gain indicator function takes the value 1 when the cluster label of the candidate scene is not yet covered by the selected set, and 0 otherwise; the uniqueness term is the uniqueness score of the candidate scene; the redundancy term is the similarity between the candidate scene and the most similar scene in the selected set, and is used to suppress redundancy.