A method for predicting an intention of a traffic participant
By integrating cross-attention mechanisms and a cascaded model of large-scale pre-trained language models, the problem of pedestrian and non-motorized vehicle intent recognition and prediction in complex traffic scenarios is solved, achieving accurate understanding of traffic scenarios and intent reasoning, thereby improving driving safety and driving experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-05-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to accurately identify and predict the diverse intentions of pedestrians and non-motorized vehicles in complex traffic scenarios. They lack a sufficient understanding of pedestrian and non-motorized vehicle behavior and key elements within the traffic environment, and existing methods are ill-suited for intention reasoning and prediction in complex situations.
We employ a cascaded approach combining an image description model and a large-scale pre-trained language model. By integrating a scene description model with a cross-attention mechanism, we retrieve traffic participants and generate natural language descriptions. We then combine cue word matching and a knowledge-driven pre-trained language model to infer intent. Finally, we construct a scene description model and a large-scale pre-trained language model for intent understanding and reasoning.
It achieves accurate intent prediction in complex traffic scenarios, can identify specific traffic elements in the scenario, improves driving safety and driving experience, and is applicable to driving assistance systems and early warning systems.
Smart Images

Figure CN116665147B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of traffic technology, specifically relating to a method for predicting the intentions of traffic participants. Background Technology
[0002] With the continuous development of computer vision and perceptual computing technologies, autonomous driving and assisted driving technologies have achieved significant progress in the transportation sector. However, in real-world driving scenarios, the random driving patterns of pedestrians and non-motorized vehicles pose serious challenges to the application of autonomous driving technology. Therefore, the intention reasoning and behavior prediction of pedestrians and non-motorized vehicles in traffic scenarios have important practical significance. Existing technologies primarily employ two methods for pedestrian and non-motorized vehicle intention reasoning and behavior prediction: image-based and video-based intention prediction. Image-based pedestrian intention prediction methods mainly classify and determine whether a pedestrian intends to cross the street based on indicators such as facial orientation and body movements. Specifically, patent document CN112329684A proposes a fixed-position pedestrian crossing intention detection method, using a fixed face camera at an intersection to identify facial orientation and classify whether a pedestrian intends to cross the street; patent document CN114550297A proposes a crossing intention detection method based on pedestrian action coding, which identifies pedestrian actions through image recognition and combines this with the traffic scene to determine the pedestrian's crossing intention. Researchers from York University and the University of Toronto identified pedestrian actions and combined them with semantic annotations of traffic text in the scene (weather, zebra crossings, traffic lights) to collaboratively predict pedestrians' intention to cross the road (Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior). Meanwhile, video-based methods for predicting pedestrian and non-motorized vehicle intentions predict intentions based on their dynamic behavior: Patent document CN109712388A proposes a method for detecting pedestrian and non-motorized vehicle crossing intentions by recording the turning-around actions and frequency of pedestrians and non-motorized vehicle drivers within a time window using a vehicle-mounted camera to determine whether they have the intention to cross the street; researchers from York University analyzed pedestrian crossing videos and determined whether pedestrians had the intention to cross the street based on the changes in posture and action time during the walking process (PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction).
[0003] The aforementioned methods for detecting pedestrian and non-motorized vehicle intentions exhibit the following characteristics: 1. Single task: Detecting pedestrian and non-motorized vehicle intentions for crossing the street is the primary focus of existing technologies; 2. Single scenario: Existing technologies only perform intention reasoning in simple, interference-free scenarios such as intersections and zebra crossings; 3. Lack of knowledge and reasoning process: Existing technologies employ black-box judgment models for classification tasks, lacking human knowledge and knowledge-driven intention reasoning processes. Existing methods lack a sufficient understanding of pedestrian and non-motorized vehicle behavior and key elements within traffic scenarios. Furthermore, the small-scale black-box models trained in simple scenarios lack the ability to perform complex intention reasoning. Therefore, existing technologies struggle to identify, reason about, and predict the diverse intentions of pedestrians and non-motorized vehicles in complex traffic scenarios.
[0004] To address this critical issue, an image description model cascaded with a large-scale pre-trained language model was used to detect the intentions of pedestrians and non-motorized vehicles in complex traffic scenarios. The image description model constructs an image processing module to understand traffic semantics, and for a continuous sequence of input traffic scene images, a text generation module forms a continuous scene description. Meanwhile, the large-scale pre-trained language model, based on vast amounts of human knowledge, uses powerful language reasoning and semantic understanding techniques to predict the intentions within the scene description. This cascaded approach endows the model with the ability to understand complex scenes and infer intentions.
[0005] Image captioning technology is a research hotspot in computer vision and natural language processing. Early researchers used convolutional neural networks to encode image features, followed by recurrent neural networks and long short-term memory networks to sequentially decode image annotations. With the development of self-attention mechanisms and deep self-attention networks (Transformers), Transformers have begun to be widely applied to image captioning tasks. The BLIP model uses a visual Transformer to mine image features and a text Transformer to mine text features, employing tasks such as image-text matching and language modeling to construct and learn image-text matching, generating image descriptions. The ViLT model inputs image and text vectors into the same Transformer network, allowing the model to fully understand the image-text interaction features and generate textual descriptions of the image. Current image captioning techniques often generate overall descriptions of input scene images, neglecting fine-grained traffic elements of interest within the scene (such as pedestrians, non-motorized vehicles, and traffic symbols).
[0006] Large-scale pre-trained language models are a research hotspot in the field of natural language processing. These models are trained on massive human-predicted knowledge bases using hundreds of billions of parameters, resulting in complex language processing capabilities such as machine translation, semantic understanding, text generation, and logical reasoning. As cross-domain general-purpose models, large-scale language models rely on carefully designed prompts to generate the user's desired feedback and employ prompt-word fine-tuning (a "question: answer" fine-tuning pattern) to fine-tune the language model's expert knowledge within the domain, altering its reasoning rules within specific knowledge areas. However, selecting prompts for complex language models is difficult, and fine-tuning of these models depends on new question-and-answer knowledge paradigms. Summary of the Invention
[0007] In view of the above, the object of the present invention is to provide a method for predicting the intentions of traffic participants, so as to achieve accurate prediction of the intentions of traffic participants.
[0008] To achieve the above-mentioned objective, an embodiment provides a method for predicting the intentions of traffic participants, comprising the following steps:
[0009] A scene description model that integrates cross-attention mechanism is used to retrieve traffic participants in scene images and describe their behavior in the scene using natural language, forming a traffic scene description sequence;
[0010] The optimal prompt word is matched with keywords extracted from traffic scene description sequences using a prompt word matching model based on a constructed prompt word library;
[0011] The intention inference is inferred by using a knowledge-driven pre-trained language model based on the combination of the optimal prompt and the traffic scene description sequence, and the inference result is output.
[0012] Preferably, the scene description model includes an associated scene retrieval module, a semantic analysis module, and a text-image conversion module;
[0013] The scene image is divided into sub-graphs and then mapped to embedded sub-graph vectors. The associated scene retrieval module provides traffic-related word vectors. Based on the traffic-related word vectors, the cross-attention weight between the embedded sub-graph vector and the associated traffic scene is calculated. This cross-attention weight is multiplied by the embedded sub-graph vector and then input to the semantic analysis module. The semantic analysis module encodes the input vector based on a multi-head deep self-attention mechanism to obtain a semantic vector, which is then input to the image-to-text conversion module. The image-to-text conversion module converts the semantic vector into natural language to form a traffic scene description sequence.
[0014] Preferably, the process of constructing the scene description model includes:
[0015] Construct sample data; prepare scene images and corresponding text descriptions to form sample data;
[0016] The training system comprises a scene retrieval module, a semantic analysis module, an image-text masking module, a comparison module, and an image-text matching module. Text descriptions in the sample data are segmented and mapped to embedded word vectors. Scene images in the sample data are divided into sub-images and then mapped to embedded sub-image vectors. The scene retrieval module provides traffic-related word vectors. Based on these traffic-related word vectors, cross-attention weights are calculated between the embedded sub-image vectors and the word embedding vectors and the associated traffic scenes. These cross-attention weights are multiplied by the embedded sub-image vectors and the word embedding vectors, respectively, and then input into the semantic analysis module. The semantic analysis module encodes the input vectors using a multi-head deep self-attention mechanism to obtain semantic vectors. The image-text masking module uses... For the image-text masking task, word vectors and embedded subgraph vectors are embedded in the random mask portion, and the reconstructed vector of the mask portion is recovered based on the semantic vector corresponding to the masked vector. The image-text matching task module is used to construct image and text matching tasks. A blank marker vector is added to the head of the weighted word embedding vector and embedded subgraph vector respectively and input to the semantic analysis module. The text global representation and image global representation corresponding to the blank marker vector output by the semantic analysis module are used to determine whether the image and text match. The comparison task module is used to construct the comparison task. The image global representation and text global representation corresponding to the matching image and text are used as positive sample pairs, and the image global representation and text global representation corresponding to the non-matching image and text are used as negative sample pairs.
[0017] Constructing multi-task loss: Multi-task loss includes masking loss for image-text masking task, matching loss for image-text matching task, and matching loss for comparison task;
[0018] Parameter optimization training system: The training system is optimized by using multi-task loss. After optimization, the semantic analysis module and the associated scene retrieval module of the optimized parameters are extracted, and then a text-image conversion module is added to obtain the scene description model.
[0019] Preferably, the cross-attention weights between the embedded subgraph vector and the associated traffic scene are calculated based on the traffic-related word vectors, including:
[0020] The vector dot product of the embedded subgraph vector and all traffic-related word vectors is calculated, concatenated to form a traffic similarity vector, and a linear transformation network is used to map the traffic similarity vector to a one-dimensional scalar, which serves as the cross-attention weight of the embedded subgraph vector.
[0021] The cross-attention weights between embedded word vectors and related traffic scenarios are calculated based on traffic-related word vectors, including:
[0022] The vector dot product of the embedded word vector and all traffic-related word vectors is calculated and concatenated to form a traffic similarity vector. Another linear transformation network is used to map the traffic similarity vector to a one-dimensional scalar, which serves as the cross-attention weight of the embedded word vector.
[0023] Preferably, the mask loss L m Represented as:
[0024]
[0025] Where i represents the index of the masked embedded word vector or subgraph vector, and j represents the dimension index in the embedded word vector or subgraph vector. Let represent the j-th dimension of the i-th masked input vector, and express The j-th dimension of the reconstructed vector obtained after semantic analysis and linear decoding, d K Indicates the dimension of the embedded subgraph vector and word vector;
[0026] The matching loss L s Represented as:
[0027] L s =y t p(c1,c2) ture +(1-y t p(c1,c2) false
[0028] Among them, y t This indicates matching tags, p(c1,c2). ture and p(c1,c2) false c1 and c2 represent the probabilities of a match and a non-match in the image-text matching task, respectively. c1 is the image header encoding and c2 is the text header encoding. The artificial neural network compares the differences between the two header encodings and uses the softamx function to generate the probability of "image-text match" or "image-text non-match".
[0029] The contrast loss L c Represented as:
[0030]
[0031]
[0032]
[0033] Where B represents batch, pair + and pair - These represent the positive sample loss and the negative sample loss, respectively, and sim() represents the similarity measure. and Let's represent the text global representation and image global representation of the b-th positive sample belonging to batch B. and Let represent the text global representation from the b1-th sample and the image global representation from the b2-th sample belonging to batch B, and τ represent the hyperparameter temperature coefficient.
[0034] The multi-task loss L pt Represented as:
[0035]
[0036] Where, σ s σ c σ m These represent the uncertainty weights of the matching loss, contrast loss, and masking loss, respectively, and need to be optimized.
[0037] Preferably, the text description, after word segmentation, is mapped to embedded word vectors, including: using word2vec to embed the segmented words into word vectors e. pos The word vector and the positional encoding of the word vector PE pos and text modal features V t Forming the embedding vector t pos This can be expressed as a formula:
[0038]
[0039] Among them, text modal features V t This represents a trainable common encoded representation;
[0040] The scene image is divided into sub-images and then mapped to embedded sub-image vectors, including: dividing the i-th sub-image x i It becomes an embedded subgraph vector p through linear mapping. i This can be expressed as a formula:
[0041] p i =w f ·flatten(x i )
[0042] Here, `flatten()` represents the stretching operation of the space matrix, and `w`... f This represents the trainable parameter matrix.
[0043] Preferably, the semantic analysis module encodes the input vector to obtain a semantic vector based on a multi-head deep self-attention mechanism, including:
[0044] The input vector is processed by a multi-head deep self-attention mechanism to obtain a coupled representation vector. This representation vector is encoded by multiple layers of fully connected, residual connected, and layer regularization to mine the image-text aligned coupled representation as a semantic vector.
[0045] Preferably, the optimal prompt word is matched based on the constructed prompt word library using a prompt word matching model to match keywords extracted from the traffic scene description sequence, including:
[0046] Multiple keywords are extracted from the traffic scene description sequence, and the average embedding word vector of the multiple keywords is used as the scene description feature. The prompt word matching model maps the scene description feature to the prompt word space of the prompt word library through a shallow nonlinear fully connected network, and selects the prompt word with the highest probability as the optimal prompt word.
[0047] Preferably, the pre-trained language model includes a scenario-based question answering module and an intent prediction module;
[0048] The scenario question answering module understands the scenario content based on the combination of the optimal input prompt and the traffic scenario description sequence, and constructs inference prompt words. The intent prediction module performs inference based on the inference prompt words to predict traffic intent and provides feedback in natural language.
[0049] Preferably, the scene description model, the prompt word matching model, and the pre-trained language model undergo collaborative asynchronous fine-tuning before being applied. The specific process includes:
[0050] The scene description model is used as the description end, and the prompt word matching model and the pre-trained language model are used as the language end;
[0051] First, the parameters of the description-side model are fixed, and the language-side is only constrained to perform correct intent reasoning under the traffic scene description. This includes: inputting labeled intents into the pre-trained language model, predicting intents, driving the pre-trained language model to analyze intent differences based on the prompt word "What are the differences between the two intents", driving the language model to analyze the knowledge gaps that cause the differences based on the prompt word "What additional knowledge needs to be supplemented to correctly predict intent", and segmenting it into dialogue paradigms to achieve instruction and knowledge fine-tuning of the pre-trained language model. At the same time, for the prompt word matching model, the difference analysis index is used as a multi-task loss, and its backpropagation gradient is used to update the prompt word matching model parameters to assist in selecting the prompt words that produce the optimal results.
[0052] Then, the parameters of the language-side model are fixed, and the parameters of the description-side model are optimized, including: optimizing the parameters of the scene description model based on the difference analysis index as the loss function;
[0053] Among them, the difference analysis index is the difference between the labeled intent and the predicted intent.
[0054] Compared with the prior art, the beneficial effects of the present invention include at least the following:
[0055] By integrating a cross-attention mechanism of image and text vectors, a scene description model based on image-text pair co-training is constructed. This model provides descriptions for scene images and tends to form specific descriptions of traffic elements (pedestrians, non-motorized vehicles, traffic symbols, and traffic lights) within the scene. It can selectively identify specific traffic elements in real-world driving scenarios, and the model generates accurate language descriptions that approach human-level proficiency.
[0056] This invention integrates scene description models and large-scale pre-trained language models for traffic scene intent understanding and inference. The proposed intent inference method can predict and infer the intentions of pedestrians and non-motorized vehicles in various real-world traffic scenarios, essentially predicting their salient intentions and identifying their potential random intentions. The intent understanding and prediction achieved by this method can help drivers and autonomous driving systems enhance their perception of traffic conditions and has broad applicability to various and complex traffic scenarios. It can be applied to intelligent driving functions such as driver assistance systems and early warning systems, effectively improving driving safety and driving experience. Attached Figure Description
[0057] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0058] Figure 1 This is a flowchart of the method for predicting the intentions of traffic participants provided in the embodiment;
[0059] Figure 2 This is a flowchart illustrating the method for predicting the intentions of traffic participants provided in the embodiment;
[0060] Figure 3 This is a schematic diagram of the training system structure for constructing a scene description model provided in the embodiment;
[0061] Figure 4 This is a schematic diagram of the joint fine-tuning process of the description end and the language end provided in the embodiment;
[0062] Figure 5 This is a flowchart of the fine-tuning description terminal provided in the embodiment;
[0063] Figure 6 This is a flowchart of the fine-tuning language interface provided in the embodiment. Detailed Implementation
[0064] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of this invention.
[0065] In traffic scenarios, pedestrians and non-motorized vehicles often exhibit different behavioral patterns compared to previous periods. These changes are related to subjective factors; for example, a pedestrian standing and looking around at a crosswalk may indicate an intention to cross the road, while a non-motorized vehicle slipping through a puddle may indicate an impending fall. Unlike judging the speed and position of other vehicles in traffic scenarios, inferring these intentions requires a deeper understanding and prediction of common pedestrian and non-motorized vehicle behavioral patterns. To address this need, embodiments of the present invention provide a method for predicting the intentions of traffic participants. By accurately identifying these changes in intention, effective driving assistance can be provided to drivers and autonomous driving systems, thereby enhancing the performance and safety of autonomous and semi-autonomous vehicles. Furthermore, the broad applicability of the prediction method provided by this invention can help reduce driver workload and improve driver safety and comfort.
[0066] The inventive concept of this invention is as follows: Addressing the technical problem that existing methods for generating overall descriptions of input scene images often neglect fine-grained traffic elements of interest within the scene, the prediction method provided by this invention introduces a cross-attention mechanism between image and text vectors. This adds weights to image-text vectors related to important traffic information from the image-text pairs, thus guiding the model to focus on key information within the image scene in a controlled manner. Furthermore, addressing the technical challenges of selecting prompt words for complex language models and the reliance on new question-answering knowledge paradigms for language model fine-tuning, the prediction method provided by this invention detects descriptive keywords for scene descriptions and uses an adaptively trained prompt word matching module to dynamically select prompt words from a prompt word library. Finally, the prediction method provided by this invention also offers an asynchronous fine-tuning strategy to reduce gradient interference and negative transfer problems in the model, enabling the model to be transferred from general scenarios across multiple domains to traffic scenarios through small-sample fine-tuning.
[0067] Based on the above inventive concept, the present invention provides a method for predicting the intentions of traffic participants, comprising the following steps:
[0068] S1 uses a scene description model with cross-attention mechanism to retrieve traffic participants in scene images and describes their behavior in the scene using natural language, forming a traffic scene description sequence.
[0069] In this embodiment, the scene description model integrating the cross-attention mechanism includes an associated scene retrieval module, a semantic analysis module, and a text-to-image conversion module. The application process is as follows: the scene image is divided into sub-graphs and then mapped into embedded sub-graph vectors. The associated scene retrieval module provides traffic-related word vectors. Based on the traffic-related word vectors, the cross-attention weight between the embedded sub-graph vector and the associated traffic scene is calculated. The cross-attention weight is multiplied by the embedded sub-graph vector and then input to the semantic analysis module. The semantic analysis module encodes the input vector based on a multi-head deep self-attention mechanism to obtain a semantic vector and inputs it to the text-to-image conversion module. The text-to-image conversion module converts the semantic vector into natural language to form a traffic scene description sequence.
[0070] The construction process of the above scene description model includes:
[0071] (a) Constructing sample data: Prepare scene images and corresponding text descriptions to form sample data.
[0072] In this embodiment, the image-description text pair dataset COCO Caption can be used as sample data for constructing a scene description model for a general scenario.
[0073] Before being input into the training system, the sample data needs to be converted into embedding vectors. Specifically, the text descriptions in the sample data are segmented and mapped to embedded word vectors. The process involves using word2vec to embed the segmented words into, for example, 256-dimensional word vectors. pos The word vector and the positional encoding of the word vector PE pos and text modal features V t Forming the embedding vector t pos This can be expressed as a formula:
[0074]
[0075] PE (pos,2i) =sin(pos / 10000) 2i / 256 )
[0076] PE (pos,2i+1) =cos(pos / 10000) 2i / 256 )
[0077] Among them, the positional encoding of word vectors PE pos The dimension is the same as the word vector dimension, and it is calculated using word vectors and trigonometric function positional encoding, where i is an integer greater than 0 and less than the vector dimension 256 / 2. The text modal feature V... t Represents trainable vector parameters, for example, 256 dimensions, and represents the flag features of the text modality.
[0078] The scene images in the sample data are divided into sub-images and then mapped to embedded sub-image vectors. This includes spatial matrices, which, unlike textual descriptions, are difficult to vectorize. Therefore, a linear mapping is used to map the scene images into different embedded vectors. For example, a scene image is divided into, for example, 196 16*16*3 sub-images, where the i-th sub-image x... i After being stretched, it undergoes a linear transformation and is mapped to the vector feature space, forming a subgraph embedding vector p. i This can be expressed as a formula:
[0079] p i =w f ·flatten(x i )
[0080] Here, `flatten()` represents the stretching operation of the spatial matrix, stretching the subgraph from a 16*16*3 three-dimensional spatial matrix to a 1*768 subgraph vector, w f This represents the trainable parameter matrix, which is directly multiplied by the stretched subgraph to reduce the subgraph vector to 256 dimensions and map it to the vector feature space.
[0081] (b) Constructing the training system and multi-task loss: The training system includes a related scene retrieval module, a semantic analysis module, an image-text masking task module, a comparison task module, and an image-text matching task module.
[0082] In this embodiment, the associated scene retrieval module provides a traffic-related word vector database and uses a cross-attention mechanism to calculate the cross-attention weights of the embedded subgraph vector and word embedding vector with the associated traffic scene based on the traffic-related word vectors, so as to promote the semantic analysis module of image-text collaboration to pay more attention to the image-text information related to traffic information.
[0083] Specifically, firstly, a traffic-related word vector database is constructed, storing traffic-related word vectors containing key traffic terms such as traffic lights, pedestrians, non-motorized vehicles, and zebra crossings. For each embedded word vector, the dot product of the embedded word vector and all traffic-related word vectors is calculated, concatenated to form a traffic similarity vector, and then a linear transformation network W is used. at Traffic similarity vectors are mapped to a one-dimensional scalar and used as cross-attention weights for embedded word vectors. Similarly, for each embedded subgraph vector, the vector dot product of the embedded subgraph vector and all traffic-related word vectors is calculated, concatenated to form a traffic similarity vector, and then another linear transformation network W is used. ap Traffic similarity vectors are mapped to one-dimensional scalars and used as cross-attention weights for embedding subgraph vectors. The parameters of the two linear transformation networks are trained iteratively, driving the association retrieval module to retrieve key traffic terms such as pedestrians and non-motorized vehicles to form a scene description. The calculation formulas for the two cross-attention weights are as follows:
[0084]
[0085]
[0086] Where δ represents the sigmoid activation function, and These are the cross-attention weights for the embedded word vectors and embedded subgraph vectors, respectively. `concat` represents the concatenation operation, and `t` and `p` represent the current embedded word vector and embedded subgraph vector, respectively. The vectors are traffic-related words, k in total, where b is the bias and W is the vector. at and W ap All are weights. The resulting two cross-attention weights and After being multiplied by the embedded subgraph vector and the word embedding vector respectively, the input is organized into the semantic analysis module according to the text order-image order pattern.
[0087] In this embodiment, the semantic analysis module uses a multi-head deep self-attention mechanism to mine the relationships within text words, spatial relationships within image subgraphs, and text-image modal alignment relationships from the input vector, generating an image-text aligned coupled representation as a semantic vector. For example, a deep self-attention encoding network (Transformer Encoder) can be used based on a self-attention mechanism to construct Query, Key, and Value vectors for each input vector to mine the relationships between different input vectors. This uses a trainable parameter matrix W. Q W K W V It is derived from a linear mapping of the input vector, and its calculation is as follows:
[0088] Q i =W Q ×IN i
[0089] K i =W K ×IN i
[0090] V i =W V ×IN i
[0091] Here, IN is the ordered set of input vectors. The dot product of the query vector Q and the key vector K is used to calculate the similarity measure between different input vectors, and the weighted vectors are added to form the coupled representation vector. The calculation process is as follows:
[0092]
[0093]
[0094] Among them, S ij This represents the similarity measure between the i-th input vector and the j-th input vector. d K represents the dimension of the generated K vector, which is set to 256 in this invention. e is the natural logarithm, and re represents the weighted coupled representation vector after probability weights are generated by the softmax activation function. For each input vector, the coupled representation vector undergoes multiple layers of fully connected, residual connected, and layer regularization operations to maintain the resolution and gradient information, and the complex image-text alignment coupling features are mined as semantic vectors.
[0095] To understand the abstract semantics of image-language within a scene, image-text masking, image-text matching, and comparison modules were added for self-supervised training of the entire scene description model.
[0096] In this embodiment, the image-text masking task module is used to construct the image-text masking task. The image and text masking task randomly masks a portion of the input embedded word vectors and image embedded vectors. Based on the semantic vectors corresponding to the masked vectors, a linear decoder is used to recover the reconstructed vectors of the masked portion as much as possible using contextual information. This masking method helps the model fully understand the contextual relationship between the image and text and uncover potential associated features. In this invention, since the image and text vectors are input to the same semantic analysis module, the reconstruction of the masked vectors simultaneously considers the coupling information of text and images, uncovering the abstract semantics corresponding to the image and text. Based on this, the squared reconstruction error is used as the masking loss L. m This can be expressed as a formula:
[0097]
[0098] Where i represents the index of the masked embedded word vector or subgraph vector, and j represents the dimension index in the embedded word vector or subgraph vector. Let represent the j-th dimension of the i-th masked input vector, and express The j-th dimension of the reconstructed vector obtained after semantic analysis and linear decoding, d K The dimensions of the embedded subgraph vector and word vector are represented. During training, the mask multi-task loss is minimized to make the reconstructed vector close to the original vector, and the network fully understands the contextual relationship between the input graph and text vectors.
[0099] In this embodiment, the image-text matching task module is used to construct the image and text matching task. During the input process of the semantic analysis module, a mismatch rate of 0.5 is used to shuffle the input text-image pairs: in normal mode, the input text describes the input image vector; in mismatch mode, the input text describes the remaining image vectors. Specifically, a blank marker vector is added to the head of the weighted word embedding vector and the embedded subgraph vector, respectively, and input to the semantic analysis module. The text global representation and image global representation corresponding to the blank marker vector output by the semantic analysis module are used to determine whether the image-text matches. The negative log-likelihood function is used as the matching loss L for the image-text matching task. s A self-supervised training network can be expressed by the following formula:
[0100] L s =y t p(c1,c2) ture +(1-y t p(c1,c2) false
[0101] Among them, y t This indicates matching tags, p(c1,c2). ture and p(c1,c2) false c1 and c2 represent the probabilities of a match and a non-match in the image-text matching task, respectively. c1 is the image header encoding and c2 is the text header encoding. The artificial neural network compares the differences between the two header encodings and uses the softamx function to generate the probability of "image-text match" or "image-text non-match". During training, by minimizing the negative log-likelihood loss, the model aligns the embedding features of the image and text modalities to understand the semantics of image-text interaction.
[0102] In this embodiment, the comparison task module is used to construct the comparison task, which divides the representation into positive and negative sample pairs. Features are learned by narrowing the information distance between positive sample pairs and widening the information distance between negative sample pairs in the representation space. Matching image-text pairs, specifically the image global representation and text global representation, are considered positive sample pairs. The text description corresponds to the image information, and they express approximate information in the representation space. Unmatching image-text pairs, specifically the image global representation and text global representation, are considered negative sample pairs. Their text descriptions do not correspond to the image information, and they express different information in the representation space. Based on this, INFONCE loss is used as the comparison loss L. c A self-supervised training network can be expressed by the following formula:
[0103]
[0104]
[0105]
[0106] Where B represents batch, pair + and pair - These represent the positive sample loss and the negative sample loss, respectively. `sim()` represents the similarity measure, which can be calculated using vector dot product, Euclidean distance, cosine similarity, etc. and Let's represent the text global representation and image global representation of the b-th positive sample belonging to batch B. and Let represent the text global representation from sample b1 and the image global representation from sample b2 belonging to batch B, and let τ represent the hyperparameter temperature coefficient, which controls the hyperparameter temperature coefficient of contrastive learning similarity.
[0107] Based on the above losses, uncertain weights are used to form the multi-task loss L in multi-task self-supervised training. pt The model is trained collaboratively, where the multi-task loss L pt This includes the masking loss for image-text masking tasks, the matching loss for image-text matching tasks, and the comparison loss for comparison tasks, expressed by the formula:
[0108]
[0109] Where, σ s σ c σ m The weights representing the uncertainty of the matching loss, contrast loss, and masking loss, respectively, need to be optimized. This novel multi-task self-supervised loss function and image-text cross-attention mechanism aim to improve the model's semantic understanding in image scene description and drive the model to focus on key traffic information in the scene.
[0110] (c) Parameter optimization training system: The parameters of the training system are optimized using multi-task loss. After optimization, the semantic analysis module and the associated scene retrieval module of the optimized parameters are extracted, and then a text-image conversion module is added to obtain the scene description model.
[0111] In this embodiment, a multi-task loss pre-training system is used to enable the system to fully understand the important semantics of pedestrians and non-motorized vehicles in traffic scenes, giving it strong scene understanding capabilities. After training, the semantic analysis module and associated scene retrieval module with optimized parameters are extracted, and then an image-to-text conversion module is added to obtain the scene description model. The image-to-text conversion module uses a Transformer-based decoder network to decode the scene text description from image semantic features. In the scene description tasks for pedestrians and non-motorized vehicles, the scene description model only inputs the embedded subgraph vectors of the scene image to generate representations, while the standalone image-to-text conversion module acts as an image-to-text decoder to understand the semantics and generate descriptions of the behavior and state of pedestrians and non-motorized vehicles in traffic scenes, forming a traffic scene description sequence.
[0112] S2 uses a prompt word matching model to match the optimal prompt word based on the constructed prompt word library for keywords extracted from the traffic scene description sequence.
[0113] For traffic scenarios, the scene image sequence generates a continuous scene description sequence through the scene description model in step S1. A large-scale pre-trained language model with hundreds of billions of parameters and massive amounts of human knowledge is used to predict the traffic intentions of pedestrians, non-motorized vehicles, and other traffic participants from continuous scene descriptions. The pre-trained language model constructs corresponding prompt words, understands the descriptive information in the scene, and completes the specified task by combining human knowledge. Therefore, the prompt words input to the large-scale pre-trained language model are particularly important during the reasoning process. To address this issue, this embodiment of the invention provides a prompt word matching strategy: a prompt word library of approximately 3,000 prompt words applicable to different traffic scene descriptions is constructed for intention reasoning in different scenarios. Based on this prompt word library, the prompt word matching model matches the optimal prompt word for keywords.
[0114] Specifically, a summary prompt word is set for the pre-trained language model: "Please summarize 10 key nouns and 10 key verbs in the continuous scene description." The pre-trained language model summarizes the specified keywords from the traffic scene description sequence and uses the average word embedding vector of these keywords as the scene description features. Then, a shallow nonlinear fully connected network maps the scene description features to the prompt word space, and selects the prompt word with the highest probability as the optimal prompt word. The calculation process is represented as follows:
[0115]
[0116] Here, `prompt` represents the optimal prompt word, `argmax` is the index of the prompt word with the highest probability, and `softmax` is the normalized exponential activation function used to form a probability distribution from the output features. `FC()` represents a shallow fully connected network with a non-linear activation function, used to map scene description features to a 3000-dimensional prompt word space. The average embedding word vectors of scene description keywords are summarized for the pre-trained language model, which are the scene description features. At this time, the prompt word matching strategy adaptively selects the prompt words most relevant to the scene, and the trainable fully connected network parameters of the selection strategy are fully trained in the fine-tuning in step S4. The intention inference evaluation is constrained, and the model selects the prompt words with the best effect based on the scene description features.
[0117] S3 utilizes a knowledge-driven pre-trained language model to infer intent based on a combination of the optimal prompt and the traffic scene description sequence, and outputs the inference result.
[0118] In this embodiment, the pre-trained language model includes a scenario-based question-answering module and an intent prediction module.
[0119] After prompt word matching, the obtained optimal prompt word is combined with a continuous traffic scene description sequence and input into a large-scale pre-trained language model for scene question answering. The scene question answering module understands the scene content based on the combination formed by the input optimal prompt word and the traffic scene description sequence. The continuous traffic scene description sequence, combined with timestamp input, helps the pre-trained language model understand the temporal order between different descriptions: the scene at second 1 is..., the scene at second 2 is... The combined description and prompt words help understand the key information in the scene description and the temporal trend of pedestrian and non-motorized vehicle behavior. Based on a full understanding of the scene content, inference prompt words are constructed by combining the key information in the scene description with stored human knowledge and experience. The constructed inference prompt word is: "Please combine traffic knowledge and human experience to infer the subsequent intentions of pedestrians and non-motorized vehicles in the above continuous scene." The intent prediction module performs inference based on the inference prompt word to predict traffic intent and provides feedback in natural language. This embodiment applies the prompt word matching model and the pre-trained language model to traffic intent inference to adapt to different traffic scenarios and improve the scene question answering effect.
[0120] In this embodiment, an image sequence-intent pair dataset for traffic scenarios is also constructed for fine-tuning the scene description model, cue word matching model, and pre-trained language model. Currently, there is no image sequence-intent dataset related to intent detection in traffic scenarios in the relevant technical fields. This invention first constructs an image sequence-intent pair dataset for traffic scenarios. The image sequences are acquired using an RGB three-channel camera mounted on the windshield of a vehicle, and the camera includes red, yellow, and green primary color channels. The acquired image information is stored at a resolution of 224×224, forming an X... i ∈R 224×224×3A three-dimensional image matrix, where 2^24 represents the horizontal and vertical pixel resolution of the acquired image, and 3 represents the RGB channels of the image. A set of m consecutively sampled scene images along the time dimension forms a scene image sequence: IM i ∈R m ×224×224×3 In image sequence acquisition, only key traffic scene elements such as pedestrians and non-motorized vehicles are considered. Therefore, only scene image sequences containing pedestrians and non-motorized vehicles are retained for understanding and predicting their traffic intentions. Outside of real-world scenarios, simulated scene data is acquired using the LGSVL driving simulator. Simulated scenes are designed to supplement traffic conditions that are difficult to capture in real-world scenarios (e.g., scenarios involving inferring the special intentions of pedestrians pressing crosswalk signal buttons). The data format for simulated scenes is completely consistent with that for real-world scenarios.
[0121] Experienced human scene intent labelers perform human intent inference for all scenarios. These labelers are required to make detailed inferences about the intentions of pedestrians and non-motorized vehicles based on traffic conditions. For each scenario, the intent inference must be expressed as accurately as possible within 100 English words, incorporating important traffic elements (such as traffic signs, traffic lights, crosswalks, and road conditions). At this point, each scenario is labeled with a human intent tag Y. i ∈[word1, wor2, wor d 3, ..., word n [ ], n≤100. For words in the annotated intent text, an unsupervised word2vec method is used to vectorize them, generating high-quality word embedding vectors. In this invention, a skip-word method (given a word's random one-hot encoding, predicting the one-hot encoding of its context words) is used to construct an unsupervised loss, mapping the random one-hot encoding of words into 256-dimensional feature vectors, and semantically similar words have similar vectors in the feature space. The annotated intent text forms Yi∈R n×256 The intent matrix, and the traffic scene image sequence-intent data set S corresponding to the image sequence. i ={IM i Y i}
[0122] In this embodiment, an asynchronous fine-tuning strategy is also provided. In S1 and S2 of this invention, the model trained on multi-domain general scenario data performs poorly in traffic scene intent reasoning and is difficult to adapt to pedestrian and non-motorized vehicle intent reasoning tasks in complex traffic scenarios. Therefore, this embodiment of the invention proposes a model fine-tuning strategy, which uses the above-mentioned collected traffic scene image sequence-intent dataset to fine-tune the scene description model, prompt word matching model, and pre-trained language model, integrates traffic scene knowledge, and transfers the multi-domain general scenario model to the traffic domain. In the traditional joint fine-tuning mode, a metric function is used to quantify the difference between manually labeled intent and model-generated intent in the traffic scene, and corresponding backpropagation gradients are generated to fine-tune the model parameters. This will cause the difference loss function to couple and interfere with the gradients of different models: the scene description model needs to dynamically adapt to the pre-trained language model, and the generated pre-trained language model expects to describe the key traffic information; while the fine-tuning of the pre-trained language model's parameters and the dynamic selection of prompt words will affect the gradient descent direction of the scene description model. The scene description model is prone to parameter fluctuations during model fine-tuning, resulting in a negative transfer effect. The asynchronous fine-tuning strategy proposed in this invention is mainly divided into two types: language-side fine-tuning and description-side fine-tuning. Specifically, the scene description model is used as the description end, and the prompt word matching model and the pre-trained language model are used as the language end. To perform asynchronous fine-tuning, this invention first constructs a measure of the difference between the intent label and the model's predicted intent based on cross-cosine similarity. Cross-cosine similarity calculates the similarity of word vectors between the label intent and the predicted intent, forming a quantified difference index, as shown in the following calculation:
[0123]
[0124] Among them, L ft As a differential analysis metric, Y represents the tag intent (sentence), while y is the word embedding vector of each word in the manually tagged intent; To predict the intent (sentence) for the model, The model predicts the word embedding vector for each word in the intent. A higher dissimilarity index indicates more similar intents, while a lower dissimilarity index indicates less similar intents.
[0125] Secondly, the asynchronous fine-tuning strategy requires initial fine-tuning on the language side. During language-side fine-tuning, all parameters of the description model are frozen, and the scene description model is not adjusted; only the language side is constrained to perform correct intent reasoning within the traffic scene description. Language-side fine-tuning comprises two parts: knowledge fine-tuning (prompt tuning) of the large-scale pre-trained language model, and prompt word matching strategy fine-tuning when inputting into the model. In the knowledge fine-tuning of the large-scale pre-trained language model, a set of fine-tuned prompt words is proposed, enabling the language model to adaptively analyze and supplement missing knowledge from the differences between labels and predicted intents. Labeled intents and predicted intents are input into the pre-trained language model. The language model analyzes intent differences based on the prompt word "What are the differences between the two intents?", and analyzes insufficient knowledge that produces these differences based on the prompt word "What additional knowledge needs to be supplemented for correct intent prediction?". This is then segmented into dialogue paradigms to achieve instruction and knowledge fine-tuning of the language model. Simultaneously, for the prompt word matching model, the difference analysis index is used as a loss function, and its backpropagation gradient is used to update the prompt word matching model parameters, assisting in selecting the prompt words that produce the optimal results. In the language-side fine-tuning, the optimization steps are as follows:
[0126]
[0127]
[0128] frozen(θ ic )
[0129]
[0130]
[0131] The optimization objective of the fine-tuning is the negative logarithmic intention difference index, where θ is... ic θ is the parameter of the scene description model. fc For prompt word matching model parameters, `frozen` indicates a parameter freezing operation. For large-scale pre-trained language models, `prompt` represents the original knowledge rules in the model, while `knowledge()` represents supplementary knowledge rules formed by the language model from the self-analysis of the difference between predicted intent and labeled intent. `add()` represents the merging of original knowledge rules and supplementary knowledge rules to form a language model with fine-tuning of supplementary knowledge. For prompt word matching models, the gradient of the optimization objective on the parameters is combined with the learning rate α to form parameter updates in the gradient direction. General pre-trained language models are fine-tuned by incorporating manually labeled traffic field data. The pre-trained language models migrate from general scenarios to traffic scenarios, tending to generate traffic element intent predictions that human drivers are interested in.
[0132] After fine-tuning bs image sequence-intent pairs on the language side, fine-tuning is performed on the description side. All model parameters on the language side are frozen, and the parameters of the description side model are fine-tuned. At this point, the fine-tuning of the scene description model is free from gradient interference from the pre-trained language model, and the result helps the model tend to generate scene descriptions that the pre-trained language model focuses on regarding traffic information. In the scene-side fine-tuning, the parameters of the scene description model are optimized based on a difference analysis metric as the loss function, so that the scene description information from the image-text co-training tends to align with the inference information required by the description language model. The optimization steps are as follows:
[0133]
[0134]
[0135] frozen(θ fc )
[0136]
[0137] In the asynchronous fine-tuning strategy, the above steps are repeated continuously. The model is fine-tuned on the dataset of traffic scene videos and human intentions collected above, without gradient interference or negative transfer. The general scene description model and pre-trained language model achieve parameter transfer to the traffic domain, enabling intent understanding, scene description, action prediction, and inference for pedestrians and non-motorized vehicles in complex traffic scenes. This asynchronous fine-tuning strategy decouples the key gradient information required by different parts of the method, reduces training interference, and achieves small-sample transfer of the general model to traffic scene intent understanding and prediction. In particular, this invention innovatively proposes a complete asynchronous fine-tuning strategy for this part to reduce gradient interference and negative transfer during the fine-tuning process and improve the traffic scene intent prediction effect.
[0138] This invention constructs a traffic scene image sequence-intent dataset with rich scenarios and accurate intent annotations. Based on this dataset, an asynchronous fine-tuning method is employed to optimize the model. The proposed asynchronous fine-tuning method helps solve the gradient interference and negative transfer problems that exist when adapting models from general scenarios across multiple domains to traffic scenarios, effectively transferring models trained on mixed-domain scenario data to the traffic domain. This fine-tuning method adapts to complex traffic scenarios and improves intent reasoning performance by incorporating human drivers' manual intent reasoning labels in complex scenarios. In practical applications, this method can reduce the traffic scene sample data required for intent reasoning models, integrate real-world scenarios to adapt to the traffic domain, and improve reasoning performance, demonstrating practical application value.
[0139] The specific embodiments described above illustrate the technical solution and beneficial effects of the present invention in detail. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for predicting the intentions of traffic participants, characterized in that, Includes the following steps: A scene description model that integrates cross-attention mechanism is used to retrieve traffic participants in scene images and describe their behavior in the scene using natural language, forming a traffic scene description sequence; The optimal prompt word is matched with keywords extracted from traffic scene description sequences using a prompt word matching model based on a constructed prompt word library; The intent is inferred and the inference result is output by using a knowledge-driven pre-trained language model based on the combination of the optimal prompt words and the traffic scene description sequence. The scene description model includes an associated scene retrieval module, a semantic analysis module, and a text-image conversion module. The scene image is divided into sub-graphs and then mapped to embedded sub-graph vectors. The associated scene retrieval module provides traffic-related word vectors. Based on the traffic-related word vectors, the cross-attention weight between the embedded sub-graph vector and the associated traffic scene is calculated. This cross-attention weight is multiplied by the embedded sub-graph vector and then input to the semantic analysis module. The semantic analysis module encodes the input vector based on a multi-head deep self-attention mechanism to obtain a semantic vector, which is then input to the image-to-text conversion module. The image-to-text conversion module converts the semantic vector into natural language to form a traffic scene description sequence.
2. The method for predicting the intentions of traffic participants according to claim 1, characterized in that, The construction process of the scene description model includes: Construct sample data; prepare scene images and corresponding text descriptions to form sample data; The training system comprises a scene retrieval module, a semantic analysis module, an image-text masking module, a comparison module, and an image-text matching module. Text descriptions in the sample data are segmented and mapped to embedded word vectors. Scene images in the sample data are divided into sub-images and then mapped to embedded sub-image vectors. The scene retrieval module provides traffic-related word vectors. Based on these traffic-related word vectors, cross-attention weights are calculated between the embedded sub-image vectors and the word embedding vectors and the associated traffic scenes. These cross-attention weights are multiplied by the embedded sub-image vectors and the word embedding vectors, respectively, and then input into the semantic analysis module. The semantic analysis module encodes the input vectors using a multi-head deep self-attention mechanism to obtain semantic vectors. The image-text masking module uses... For the image-text masking task, word vectors and embedded subgraph vectors are embedded in the random mask portion, and the reconstructed vector of the mask portion is recovered based on the semantic vector corresponding to the masked vector. The image-text matching task module is used to construct image and text matching tasks. A blank marker vector is added to the head of the weighted word embedding vector and embedded subgraph vector respectively and input to the semantic analysis module. The text global representation and image global representation corresponding to the blank marker vector output by the semantic analysis module are used to determine whether the image and text match. The comparison task module is used to construct the comparison task. The image global representation and text global representation corresponding to the matching image and text are used as positive sample pairs, and the image global representation and text global representation corresponding to the non-matching image and text are used as negative sample pairs. Constructing multi-task loss: Multi-task loss includes masking loss for image-text masking task, matching loss for image-text matching task, and matching loss for comparison task; Parameter optimization training system: The training system is optimized by using multi-task loss. After optimization, the semantic analysis module and the associated scene retrieval module of the optimized parameters are extracted, and then a text-image conversion module is added to obtain the scene description model.
3. The method for predicting the intentions of traffic participants according to claim 1 or 2, characterized in that, The cross-attention weights between the embedded subgraph vector and the associated traffic scene are calculated based on traffic-related word vectors, including: The vector dot product of the embedded subgraph vector and all traffic-related word vectors is calculated, concatenated to form a traffic similarity vector, and a linear transformation network is used to map the traffic similarity vector to a one-dimensional scalar, which serves as the cross-attention weight of the embedded subgraph vector.
4. The method for predicting the intentions of traffic participants according to claim 2, characterized in that, The mask loss L m Represented as: Where i represents the index of the masked embedded word vector or subgraph vector, and j represents the dimension index in the embedded word vector or subgraph vector. Let represent the j-th dimension of the i-th masked input vector, and express The j-th dimension of the reconstructed vector obtained after semantic analysis and linear decoding, d K Indicates the dimension of the embedded subgraph vector and word vector; The matching loss L s Represented as: L s =y t p(c1,c2) ture +(1-y t )p(c1,c2) false Among them, y t This indicates matching tags, p(c1,c2). ture and p(c1,c2) false c1 and c2 represent the probabilities of a match and a non-match in the image-text matching task, respectively. c1 is the image header encoding and c2 is the text header encoding. The artificial neural network compares the differences between the two header encodings and uses the softmax function to generate the probability of "image-text match" or "image-text non-match". The contrast loss L c Represented as: Where B represents batch, pair + and pair - These represent the positive sample loss and the negative sample loss, respectively, and sim() represents the similarity measure. and Let's represent the text global representation and image global representation of the b-th positive sample belonging to batch B. and Let represent the text global representation from the b1-th sample and the image global representation from the b2-th sample belonging to batch B, and τ represent the hyperparameter temperature coefficient. The multi-task loss L pt Represented as: Where, σ s σ c σ m These represent the uncertainty weights of the matching loss, contrast loss, and masking loss, respectively, and need to be optimized.
5. The method for predicting the intentions of traffic participants according to claim 2, characterized in that, The text description, after word segmentation, is mapped to embedded word vectors, including: using word2vec to embed the segmented words into word vectors e. pos The word vector and the positional encoding of the word vector PE pos and text modal features V t Forming the embedding vector t pos This can be expressed as a formula: Among them, text modal features V t This represents a trainable common encoded representation; The scene image is divided into sub-images and then mapped to embedded sub-image vectors, including: dividing the i-th sub-image x i It becomes an embedded subgraph vector p through linear mapping. i This can be expressed as a formula: p i =w f ·latten(x i ) Here, `flatten()` represents the stretching operation of the space matrix, and `w`... f This represents the trainable parameter matrix.
6. The method for predicting the intentions of traffic participants according to claim 1 or 2, characterized in that, The semantic analysis module encodes the input vector into a semantic vector based on a multi-head deep self-attention mechanism, including: The input vector is processed by a multi-head deep self-attention mechanism to obtain a coupled representation vector. This representation vector is encoded by multiple layers of fully connected, residual connected, and layer regularization to mine the image-text aligned coupled representation as a semantic vector.
7. The method for predicting the intentions of traffic participants according to claim 1, characterized in that, The cue word matching model uses a constructed cue word library to match optimal cue words for keywords extracted from traffic scene description sequences, including: Multiple keywords are extracted from the traffic scene description sequence, and the average embedding word vector of the multiple keywords is used as the scene description feature. The prompt word matching model maps the scene description feature to the prompt word space of the prompt word library through a shallow nonlinear fully connected network, and selects the prompt word with the highest probability as the optimal prompt word.
8. The method for predicting the intentions of traffic participants according to claim 1, characterized in that, The pre-trained language model includes a scenario-based question answering module and an intent prediction module; The scenario question answering module understands the scenario content based on the combination of the optimal input prompt words and the traffic scenario description sequence, and constructs inference prompt words. The intent prediction module performs inference based on the inference prompt words to predict traffic intent and provides feedback in natural language.
9. The method for predicting the intentions of traffic participants according to claim 1, characterized in that, The scene description model, prompt word matching model, and pre-trained language model undergo collaborative asynchronous fine-tuning before being applied. The specific process includes: The scene description model is used as the description end, and the prompt word matching model and the pre-trained language model are used as the language end; First, the parameters of the description-side model are fixed, and the language-side is only constrained to perform correct intent reasoning under the traffic scene description. This includes: inputting labeled intents into the pre-trained language model, predicting intents, driving the pre-trained language model to analyze intent differences based on the prompt word "What are the differences between the two intents?", driving the language model to analyze the knowledge gaps that cause the differences based on the prompt word "What additional knowledge needs to be supplemented to correctly predict intents?", and segmenting it into dialogue paradigms to implement the instructions and knowledge fine-tuning of the pre-trained language model. At the same time, for the prompt word matching model, the difference analysis index is used as a multi-task loss, and its backpropagation gradient is used to update the prompt word matching model parameters to assist in selecting the prompt words that produce the optimal results. Then, the parameters of the language-side model are fixed, and the parameters of the description-side model are optimized, including: optimizing the parameters of the scene description model based on the difference analysis index as the loss function; Among them, the difference analysis index is the difference between the labeled intent and the predicted intent.
Citation Information
Patent Citations
Street crossing intention detecting system and method for non-motor vehicles or pedestrians
CN109712388A
Pedestrian road-crossing intention recognition method based on gaze detection and traffic scene recognition
CN112329684A
Pedestrian intention analysis method and system
CN114550297A
Method for long-term trajectory prediction of traffic participants
CN113158539A
Method and apparatus for automatically generating inference questions and answers
WO2021184311A1