A method and apparatus for training a spatial trajectory model

By employing a two-stage training method involving supervised fine-tuning and reinforcement fine-tuning, the problem of inaccurate 3D spatial trajectory generation in existing technologies is solved, enabling precise training of the spatial trajectory model and ensuring that the generated trajectory meets actual requirements.

CN122242720A · Pending · Publication Date: 2026-06-19 · BEIJING ACAD OF ARTIFICIAL INTELLLIGENCE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ACAD OF ARTIFICIAL INTELLLIGENCE
Filing Date
2026-02-10
Publication Date
2026-06-19

Abstract

This invention provides a training method and apparatus for a spatial trajectory model, relating to the field of artificial intelligence technology. The method includes: acquiring target data, wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; obtaining a first spatial trajectory and scale information based on the target image, the target instructions, and an original spatial trajectory model; performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; and inputting the target image and the target instructions into the first spatial trajectory model, and using a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain a second spatial trajectory model. This implementation enhances the model's understanding of the actual scale of objects and effectively supervises key steps in spatial trajectory generation, ensuring the rationality of reasoning between multiple steps.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a training method and apparatus for a spatial trajectory model.

Background Technology

[0002] In the fields of embodied intelligence and robotics, robots need to understand complex spatial instructions and execute corresponding motion trajectories to accurately complete spatial tasks. This means that the underlying models need to have strong spatial trajectory understanding capabilities and be able to achieve a certain level of accuracy.

[0003] Existing model training methods mainly focus on 2D visual language models, meaning the generated trajectories are mostly two-dimensional and lack understanding of three-dimensional space. This makes these models unable to accurately perform tasks in three-dimensional space. Moreover, the generation of spatial trajectories often requires multiple steps, and existing spatial trajectory models typically neglect supervision of each step during training, resulting in generated trajectories that fail to meet users' actual needs.

Summary of the Invention

[0004] This invention provides a training method and apparatus for a spatial trajectory model, addressing the problem in existing technologies where models cannot accurately obtain three-dimensional spatial trajectories. Specifically, this invention includes two adjustment stages: supervised fine-tuning and reinforcement fine-tuning. In supervised fine-tuning, by adding scale information during model training and performing supervised fine-tuning based on this scale information, the model's understanding of the actual scale of objects can be enhanced. In reinforcement fine-tuning, reinforcement learning algorithms effectively supervise the key aspects of each step in spatial trajectory generation, ensuring the rationality of reasoning between multiple steps.

[0005] This invention provides a training method for a spatial trajectory model, comprising the following steps: acquiring target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; obtaining a first spatial trajectory and scale information based on the target image, the target instructions, and an original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; the first spatial trajectory is a three-dimensional trajectory; performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; inputting the target image and the target instructions into the first spatial trajectory model, and using a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain a second spatial trajectory model.

[0006] Optionally, obtaining the first spatial trajectory and scale information based on the target image, the target instruction, and the original spatial trajectory model includes: inputting the target image and the target instruction into the original spatial trajectory model and outputting the first spatial trajectory; and predicting the scale information based on the first spatial trajectory.

[0007] Optionally, the target image includes an RGB image and a depth image; the step of inputting the target image and the target instruction into the original spatial trajectory model and outputting a first spatial trajectory includes: encoding the RGB image using a visual encoder in the original spatial trajectory model to obtain image features, and encoding the depth image using a spatial encoder in the original spatial trajectory model to obtain geometric features; and inputting the image features, the geometric features, and the target instruction into a large language model in the original spatial trajectory model and outputting the first spatial trajectory; the step of predicting the scale information based on the first spatial trajectory includes: using the scale decoder in the original spatial trajectory model to decode the scale prediction head in the first spatial trajectory and predict the scale information.

[0008] Optionally, the step of inputting the image features, the geometric features, and the target instruction into the large language model in the original spatial trajectory model and outputting the first spatial trajectory includes: inputting the image features into the RGB projection layer in the original spatial trajectory model and inputting the geometric features into the spatial projection layer in the original spatial trajectory model to obtain the image input corresponding to the image features and the geometric input corresponding to the geometric features, respectively; inputting the image input, the geometric input, and the target instruction into the large language model and outputting the first spatial trajectory.

[0009] Optionally, the step of performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain the first spatial trajectory model includes: determining the prediction loss of the original spatial trajectory model based on the first spatial trajectory and the target spatial trajectory in the target annotation data; determining the scale regression loss of the original spatial trajectory model based on the scale information and the target scale information in the target annotation data; determining the total loss based on the prediction loss and the scale regression loss; and adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model.

[0010] Optionally, after determining the total loss based on the prediction loss and the scale regression loss, and before adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model, the method further includes: updating the parameters of the spatial projection layer based on the gradient results corresponding to the total loss; and / or updating the parameters of the scale decoder based on the scale regression loss.

[0011] Optionally, adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model includes: simultaneously adjusting the parameters of the large language model, the visual encoder, the scale decoder, the RGB projection layer, and the spatial projection layer in the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model.

[0012] Optionally, the step of using a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain a second spatial trajectory model includes: inputting the target image and the target instruction into the first spatial trajectory model; generating multiple second spatial trajectories using a reinforcement learning algorithm, wherein each second spatial trajectory is a three-dimensional trajectory; and performing reinforcement fine-tuning on the first spatial trajectory model according to the multiple second spatial trajectories and a preset reward mechanism to obtain the second spatial trajectory model.

[0013] Optionally, the step of performing reinforcement fine-tuning on the first spatial trajectory model based on multiple second spatial trajectories and a preset reward mechanism to obtain a second spatial trajectory model includes: calculating the reward score corresponding to each second spatial trajectory according to the preset reward mechanism; determining the relative advantage of each second spatial trajectory using the average reward over the multiple second spatial trajectories as the reward baseline; and performing reinforcement fine-tuning on the first spatial trajectory model based on the relative advantage of each second spatial trajectory to obtain the second spatial trajectory model.

[0014] Optionally, the reward mechanism includes: a result reward mechanism and a process reward mechanism; the step of calculating the reward score corresponding to each second spatial trajectory according to the reward mechanism includes: calculating a first score corresponding to the result reward mechanism and a second score corresponding to the process reward mechanism for each second spatial trajectory; and calculating the reward score corresponding to each second spatial trajectory based on the first score and the second score.

[0015] Optionally, calculating the first score corresponding to the result reward mechanism and the second score corresponding to the process reward mechanism for each second spatial trajectory includes: determining the first score corresponding to the result reward mechanism for each second spatial trajectory according to at least one of the following result reward strategies: whether the endpoint position corresponding to the second spatial trajectory is correct, and whether the output format of the second spatial trajectory is correct; and / or, determining the second score corresponding to the process reward mechanism for each second spatial trajectory according to at least one of the following process reward strategies: whether the predicted size of the target object conforms to a preset error range, and whether the predicted coordinates of the target object conform to a preset error range.
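The result and process reward strategies listed above can be sketched as simple pass/fail checks. This is a minimal illustration, not the method claimed: the bracketed output format `[(u, v, d), ...]`, the tolerance values, the one-point-per-check scoring, and the plain sum of the two scores are all assumptions, and every function name is hypothetical.

```python
import re

# Hypothetical output format: "[(u, v, d), (u, v, d), ...]".
WAYPOINT = re.compile(
    r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)


def parse_trajectory(text):
    """Return a list of (u, v, d) tuples, or None if the format is wrong."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return None
    body = text[1:-1]
    points = [tuple(float(g) for g in m.groups()) for m in WAYPOINT.finditer(body)]
    # Anything left besides separators means the output is malformed.
    leftover = WAYPOINT.sub("", body).replace(",", "").strip()
    if not points or leftover:
        return None
    return points


def result_score(text, target_end, end_tol=(10.0, 10.0, 0.05)):
    """First score: output format correct + endpoint position correct."""
    points = parse_trajectory(text)
    format_ok = points is not None
    end_ok = format_ok and all(
        abs(p - t) <= tol for p, t, tol in zip(points[-1], target_end, end_tol)
    )
    return float(format_ok) + float(end_ok)


def process_score(pred_size, true_size, pred_xyz, true_xyz,
                  size_rel_tol=0.1, coord_abs_tol=0.05):
    """Second score: predicted size and coordinates within a preset error range."""
    size_ok = all(abs(p - t) <= size_rel_tol * abs(t)
                  for p, t in zip(pred_size, true_size))
    coord_ok = all(abs(p - t) <= coord_abs_tol
                   for p, t in zip(pred_xyz, true_xyz))
    return float(size_ok) + float(coord_ok)


def reward_score(text, target_end, pred_size, true_size, pred_xyz, true_xyz):
    """Combine the two scores; a plain sum is an assumption."""
    return result_score(text, target_end) + process_score(
        pred_size, true_size, pred_xyz, true_xyz
    )


good = reward_score(
    "[(320, 240, 0.5), (400, 200, 0.45)]", (402.0, 198.0, 0.46),
    pred_size=(0.10, 0.10, 0.08), true_size=(0.10, 0.11, 0.08),
    pred_xyz=(0.30, 0.10, 0.50), true_xyz=(0.31, 0.10, 0.52),
)
malformed = result_score("320, 240, 0.5", (402.0, 198.0, 0.46))
```

A malformed output earns no result reward at all here, because the endpoint cannot be checked when the format check fails.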

[0016] Optionally, the step of performing reinforcement fine-tuning on the first spatial trajectory model based on the relative advantage of each of the second spatial trajectories to obtain the second spatial trajectory model includes: determining high-advantage trajectories and low-advantage trajectories from the multiple second spatial trajectories based on the relative advantage, increasing the parameter probability corresponding to the high-advantage trajectories, and decreasing the parameter probability corresponding to the low-advantage trajectories, so as to perform reinforcement fine-tuning on the first spatial trajectory model.

[0017] Optionally, the method further includes: during the reinforcement fine-tuning of the first spatial trajectory model, using KL divergence in the reinforcement learning algorithm to constrain the degree of deviation in the update of the first spatial trajectory model.
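One way to picture the advantage-weighted update with a KL constraint is the simplified loss below. It is a sketch under assumptions: it works on per-trajectory log-probabilities, omits the importance ratio and clipping that full policy-optimization algorithms typically use, and the coefficient `beta` and all function names are hypothetical. The KL term uses a common non-negative sample-based estimator rather than an exact divergence.

```python
import numpy as np


def kl_penalty(policy_logp, reference_logp):
    """Per-sample, non-negative estimate of KL(policy || reference).

    Inputs are log-probabilities of the sampled trajectories under the model
    being updated and under a frozen reference copy of it.
    """
    log_ratio = np.asarray(reference_logp) - np.asarray(policy_logp)
    return np.exp(log_ratio) - log_ratio - 1.0


def reinforcement_loss(policy_logp, reference_logp, advantages, beta=0.04):
    """Advantage-weighted log-likelihood plus a KL penalty (to be minimized).

    Positive advantages push a trajectory's log-probability up, negative
    advantages push it down; beta (an assumed value) limits drift from the
    reference model.
    """
    policy_logp = np.asarray(policy_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    policy_term = -(advantages * policy_logp).mean()
    kl_term = kl_penalty(policy_logp, reference_logp).mean()
    return float(policy_term + beta * kl_term)


logp = np.array([-1.0, -2.0])
loss_no_drift = reinforcement_loss(logp, logp, advantages=[1.0, -1.0])
loss_drifted = reinforcement_loss(logp, logp - 0.5, advantages=[1.0, -1.0])
```

When the updated model assigns the same log-probabilities as the reference, the penalty is zero and only the advantage term remains; any drift adds a positive penalty.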

[0018] The present invention also provides a training device for a spatial trajectory model, comprising the following modules: an acquisition module, used to acquire target data, wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; a prediction module, used to obtain a first spatial trajectory and scale information based on the target image, the target instruction, and the original spatial trajectory model, wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image, and the first spatial trajectory is a three-dimensional trajectory; a first adjustment module, used to perform supervised fine-tuning of the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain the first spatial trajectory model; and a second adjustment module, used to input the target image and the target instruction into the first spatial trajectory model, and to use a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain the second spatial trajectory model.

[0019] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement a training method for a spatial trajectory model as described above.

[0020] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for the spatial trajectory model as described above.

[0021] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements a training method for a spatial trajectory model as described above.

[0022] The spatial trajectory model training method and apparatus provided by this invention employ two adjustment stages: supervised fine-tuning and reinforcement fine-tuning. In the supervised fine-tuning stage, scale information is added during model training, and supervised fine-tuning is performed based on this scale information, thereby enhancing the model's understanding of the actual scale of objects. In the reinforcement fine-tuning stage, reinforcement learning algorithms effectively supervise the key aspects of each step in spatial trajectory generation, ensuring the rationality of reasoning between multiple steps.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0024] Figure 1 is one of the flowcharts of the training method for the spatial trajectory model provided by this invention.

[0025] Figure 2 is a schematic diagram of a specific process for predicting scale information provided by the present invention.

[0026] Figure 3 is a schematic diagram of a component structure of the original spatial trajectory model provided by the present invention.

[0027] Figure 4 is a schematic diagram of another component structure of the original spatial trajectory model provided by the present invention.

[0028] Figure 5 is another specific flowchart of predicting scale information provided by the present invention.

[0029] Figure 6 is a schematic diagram of a process for obtaining a first spatial trajectory model provided by the present invention.

[0030] Figure 7 is a schematic diagram of a process for obtaining a second spatial trajectory model provided by the present invention.

[0031] Figure 8 is a schematic diagram of the structure of the training device for the spatial trajectory model provided by the present invention.

[0032] Figure 9 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0033] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0034] Figure 1 is one of the flowcharts of the training method for the spatial trajectory model provided by this invention. As shown in Figure 1, the method includes the following steps:

Step 101: Obtain target data, wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions.

The target data is the training dataset used for model training. In this embodiment of the invention, the target data may include multiple target images, and each target image corresponds to target annotation data and target instructions. Specifically, a target image is an image that includes at least one target object, such as an image of an apple (target object) on a table, or an image of multiple flower pots (target objects) on a balcony. When different data types are used as the training dataset, the target data can also be point cloud data or video stream data. Target instructions are natural language descriptions of the spatial constraints corresponding to the target image, such as "water the multiple flower pots on the balcony from left to right, with the watering can hovering at a height of 1-5 cm from the flower pots." Target annotation data is the annotation of the precise coordinates and specific dimensions (length, width, and height) of objects in the target image. For example, annotation can be performed using existing automatic annotation tools or manually. This invention does not limit the specific annotation method.

Step 102: Obtain the first spatial trajectory and scale information based on the target image, the target instructions, and the original spatial trajectory model, wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image, and the first spatial trajectory is a three-dimensional trajectory.

In this context, a three-dimensional trajectory is a sequence of ordered three-dimensional coordinates (u, v, d), each comprising a two-dimensional pixel position (u, v) in the image plane and an absolute depth d. Movement along the spatial trajectory is achieved by moving to each three-dimensional coordinate in turn. Compared with the purely two-dimensional coordinates used in existing technologies, this invention explicitly includes the absolute depth d in the spatial trajectory, so that the robot can directly perform tasks in three-dimensional space based on the three-dimensional trajectory.
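As an illustration of the (u, v, d) representation described above, the sketch below stores a trajectory as an ordered list of waypoints and converts each one to camera-frame coordinates. The pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, the metre unit for depth, and the names `Waypoint` and `to_camera_frame` are assumptions introduced for this example; the source does not specify how a robot consumes the trajectory.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Waypoint:
    """One trajectory point: pixel column u, pixel row v, absolute depth d."""
    u: float
    v: float
    d: float  # assumed to be in metres


def to_camera_frame(wp: Waypoint, fx: float, fy: float,
                    cx: float, cy: float) -> Tuple[float, float, float]:
    """Back-project (u, v, d) to (x, y, z) with a pinhole model (assumption)."""
    x = (wp.u - cx) * wp.d / fx
    y = (wp.v - cy) * wp.d / fy
    return (x, y, wp.d)


# An ordered sequence of waypoints; the robot visits them one after another.
trajectory: List[Waypoint] = [
    Waypoint(320.0, 240.0, 0.50),
    Waypoint(400.0, 240.0, 0.50),
    Waypoint(400.0, 200.0, 0.45),
]

points = [to_camera_frame(wp, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
          for wp in trajectory]
```

Without the depth component d, the same pixel path would be consistent with infinitely many 3D paths, which is why a purely two-dimensional trajectory cannot be executed directly.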

[0035] For the original spatial trajectory model, its output is usually a spatial trajectory with a fixed output format. Therefore, in an optional embodiment of the present invention, step 102 further includes: inputting the target image and target instructions into the original spatial trajectory model to obtain a first spatial trajectory; and predicting the scale information based on the first spatial trajectory. That is, the scale information in this embodiment of the present invention is obtained based on the first spatial trajectory output by the model.

[0036] In one optional embodiment, the original spatial trajectory model is also trained based on training data. Specifically, the training data can be constructed based on a static scene dataset. By combining a static 3D scan dataset and advanced motion planning techniques, high-quality motion planning trajectory data is obtained, thereby improving the robot model's spatial trajectory understanding ability. In a further optional embodiment, the process of constructing the training data includes: determining a static scene based on the static scene dataset; sampling in the static scene to determine the collision-free motion trajectory planning endpoint; determining the collision-free motion trajectory planning starting point at the escape starting position using an initial escape mechanism based on the starting position in the static scene; and performing active obstacle avoidance planning based on the starting position, the collision-free motion trajectory planning starting point at the escape starting position, and the collision-free motion trajectory planning endpoint to generate training data.

[0037] It is understandable that there is a scaling ratio between the size of the target object in the image and its actual size. Since the model itself cannot know the actual size of the target object during the image recognition process, it can only make predictions. This leads to the problem of whether the predicted value is consistent with the actual value. Therefore, in this embodiment of the invention, scale information is added for model training in order to improve the perception of the size of the target object. In this way, the generated spatial trajectory is a spatial trajectory based on the actual size of the target object, which can effectively avoid other objects in the target image and avoid collisions with other objects during the movement of the target object.

[0038] In an optional embodiment, the target image includes an RGB image and a depth image. The process of obtaining scale information based on the target image, target instructions, and the original spatial trajectory model in step 102 can be as shown in Figure 2, and includes:

Step 201: Encode the RGB image using the visual encoder in the original spatial trajectory model to obtain image features, and encode the depth image using the spatial encoder in the original spatial trajectory model to obtain geometric features;

Step 202: Input the image features, geometric features, and target instructions into the large language model in the original spatial trajectory model, and output the first spatial trajectory;

Step 203: Use the scale decoder in the original spatial trajectory model to decode the scale prediction head in the first spatial trajectory and predict the scale information.

[0039] The target image can be either an RGB image alone (i.e., a photographic image in the conventional sense) or an RGB image together with a depth image. Without a depth image, the original spatial trajectory model predicts the absolute depth from the RGB image and outputs a 3D trajectory. However, if both an RGB image and a depth image are input, the geometric features can be obtained directly from the depth image, leading to a more accurate absolute depth and 3D trajectory. Therefore, in this embodiment of the invention, it is preferable to input both an RGB image and a depth image, which enables better perception of absolute depth.

[0040] In this embodiment of the invention, the original spatial trajectory model can be a visual language model (VLM) that includes multiple components. As shown in Figure 3, the components are a visual encoder, a spatial encoder, a Large Language Model (LLM), and a scale decoder. The Large Language Model, as a deep learning model trained on large amounts of text data, can only generate natural language text and understand its meaning; it cannot directly recognize images. Therefore, a visual encoder and a spatial encoder are needed to encode the RGB and depth images respectively, producing image features and geometric features in a machine-readable representation. The first spatial trajectory is then output based on these image and geometric features. Furthermore, the Large Language Model (LLM) outputs text tokens; that is, the generated first spatial trajectory is actually represented by a series of text tokens, each text token corresponding to a three-dimensional spatial coordinate on the trajectory.

[0041] The scale decoder is a multi-layer perceptron (MLP) structure. Since the original spatial trajectory model is an initial model that has not yet been trained or optimized, it needs to predict an initial value for the scale information based on the first spatial trajectory. This value is obtained from the scale prediction head in the Large Language Model (LLM). Specifically, the scale decoder scans the text tokens output by the Large Language Model. When it recognizes the special symbol (i.e., the <scale> token), it reads the value corresponding to the <scale> token to obtain the scale information.
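The <scale> token lookup can be sketched as follows. This is a hedged illustration only: the source says the scale decoder is a multi-layer perceptron that reads the value associated with the <scale> token, but it does not say what that value is. Here it is assumed to be the language model's hidden state at that token position, and the decoder is assumed to predict a log-scale that is exponentiated so the result stays positive. The layer sizes and the names `init_mlp` and `decode_scale` are hypothetical.

```python
import numpy as np

SCALE_TOKEN = "<scale>"


def init_mlp(hidden_dim, inner_dim, rng):
    """Randomly initialised two-layer perceptron (sizes are assumptions)."""
    return {
        "w1": rng.normal(0.0, 0.02, (hidden_dim, inner_dim)),
        "b1": np.zeros(inner_dim),
        "w2": rng.normal(0.0, 0.02, (inner_dim, 1)),
        "b2": np.zeros(1),
    }


def decode_scale(tokens, hidden_states, mlp):
    """Predict scale from the hidden state at the first <scale> token.

    Returns None when the output contains no <scale> token.
    """
    if SCALE_TOKEN not in tokens:
        return None
    pos = tokens.index(SCALE_TOKEN)
    h = np.maximum(hidden_states[pos] @ mlp["w1"] + mlp["b1"], 0.0)  # ReLU
    log_scale = (h @ mlp["w2"] + mlp["b2"])[0]
    return float(np.exp(log_scale))  # exponentiate so the scale is positive


rng = np.random.default_rng(0)
tokens = ["(320,240,0.50)", "(400,240,0.50)", SCALE_TOKEN]
hidden_states = rng.normal(size=(len(tokens), 16))  # stand-in for LLM states
mlp = init_mlp(hidden_dim=16, inner_dim=32, rng=rng)
scale = decode_scale(tokens, hidden_states, mlp)
```

With untrained weights the predicted scale is only an initial value, which matches the role this paragraph gives it before supervised fine-tuning.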

[0042] Furthermore, although image features and geometric features are already in a form that machines can process, for Large Language Models (LLMs) there is still a matching problem between geometric features and language semantics. Therefore, the Visual Language Model (VLM) in this embodiment of the invention also includes an RGB projection layer and a spatial projection layer, as shown in Figure 4. The RGB projection layer and spatial projection layer enable the LLM to recognize the image features and geometric features passed on by the visual encoder and spatial encoder, ensuring that the meaning understood by the LLM is the same as or similar to the meaning of the image and geometric features output by the two encoders. Therefore, in an optional embodiment, the process of predicting scale information in this invention is as shown in Figure 5, and specifically includes:

Step 501: Obtain target data, wherein the target data includes an RGB image, a depth image, target annotation data corresponding to the target image, and target instructions;

Step 502: Encode the RGB image using the visual encoder in the original spatial trajectory model to obtain image features, and encode the depth image using the spatial encoder in the original spatial trajectory model to obtain geometric features;

Step 503: Input the image features into the RGB projection layer in the original spatial trajectory model, and input the geometric features into the spatial projection layer in the original spatial trajectory model, to obtain the image input corresponding to the image features and the geometric input corresponding to the geometric features, respectively;

Step 504: Input the image input, geometric input, and target instruction into the large language model and output the first spatial trajectory.

[0043] Step 505: Use the scale decoder in the original spatial trajectory model to decode the scale prediction head in the first spatial trajectory and predict the scale information. Through the above steps, the first spatial trajectory can be output based on the target image and target instructions, and the initial values for scale information are set.
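The data flow of steps 502 to 504 can be sketched with stand-in components to show how the two feature streams reach the language model. Everything concrete here is an assumption: the encoders are replaced by random linear maps over image patches, the feature widths are arbitrary, and concatenating image input, geometric input, and instruction embeddings in that order is only one plausible way of feeding them to the large language model.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Stand-in for a learned layer: a fixed random linear map."""
    w = rng.normal(0.0, 0.02, (in_dim, out_dim))
    return lambda x: x @ w


VIS_DIM, GEO_DIM, LLM_DIM, PATCH = 32, 24, 64, 16   # arbitrary sizes
visual_encoder = linear(3 * PATCH * PATCH, VIS_DIM)  # RGB patches -> image features
spatial_encoder = linear(PATCH * PATCH, GEO_DIM)     # depth patches -> geometric features
rgb_projection = linear(VIS_DIM, LLM_DIM)            # image features -> image input
spatial_projection = linear(GEO_DIM, LLM_DIM)        # geometric features -> geometric input


def patchify(image, patch=PATCH):
    """Split an (H, W, C) array into flattened non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    x = image[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * c)


def build_llm_inputs(rgb, depth, instruction_embeddings):
    # Step 502: encode each modality with its own encoder.
    image_features = visual_encoder(patchify(rgb))
    geometric_features = spatial_encoder(patchify(depth[..., None]))
    # Step 503: project both feature sets into the LLM embedding width.
    image_input = rgb_projection(image_features)
    geometric_input = spatial_projection(geometric_features)
    # Step 504 (input side): one sequence for the large language model.
    return np.concatenate([image_input, geometric_input, instruction_embeddings], axis=0)


rgb = rng.random((64, 64, 3))
depth = rng.random((64, 64))
instruction_embeddings = rng.normal(size=(5, LLM_DIM))
sequence = build_llm_inputs(rgb, depth, instruction_embeddings)
```

The two projection layers are what bring features of different widths to a common embedding width, so the language model can treat them like ordinary input tokens.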

[0044] Step 103: Based on the scale information, the first spatial trajectory, and the target annotation data, perform supervised fine-tuning on the original spatial trajectory model to obtain the first spatial trajectory model.

In this step, the present invention uses two loss functions, the prediction loss and the scale regression loss, to perform supervised fine-tuning of the original spatial trajectory model. The specific fine-tuning process can be as shown in Figure 6, and includes:

Step 601: Determine the prediction loss of the original spatial trajectory model based on the first spatial trajectory and the target spatial trajectory in the target annotation data;

Step 602: Determine the scale regression loss of the original spatial trajectory model based on the scale information and the target scale information in the target annotation data;

Step 603: Determine the total loss based on the prediction loss and the scale regression loss;

Step 604: Adjust the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model.

[0045] It is understandable that the target spatial trajectory and the target scale information have already been labeled in the target data. Therefore, the prediction loss and scale regression loss can be calculated directly based on the difference between the first spatial trajectory output by the original spatial trajectory model and the target spatial trajectory, as well as the difference between the scale information and the target scale information.

[0046] For example, the prediction loss can be calculated by comparing the probability distribution of each text character (Token) in the first spatial trajectory with that of each target text character in the target spatial trajectory. The scale regression loss can be calculated by the squared difference (mean squared error) between the logarithm of the scale information and the logarithm of the target scale information.

[0047] In an alternative embodiment, the total loss = prediction loss + 0.1 * scale regression loss, because the accuracy of the spatial trajectory is more important than the scaling of the object, and therefore the prediction loss has a greater weight.
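The two losses and their weighted sum can be written out as below. The 0.1 weight and the log-space squared error come from the text; using token-level cross-entropy for the prediction loss is an assumption about what "comparing the probability distribution of each token" means, and the function names are hypothetical.

```python
import numpy as np


def prediction_loss(logits, target_ids):
    """Mean token-level cross-entropy. logits: (T, V); target_ids: (T,)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), np.asarray(target_ids)]
    return float(-picked.mean())


def scale_regression_loss(pred_scale, target_scale):
    """Mean squared error between log(predicted scale) and log(target scale)."""
    diff = np.log(np.asarray(pred_scale, dtype=float)) - np.log(
        np.asarray(target_scale, dtype=float)
    )
    return float(np.mean(diff ** 2))


def total_loss(logits, target_ids, pred_scale, target_scale, scale_weight=0.1):
    """total = prediction loss + 0.1 * scale regression loss."""
    return prediction_loss(logits, target_ids) + scale_weight * scale_regression_loss(
        pred_scale, target_scale
    )


# Two tokens over a 4-entry vocabulary with uniform logits, and a scale
# prediction that is off by a factor of e.
loss = total_loss(np.zeros((2, 4)), [0, 1], pred_scale=[np.e], target_scale=[1.0])
```

Working in log space makes the scale error relative: predicting twice the true scale costs the same as predicting half of it.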

[0048] In this invention, since the scale information has only been set to an initial value, the scaling ratio between the size of the target object in the image and its actual size is not yet accurate, and neither the RGB projection layer nor the spatial projection layer has been updated. Therefore, directly adjusting the original spatial trajectory model to minimize the total loss at this point may still result in low accuracy that fails to meet user requirements. Thus, in a further optional embodiment, the following steps may be included after step 603 and before step 604: updating the parameters of the spatial projection layer based on the gradient result corresponding to the total loss; and updating the parameters of the scale decoder (i.e., the component used to predict the scale information) based on the scale regression loss. By first updating the RGB projection layer and the spatial projection layer based on the total loss, more accurate image and geometric inputs can be obtained. At the same time, updating the scale decoder based on the scale regression loss yields a more accurate scaling ratio. In other words, a more accurate output can be obtained only when the input data and the scaling ratio are accurate.

[0049] It should be noted here that when updating the spatial projection layer and the scale decoder, it is necessary to ensure that other parameters in the visual language model (VLM) remain unchanged, that is, all parameters in the visual encoder, spatial encoder and large language model (LLM) remain unchanged, in order to ensure metric alignment.

[0050] After the spatial projection layer and scale decoder are updated, in one optional embodiment, with the goal of minimizing the total loss, the parameters of the large language model, visual encoder, scale decoder, RGB projection layer, and spatial projection layer in the original spatial trajectory model are simultaneously adjusted to obtain the first spatial trajectory model. In other words, the spatial projection layer and scale decoder actually undergo two parameter adjustments: the first adjusts only these two components on their own, and the second adjusts them jointly with the other components in the visual language model (VLM), keeping only the spatial encoder parameters unchanged to preserve strong 3D recognition performance.
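The two rounds of parameter adjustment can be summarised as a freeze schedule. The sketch below only records which components receive updates in each round; the component keys and the function `trainable_components` are hypothetical names, and the scale decoder is treated as part of the jointly tuned set in the second round.

```python
# Components of the visual language model, as named in the description.
COMPONENTS = (
    "visual_encoder",
    "spatial_encoder",
    "rgb_projection",
    "spatial_projection",
    "llm",
    "scale_decoder",
)


def trainable_components(phase):
    """Return the set of components whose parameters are updated in a phase.

    Phase 1: only the spatial projection layer and the scale decoder are
             updated; everything else stays frozen.
    Phase 2: everything is tuned jointly except the spatial encoder, which
             stays frozen throughout.
    """
    if phase == 1:
        return {"spatial_projection", "scale_decoder"}
    if phase == 2:
        return set(COMPONENTS) - {"spatial_encoder"}
    raise ValueError(f"unknown phase: {phase}")


# Per-phase freeze map, e.g. for setting a "requires gradient" flag.
schedule = {
    phase: {name: name in trainable_components(phase) for name in COMPONENTS}
    for phase in (1, 2)
}
```

In a deep learning framework, such a map would typically drive whether each component's parameters are passed to the optimizer in a given phase.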

[0051] Step 104: Input the target image and target instruction into the first spatial trajectory model, and use a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain the second spatial trajectory model.

[0052] In summary, this invention adjusts the parameters of the original spatial trajectory model from the perspective of the loss function, and then supervises each step in the spatial trajectory generation process through step 104, thereby ensuring the rationality and practicality of the output spatial trajectory.

[0053] Specifically, in one optional embodiment, the process of obtaining the second spatial trajectory model includes: inputting the target image and target instruction into the first spatial trajectory model, and generating multiple second spatial trajectories using a reinforcement learning algorithm, the second spatial trajectories being three-dimensional trajectories; and, based on the multiple second spatial trajectories and a preset reward mechanism, performing reinforcement fine-tuning on the first spatial trajectory model to obtain the second spatial trajectory model. Here, the first spatial trajectory and the second spatial trajectory are simply the outputs of the original spatial trajectory model and the first spatial trajectory model, respectively; the terms "first" and "second" do not imply any priority or necessary relationship between the trajectories. The reinforcement learning algorithm can be the Group Relative Policy Optimization (GRPO) algorithm, which is designed to improve the reasoning ability of large language models. It calculates the policy gradient by comparing samples within a group.

[0054] In an optional embodiment, the specific process of obtaining the second spatial trajectory model according to the present invention may be as shown in Figure 7, and includes:
Step 701: Input the target image and target instruction into the first spatial trajectory model, and use a reinforcement learning algorithm to generate multiple second spatial trajectories;
Step 702: Calculate the reward score corresponding to each second spatial trajectory according to the preset reward mechanism;
Step 703: Using the average of the reward scores of the second spatial trajectories as the reward baseline, determine the relative advantage of each second spatial trajectory;
Step 704: Based on the relative advantage of each second spatial trajectory, perform reinforcement fine-tuning on the first spatial trajectory model to obtain the second spatial trajectory model.

[0055] Specifically, based on the relative advantages, high-advantage and low-advantage trajectories can be determined from the multiple second spatial trajectories. The parameter probabilities corresponding to high-advantage trajectories can be increased, while those corresponding to low-advantage trajectories can be decreased, in order to perform reinforcement fine-tuning on the first spatial trajectory model.

[0056] As can be seen, by generating multiple second spatial trajectories and determining the relative advantage of each second spatial trajectory from its reward score and the reward baseline, the embodiments of the present invention can further perform reinforcement fine-tuning on the model based on these relative advantages.
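As an illustration of steps 702 to 704, the relative advantage of each sampled trajectory can be computed as its reward score minus the group-average baseline. This is a minimal sketch under stated assumptions: the function names and the reward values are hypothetical, and trajectories with an advantage above or below zero are treated as high-advantage or low-advantage. Published descriptions of GRPO commonly also divide by the group standard deviation; the text here mentions only the average baseline, so no such normalization is applied.

```python
from statistics import mean


def relative_advantages(rewards):
    """Advantage of each trajectory: its reward minus the group-average reward."""
    if not rewards:
        raise ValueError("at least one reward score is required")
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def split_by_advantage(advantages):
    """Return indices of high-advantage (> 0) and low-advantage (< 0) trajectories."""
    high = [i for i, a in enumerate(advantages) if a > 0]
    low = [i for i, a in enumerate(advantages) if a < 0]
    return high, low


# Hypothetical reward scores for four sampled second spatial trajectories.
rewards = [2.0, 1.0, 0.0, 1.0]
advantages = relative_advantages(rewards)
high, low = split_by_advantage(advantages)
print(advantages, high, low)
```

A policy update would then raise the probability of the trajectories indexed by `high` and lower that of the trajectories indexed by `low`.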

[0057] Furthermore, the reward mechanism in this invention includes a result reward mechanism and a process reward mechanism. The process of calculating the reward score in step 702 specifically includes: calculating the first score corresponding to the result reward mechanism and the second score corresponding to the process reward mechanism for each second spatial trajectory; and calculating the reward score corresponding to each second spatial trajectory based on the first score and the second score.

[0058] Furthermore, for each second spatial trajectory, the first score corresponding to the result reward mechanism is determined according to at least one of the following result reward strategies: whether the endpoint position corresponding to the second spatial trajectory is correct, and whether the output format of the second spatial trajectory is correct. For each second spatial trajectory, the second score corresponding to the process reward mechanism is determined according to at least one of the following process reward strategies: whether the predicted size of the target object falls within the preset error range, and whether the predicted coordinates of the target object fall within the preset error range.

[0059] For example, the process of determining the first score based on whether the endpoint position corresponding to the second spatial trajectory is correct may include: calculating the Euclidean distance (straight-line distance) between the start and end points of the second spatial trajectory and the actual start and end points; if the start and end points of the second spatial trajectory coincide with the actual start and end points, the first score is determined to be 1; the greater the distance, the smaller the first score (down to a minimum of 0). This result reward strategy encourages the model to generate spatial trajectories with accurate start and end points.
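The endpoint-based result reward can be sketched as follows. The text specifies only that the score is 1 when the points coincide and decreases toward 0 as the distance grows; the linear decay, the averaging of the start and end errors, and the `max_dist` parameter are assumptions made for illustration.

```python
import math


def endpoint_score(pred_traj, true_start, true_end, max_dist=1.0):
    """Score in [0, 1]: 1 when both endpoints coincide, falling linearly to 0.

    pred_traj is a sequence of (x, y, z) points; max_dist is a hypothetical
    distance at which the score reaches 0.
    """
    if len(pred_traj) < 2:
        return 0.0  # a trajectory needs at least a start and an end point
    start_err = math.dist(pred_traj[0], true_start)
    end_err = math.dist(pred_traj[-1], true_end)
    mean_err = (start_err + end_err) / 2.0
    return max(0.0, 1.0 - mean_err / max_dist)


exact = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
print(endpoint_score(exact, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)))
```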

[0060] For example, the process of determining the first score based on whether the output format of the second spatial trajectory is correct may include: checking whether the second spatial trajectory output by the first spatial trajectory model meets a preset format (e.g., whether it uses <think> and <answer> tags to enclose the content). If the format is completely correct, the first score is set to 1; otherwise, the first score is set to 0. This result reward strategy guarantees structured output, which facilitates subsequent correlation analysis.
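A format check of this kind can be sketched with a regular expression. The exact preset format is not specified in the text, so the pattern below (one `<think>` block followed by one `<answer>` block and nothing else) is an assumption, and the check is deliberately loose about what appears inside each block.

```python
import re

# Assumed format: <think>...</think> followed by <answer>...</answer>.
_FORMAT = re.compile(
    r"\A\s*<think>.*?</think>\s*<answer>.*?</answer>\s*\Z",
    re.DOTALL,
)


def format_score(output_text):
    """Return 1.0 if the output follows the assumed tag format, else 0.0."""
    return 1.0 if _FORMAT.match(output_text) else 0.0


good = "<think>move above the cup</think><answer>[(0, 0, 0), (1, 1, 1)]</answer>"
bad = "[(0, 0, 0), (1, 1, 1)]"
print(format_score(good), format_score(bad))
```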

[0061] For example, the preset error range can be -30% to 30%, which means that the predicted size and coordinates of the target object need to be within a certain accuracy. If the predicted size and coordinates of the target object both meet the preset error range, the second score is set to 1; otherwise, the second score is set to 0. This reward strategy forces the model to learn precise physical measurements rather than vague physical descriptions.
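The binary process reward can be sketched as below. The ±30% range comes from the text; applying it as a per-component relative error to both the size and the coordinates, and the handling of a zero reference value, are assumptions for illustration.

```python
def within_tolerance(predicted, actual, tol=0.30):
    """True if the relative error of predicted with respect to actual is within tol."""
    if actual == 0:
        # Relative error is undefined for a zero reference; require an exact match.
        return predicted == 0
    return abs(predicted - actual) / abs(actual) <= tol


def process_score(pred_size, true_size, pred_coords, true_coords, tol=0.30):
    """Return 1.0 only if every size and coordinate component is within tolerance."""
    if len(pred_size) != len(true_size) or len(pred_coords) != len(true_coords):
        return 0.0
    size_ok = all(within_tolerance(p, a, tol) for p, a in zip(pred_size, true_size))
    coords_ok = all(within_tolerance(p, a, tol) for p, a in zip(pred_coords, true_coords))
    return 1.0 if size_ok and coords_ok else 0.0


# Hypothetical values in meters: object size (w, h, d) and object position (x, y, z).
print(process_score((0.10, 0.20, 0.30), (0.12, 0.20, 0.30),
                    (1.0, 2.0, 0.5), (1.0, 2.0, 0.5)))
```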

[0062] It should also be noted that, to ensure that the second spatial trajectory model does not deviate too far from the first spatial trajectory model, in an optional embodiment of the present invention, the KL divergence in the reinforcement learning algorithm is used during reinforcement fine-tuning to constrain how far each update moves the first spatial trajectory model. KL divergence (Kullback-Leibler divergence) is an important tool for measuring the difference between two probability distributions. By incorporating the KL divergence directly into the loss function as a regularization term, the deviation between the learned policy and the reference model can be effectively constrained, preventing the model from drifting too far from its initial state during training.
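The effect of a KL regularization term can be sketched on small discrete distributions. This is an illustration only: the exact KL computation below, the weight `beta`, and the function names are assumptions, and a practical reinforcement learning implementation would typically estimate the divergence from sampled token probabilities instead.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def regularized_loss(policy_loss, policy_probs, reference_probs, beta=0.04):
    """Policy loss plus a KL penalty that grows as the policy leaves the reference."""
    return policy_loss + beta * kl_divergence(policy_probs, reference_probs)


reference = [0.5, 0.3, 0.2]   # hypothetical reference-model distribution
near = [0.5, 0.3, 0.2]        # unchanged policy: no penalty
far = [0.9, 0.05, 0.05]       # strongly shifted policy: positive penalty
print(regularized_loss(1.0, near, reference), regularized_loss(1.0, far, reference))
```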

[0063] In summary, the spatial trajectory model training method provided in this embodiment of the invention uses two adjustment stages: supervised fine-tuning and reinforcement fine-tuning. In the supervised fine-tuning stage, scale information is added to the training process and the model is fine-tuned under supervision based on this scale information, which enhances the model's understanding of the actual scale of objects. In the reinforcement fine-tuning stage, the reinforcement learning algorithm effectively supervises the key steps in spatial trajectory generation, ensuring the rationality of reasoning across multiple steps.

[0064] The training apparatus for the spatial trajectory model provided by the present invention is described below. The training apparatus described below and the training method described above may be read with reference to each other.

[0065] As shown in Figure 8, the training device 800 for the spatial trajectory model provided by the present invention includes:
The acquisition module 801, used to acquire target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions;
The prediction module 802, used to obtain a first spatial trajectory and scale information based on the target image, the target instruction, and the original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image, and the first spatial trajectory is a three-dimensional trajectory;
The first adjustment module 803, used to perform supervised fine-tuning of the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain the first spatial trajectory model;
The second adjustment module 804, used to input the target image and the target instruction into the first spatial trajectory model, and to use a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain the second spatial trajectory model.
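The data flow between the four modules can be sketched as a simple composition. The class name, field names, and placeholder callables below are hypothetical and perform no real training; they only show the order in which the modules hand results to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class TrainingDevice:
    """Hypothetical stand-in for modules 801 to 804 of the training device."""
    acquire: Callable[[], Dict[str, Any]]                                  # module 801
    predict: Callable[[Any, Dict[str, Any]], Tuple[Any, float]]            # module 802
    supervised_finetune: Callable[[Any, Any, float, Dict[str, Any]], Any]  # module 803
    reinforcement_finetune: Callable[[Any, Dict[str, Any]], Any]           # module 804

    def run(self, original_model: Any) -> Any:
        data = self.acquire()
        first_trajectory, scale = self.predict(original_model, data)
        first_model = self.supervised_finetune(original_model, first_trajectory, scale, data)
        return self.reinforcement_finetune(first_model, data)


# Placeholder callables so the sketch runs end to end.
device = TrainingDevice(
    acquire=lambda: {"image": None, "annotation": None, "instruction": "move the cup"},
    predict=lambda model, data: ([(0.0, 0.0, 0.0)], 1.0),
    supervised_finetune=lambda model, traj, scale, data: model + "+sft",
    reinforcement_finetune=lambda model, data: model + "+rft",
)
print(device.run("original"))
```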

[0066] In an optional embodiment of the present invention, the target image includes an RGB image and a depth image; the prediction module 802 is further configured to: encode the RGB image using the visual encoder in the original spatial trajectory model to obtain image features, and encode the depth image using the spatial encoder in the original spatial trajectory model to obtain geometric features; input the image features, the geometric features, and the target instruction into the large language model in the original spatial trajectory model to output the first spatial trajectory; and decode the scale prediction head in the first spatial trajectory using the scale decoder in the original spatial trajectory model to predict the scale information.

[0067] In an optional embodiment of the present invention, the prediction module 802 is further configured to input the target image and the target instruction into the original spatial trajectory model and output a first spatial trajectory; and predict the scale information based on the first spatial trajectory.

[0068] In an optional embodiment of the present invention, the target image includes an RGB image and a depth image; the prediction module 802 is further configured to input the image features into the RGB projection layer of the original spatial trajectory model, and input the geometric features into the spatial projection layer of the original spatial trajectory model, to obtain the image input corresponding to the image features and the geometric input corresponding to the geometric features, respectively; and input the image input, the geometric input and the target instruction into the large language model to output the first spatial trajectory.

[0069] In an optional embodiment of the present invention, the first adjustment module 803 is further configured to: determine the prediction loss of the original spatial trajectory model based on the first spatial trajectory and the target spatial trajectory in the target annotation data; determine the scale regression loss of the original spatial trajectory model based on the scale information and the target scale information in the target annotation data; determine the total loss based on the prediction loss and the scale regression loss; and adjust the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model.
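The loss computation performed by the first adjustment module can be sketched as follows. The text states only that the total loss is determined from the prediction loss and the scale regression loss; the mean squared error used for the prediction loss, the squared error used for the scale regression loss, the weighted sum, and the `scale_weight` parameter are all assumptions for illustration.

```python
def prediction_loss(pred_traj, target_traj):
    """Mean squared error over trajectory points (assumed stand-in for the prediction loss)."""
    if len(pred_traj) != len(target_traj) or not target_traj:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum(
        (p - t) ** 2
        for pred_pt, target_pt in zip(pred_traj, target_traj)
        for p, t in zip(pred_pt, target_pt)
    )
    return squared / len(target_traj)


def scale_regression_loss(pred_scale, target_scale):
    """Squared error between predicted and annotated scale (assumed form)."""
    return (pred_scale - target_scale) ** 2


def total_loss(pred_traj, target_traj, pred_scale, target_scale, scale_weight=1.0):
    """Assumed combination: prediction loss plus weighted scale regression loss."""
    return (prediction_loss(pred_traj, target_traj)
            + scale_weight * scale_regression_loss(pred_scale, target_scale))


pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
target = [(0.0, 0.0, 0.0), (1.0, 1.0, 2.0)]
print(total_loss(pred, target, pred_scale=1.0, target_scale=1.5))
```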

[0070] In an optional embodiment of the present invention, the first adjustment module 803 is further configured to, after determining the total loss based on the prediction loss and the scale regression loss, and before adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model, update the parameters of the spatial projection layer based on the gradient result corresponding to the total loss; and / or update the parameters of the scale information based on the scale regression loss.

[0071] In an optional embodiment of the present invention, the first adjustment module 803 is further configured to simultaneously adjust the parameters of the large language model, the visual encoder, the scale decoder, the RGB projection layer and the spatial projection layer in the original spatial trajectory model with the goal of minimizing the total loss, so as to obtain the first spatial trajectory model.

[0072] In an optional embodiment of the present invention, the second adjustment module 804 is further configured to input the target image and the target instruction into the first spatial trajectory model, wherein a plurality of second spatial trajectories are generated using a reinforcement learning algorithm; the second spatial trajectory is a three-dimensional trajectory; and the first spatial trajectory model is reinforced and fine-tuned according to the plurality of second spatial trajectories and a preset reward mechanism to obtain the second spatial trajectory model.

[0073] In an optional embodiment of the present invention, the second adjustment module 804 is further configured to: calculate the reward score corresponding to each second spatial trajectory according to the preset reward mechanism; determine the relative advantage of each second spatial trajectory by taking the average reward corresponding to each second spatial trajectory as the reward baseline; and strengthen and fine-tune the first spatial trajectory model according to the relative advantage of each second spatial trajectory to obtain the second spatial trajectory model.

[0074] In an optional embodiment of the present invention, the reward mechanism includes: a result reward mechanism and a process reward mechanism; the second adjustment module 804 is further configured to calculate a first score corresponding to the result reward mechanism and a second score corresponding to the process reward mechanism for each second spatial trajectory; and calculate a reward score corresponding to each second spatial trajectory based on the first score and the second score.

[0075] In an optional embodiment of the present invention, the second adjustment module 804 is further configured to determine a first score corresponding to the result reward mechanism for each second spatial trajectory according to at least one of the following result reward strategies: whether the endpoint position corresponding to the second spatial trajectory is correct, and whether the output format of the second spatial trajectory is correct; and / or, determine a second score corresponding to the process reward mechanism for each second spatial trajectory according to at least one of the following process reward strategies: whether the predicted size of the target object conforms to a preset error range, and whether the predicted coordinates of the target object conform to a preset error range.

[0076] In an optional embodiment of the present invention, the second adjustment module 804 is further configured to determine, based on the relative advantage, a high-advantage trajectory and a low-advantage trajectory from a plurality of second spatial trajectories, increase the parameter probability corresponding to the high-advantage trajectory, and decrease the parameter probability corresponding to the low-advantage trajectory, so as to enhance and fine-tune the first spatial trajectory model.

[0077] In an optional embodiment of the present invention, the second adjustment module 804 is further configured to constrain the degree of deviation of the update of the first spatial trajectory model by using KL divergence in the reinforcement learning algorithm during the process of strengthening and fine-tuning the first spatial trajectory model.

[0078] In summary, the training device for the spatial trajectory model provided in this embodiment of the invention uses two adjustment stages: supervised fine-tuning and reinforcement fine-tuning. In the supervised fine-tuning stage, scale information is added to the training process and the model is fine-tuned under supervision based on this scale information, which enhances the model's understanding of the actual scale of objects. In the reinforcement fine-tuning stage, the reinforcement learning algorithm effectively supervises the key steps in spatial trajectory generation, ensuring the rationality of reasoning across multiple steps.

[0079] Figure 9 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 9, the electronic device may include: a processor 910, a communications interface 920, a memory 930, and a communications bus 940, wherein the processor 910, the communications interface 920, and the memory 930 communicate with each other through the communications bus 940. The processor 910 can call logical instructions in the memory 930 to execute a training method for a spatial trajectory model. This method includes: acquiring target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; inputting the target image and the target instructions into an original spatial trajectory model, and predicting scale information for a first spatial trajectory based on the output of the original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; the first spatial trajectory is a three-dimensional trajectory; performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; inputting the target image and the target instructions into the first spatial trajectory model, and performing reinforcement fine-tuning on the first spatial trajectory model using a reinforcement learning algorithm to obtain a second spatial trajectory model.

[0080] Furthermore, the logical instructions in the aforementioned memory 930 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0081] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the training method for the spatial trajectory model provided by the above methods. The method includes: acquiring target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; inputting the target image and the target instructions into an original spatial trajectory model, and predicting scale information of the first spatial trajectory based on the output of the original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; the first spatial trajectory is a three-dimensional trajectory; performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; inputting the target image and the target instructions into the first spatial trajectory model, and performing reinforcement fine-tuning on the first spatial trajectory model using a reinforcement learning algorithm to obtain a second spatial trajectory model.

[0082] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements a training method for the spatial trajectory model provided by the methods described above. The method includes: acquiring target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; inputting the target image and the target instructions into an original spatial trajectory model, and predicting scale information for a first spatial trajectory based on the output of the original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; the first spatial trajectory is a three-dimensional trajectory; performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; inputting the target image and the target instructions into the first spatial trajectory model, and performing reinforcement fine-tuning on the first spatial trajectory model using a reinforcement learning algorithm to obtain a second spatial trajectory model.

[0083] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0084] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0085] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A training method for a spatial trajectory model, characterized in that it comprises: Acquiring target data; wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; Obtaining a first spatial trajectory and scale information based on the target image, the target instruction, and the original spatial trajectory model; wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; the first spatial trajectory is a three-dimensional trajectory; Performing supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain the first spatial trajectory model; Inputting the target image and the target instruction into the first spatial trajectory model, and performing reinforcement fine-tuning on the first spatial trajectory model using a reinforcement learning algorithm to obtain the second spatial trajectory model.

2. The training method according to claim 1, characterized in that, The step of obtaining the first spatial trajectory and scale information based on the target image, the target command, and the original spatial trajectory model includes: The target image and the target command are input into the original spatial trajectory model, and the first spatial trajectory is output. The scale information is obtained based on the prediction of the first spatial trajectory.

3. The training method according to claim 2, characterized in that, The target image includes an RGB image and a depth image; the step of inputting the target image and the target command into the original spatial trajectory model and outputting a first spatial trajectory includes: The RGB image is encoded using the visual encoder in the original spatial trajectory model to obtain image features, and the depth image is encoded using the spatial encoder in the original spatial trajectory model to obtain geometric features; The image features, geometric features, and target instructions are input into the large language model in the original spatial trajectory model to output the first spatial trajectory; The step of predicting the scale information based on the first spatial trajectory includes: The scale information is obtained by decoding the scale prediction head in the first spatial trajectory using the scale decoder in the original spatial trajectory model.

4. The training method according to claim 3, characterized in that, The step of inputting the image features, the geometric features, and the target instruction into the large language model in the original spatial trajectory model and outputting the first spatial trajectory includes: The image features are input into the RGB projection layer of the original spatial trajectory model, and the geometric features are input into the spatial projection layer of the original spatial trajectory model, so as to obtain the image input corresponding to the image features and the geometric input corresponding to the geometric features respectively; The image input, the geometric input, and the target instruction are input into the large language model to output the first spatial trajectory.

5. The training method according to claim 4, characterized in that, The step of performing supervised fine-tuning of the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain the first spatial trajectory model includes: The prediction loss of the original spatial trajectory model is determined based on the first spatial trajectory and the target spatial trajectory in the target annotation data; The scale regression loss of the original spatial trajectory model is determined based on the scale information and the target scale information in the target annotation data; The total loss is determined based on the predicted loss and the scaling regression loss. With the goal of minimizing the total loss, the original spatial trajectory model is adjusted to obtain the first spatial trajectory model.

6. The training method according to claim 5, characterized in that, After determining the total loss based on the predicted loss and the scaling regression loss, and before adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model, the method further includes: The parameters of the spatial projection layer are updated based on the gradient results corresponding to the total loss. And / or, The scale information is updated using the scale regression loss.

7. The training method according to claim 5, characterized in that, The step of adjusting the original spatial trajectory model with the goal of minimizing the total loss to obtain the first spatial trajectory model includes: With the goal of minimizing the total loss, the parameters of the large language model, the visual encoder, the scale decoder, the RGB projection layer, and the spatial projection layer in the original spatial trajectory model are adjusted synchronously to obtain the first spatial trajectory model.

8. The training method according to claim 1, characterized in that, The step of performing reinforcement fine-tuning on the first spatial trajectory model using a reinforcement learning algorithm to obtain the second spatial trajectory model includes: The target image and the target instruction are input into the first spatial trajectory model, and multiple second spatial trajectories are generated using a reinforcement learning algorithm; the second spatial trajectories are three-dimensional trajectories. Based on the multiple second spatial trajectories and a preset reward mechanism, reinforcement fine-tuning is performed on the first spatial trajectory model to obtain the second spatial trajectory model.

9. The training method according to claim 8, characterized in that, The step of enhancing and fine-tuning the first spatial trajectory model based on multiple second spatial trajectories and a preset reward mechanism to obtain a second spatial trajectory model includes: According to the preset reward mechanism, calculate the reward score corresponding to each second spatial trajectory; Using the average reward corresponding to each second spatial trajectory as the reward baseline, the relative advantage of each second spatial trajectory is determined respectively; Based on the relative advantages of each of the second spatial trajectories, the first spatial trajectory model is enhanced and fine-tuned to obtain the second spatial trajectory model.

10. The training method according to claim 9, characterized in that, The reward mechanism includes: a result reward mechanism and a process reward mechanism; calculating the reward score corresponding to each second spatial trajectory according to the reward mechanism includes: Calculate the first score corresponding to the result reward mechanism and the second score corresponding to the process reward mechanism for each of the second spatial trajectories; The reward score for each second spatial trajectory is calculated based on the first score and the second score.

11. The training method according to claim 10, characterized in that, The calculation of the first score corresponding to the result reward mechanism and the second score corresponding to the process reward mechanism for each of the second spatial trajectories includes: The first score corresponding to each second spatial trajectory and the result reward mechanism is determined according to at least one of the following result reward strategies: whether the endpoint position corresponding to the second spatial trajectory is correct, and whether the output format of the second spatial trajectory is correct; And / or, The second score corresponding to each second spatial trajectory and the process reward mechanism is determined according to at least one of the following process reward strategies: whether the predicted size of the target object conforms to a preset error range, and whether the predicted coordinates of the target object conform to a preset error range.

12. The training method according to claim 9, characterized in that, the step of performing reinforcement fine-tuning on the first spatial trajectory model based on the relative advantage of each second spatial trajectory to obtain the second spatial trajectory model includes: determining, based on the relative advantage, high-advantage trajectories and low-advantage trajectories from the multiple second spatial trajectories; and adjusting the model parameters to increase the probability corresponding to the high-advantage trajectories and decrease the probability corresponding to the low-advantage trajectories, so as to perform reinforcement fine-tuning on the first spatial trajectory model.
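Illustrative note (not part of the claims): the effect described in claim 12 can be sketched with a toy policy. The sketch assumes a categorical policy with one logit per candidate trajectory and takes a single gradient-ascent step on the surrogate objective sum_i A_i * log p_i. The function names, the learning rate, and the toy policy itself are hypothetical; the actual model and optimizer are not specified in the claim.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, advantages, lr=0.5):
    """One ascent step on sum_i A_i * log p_i for a categorical toy policy."""
    probs = softmax(logits)
    total_adv = sum(advantages)
    # Gradient w.r.t. logit j: A_j - p_j * sum_i A_i.
    grads = [a - p * total_adv for a, p in zip(advantages, probs)]
    return [z + lr * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0, 0.0, 0.0]
advantages = [0.5, 0.0, -0.5, 0.0]  # mean-baseline advantages sum to zero
old_probs = softmax(logits)
new_probs = softmax(policy_gradient_step(logits, advantages))
print(old_probs, new_probs)
```

After the step, the trajectory with positive advantage becomes more probable and the one with negative advantage less probable, which is the behavior the claim describes.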

13. The training method according to claim 8, characterized in that, the method further includes: during the reinforcement fine-tuning of the first spatial trajectory model, using the KL divergence in the reinforcement learning algorithm to constrain the degree of deviation of the updates to the first spatial trajectory model.
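Illustrative note (not part of the claims): one common way to apply a KL constraint is to subtract a weighted KL term from the training objective. The sketch below assumes categorical distributions over candidate trajectories, a reference distribution taken from the model before the update, and a penalty weight `beta`. All of these are assumptions; the claim does not specify the reference distribution or the form of the constraint.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions; assumes q[i] > 0 where p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_objective(surrogate, policy_probs, reference_probs, beta=0.1):
    """Surrogate objective minus a KL penalty that discourages large deviations."""
    return surrogate - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.25, 0.25, 0.25, 0.25]  # hypothetical pre-update distribution
updated = [0.40, 0.25, 0.10, 0.25]    # hypothetical post-update distribution
penalty = kl_divergence(updated, reference)
objective = penalized_objective(1.0, updated, reference)
print(penalty, objective)
```

The penalty is zero when the updated distribution equals the reference and grows as the two diverge, so a larger `beta` keeps each update closer to the starting model.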

14. A training device for a spatial trajectory model, characterized in that, it comprises: an acquisition module configured to acquire target data, wherein the target data includes a target image, target annotation data corresponding to the target image, and target instructions; a prediction module configured to obtain a first spatial trajectory and scale information based on the target image, the target instructions, and an original spatial trajectory model, wherein the scale information indicates the scaling degree between the actual size and the predicted size of the target object in the target image; a first adjustment module configured to perform supervised fine-tuning on the original spatial trajectory model based on the scale information, the first spatial trajectory, and the target annotation data to obtain a first spatial trajectory model; and a second adjustment module configured to input the target image and the target instructions into the first spatial trajectory model and to use a reinforcement learning algorithm to perform reinforcement fine-tuning on the first spatial trajectory model to obtain a second spatial trajectory model.

15. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the training method according to any one of claims 1 to 13 is implemented.

16. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the training method according to any one of claims 1 to 13 is implemented.

17. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the training method according to any one of claims 1 to 13 is implemented.