Method, apparatus, device and storage medium for training machine learning model

A decompositional reinforcement learning method breaks spatial reasoning down into multiple dimensions, which addresses the hallucination problem of visual language models in three-dimensional and four-dimensional spatial reasoning and improves the accuracy and consistency of the model.

CN122242765A · Pending · Publication Date: 2026-06-19 · JINGDONG TECH HLDG CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: JINGDONG TECH HLDG CO LTD
Filing Date: 2026-03-30
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Visual language models have limitations in reasoning tasks involving three-dimensional spatial structures and temporal dynamics. They are prone to hallucinations, rely on heuristics rather than reasoning based on depth information or temporal consistency, and struggle to recover dimensional information folded by camera projection.

Method used

A decompositional reinforcement learning approach decomposes spatial reasoning into a planar spatial dimension, a depth spatial dimension, and a temporal dimension. Training samples are acquired, metrics measuring the difference between the truth responses and the predicted responses in each dimension are determined, and the parameters of the machine learning model are updated based on those metrics, which improves the accuracy and consistency of the model in three-dimensional and four-dimensional spatial reasoning.

Benefits of technology

It effectively recovers the dimensional information folded by camera projection, improves the accuracy and consistency of machine learning models in dynamic spatial reasoning, and overcomes the hallucination problem of traditional models in dynamic spatial reasoning.

✦ Generated by Eureka AI based on patent content.

Abstract

According to embodiments of this disclosure, a method, apparatus, device, and storage medium for training a machine learning model are provided. The method includes: acquiring training samples, the training samples including sample visual information, sample queries related to the sample visual information, and truth responses to the sample queries; using a machine learning model to be trained, determining a predicted response to the sample queries based on the sample visual information; determining at least one of a first metric, a second metric, and a third metric based at least on the truth responses and the predicted responses, the first metric indicating a difference between the truth responses and the predicted responses in a planar spatial dimension, the second metric indicating a difference between the truth responses and the predicted responses in a depth spatial dimension, and the third metric indicating a difference between the truth responses and the predicted responses in a temporal dimension; and updating the parameters of the machine learning model based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

Description

Technical Field

[0001] The exemplary embodiments disclosed herein generally relate to the field of information technology, and particularly to methods, apparatus, devices, computer-readable storage media, and computer program products for training machine learning models.

Background

[0002] Visual language models have demonstrated strong capabilities in multimodal tasks, but limitations remain in reasoning tasks involving 3D spatial structure and temporal dynamics. Some techniques enhance spatial understanding by using supervised fine-tuning with synthetic large-scale spatial question-answering datasets or by introducing additional spatial labels. However, single-view-based question-answering learning can lead to hallucinations in dynamic spatial reasoning, with models tending to rely on heuristics to infer spatial relationships rather than reasoning based on depth information or temporal consistency, thus struggling to recover dimensional information folded by camera projection.

Summary of the Invention

[0003] In a first aspect of this disclosure, a method for training a machine learning model is provided. The method includes: acquiring training samples, the training samples including sample visual information, sample queries related to the sample visual information, and truth responses to the sample queries; using a machine learning model to be trained, determining a predicted response to the sample queries based on the sample visual information; determining at least one of a first metric, a second metric, and a third metric based at least on the truth responses and the predicted responses, the first metric indicating a difference between the truth responses and the predicted responses in a planar spatial dimension, the second metric indicating a difference between the truth responses and the predicted responses in a depth spatial dimension, and the third metric indicating a difference between the truth responses and the predicted responses in a temporal dimension; and updating the parameters of the machine learning model based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

[0004] In a second aspect of this disclosure, an apparatus for training a machine learning model is provided. The apparatus includes: an acquisition module configured to acquire training samples, the training samples including sample visual information, sample queries related to the sample visual information, and truth responses to the sample queries; a first determination module configured to determine a predicted response to the sample queries based on the sample visual information using a machine learning model to be trained; a second determination module configured to determine at least one of a first metric, a second metric, and a third metric based at least on the truth responses and the predicted responses, the first metric indicating a difference between the truth responses and the predicted responses in a planar spatial dimension, the second metric indicating a difference between the truth responses and the predicted responses in a depth spatial dimension, and the third metric indicating a difference between the truth responses and the predicted responses in a temporal dimension; and an update module configured to update the parameters of the machine learning model based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

[0005] In a third aspect of this disclosure, an electronic device is provided. The electronic device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions causing the electronic device to perform the method of the first aspect of this disclosure when executed by the at least one processing unit.

[0006] In a fourth aspect of this disclosure, a computer-readable storage medium is provided. This computer-readable storage medium stores a computer program that can be executed by a processor to perform the method according to the first aspect of this disclosure.

[0007] In a fifth aspect of this disclosure, a computer program product is provided, which is tangibly stored in a computer storage medium and includes computer-executable instructions that, when executed by a device, cause the device to perform the method according to the first aspect of this disclosure.

[0008] It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to restrict the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

Figures 2A and 2B respectively show schematic diagrams of example architectures for training machine learning models according to some embodiments of the present disclosure;

Figure 3 shows a schematic diagram illustrating an example of a first training sample according to some embodiments of the present disclosure;

Figure 4 shows a schematic diagram illustrating an example of a second training sample according to some embodiments of the present disclosure;

Figure 5 shows a schematic diagram illustrating an example of a third training sample according to some embodiments of the present disclosure;

Figure 6 shows a flowchart illustrating a process for training a machine learning model according to some embodiments of the present disclosure;

Figure 7 shows a block diagram of an apparatus for training a machine learning model according to some embodiments of the present disclosure; and

Figure 8 shows a block diagram of an electronic device in which one or more embodiments of the present disclosure may be implemented.

Detailed Description

[0010] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0011] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.

[0012] As an optional but non-limiting implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0013] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0014] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0015] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0016] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.

[0017] As used in this disclosure, the term "visual language model" refers to a machine learning model capable of handling multimodal tasks, which receives visual and linguistic information as input and generates corresponding text output.

[0018] As used in this disclosure, the term "four-dimensional spatial-temporal intelligence" refers to the ability to jointly reason about three-dimensional geometry and temporal dynamics, including the understanding of geometric properties (e.g., distance, orientation, occlusion, topology) and temporal dynamics (e.g., motion, interaction, and state transitions).

[0019] As used in this disclosure, the term "planar correspondence" refers to establishing a correspondence between pixels or points in different views, so as to determine where a point in one view is located in another view.

[0020] As used in this disclosure, the term "depth consistency" refers to the ability to reason about the relative depth order of objects in a scene to determine the front-to-back relationship of objects in three-dimensional space.

[0021] As used in this disclosure, the term "time reversibility" means that reasoning from the starting view to the ending view should remain logically consistent when the query is posed in the reverse direction, which ensures consistency in the time dimension.
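
As an illustration only, the time reversibility property defined above can be sketched as a consistency check between a forward answer and a backward answer. The answer vocabulary, the inverse mapping, and the function name below are hypothetical and are not taken from the disclosure.

```python
# Hypothetical vocabulary of motion answers and their temporal inverses.
INVERSE_ANSWER = {
    "left": "right",
    "right": "left",
    "closer": "farther",
    "farther": "closer",
    "up": "down",
    "down": "up",
    "static": "static",
}


def is_time_reversible(forward_answer, backward_answer):
    """Return True when the backward answer is the inverse of the forward answer.

    forward_answer: response to a query posed from the starting view to the ending view.
    backward_answer: response to the same query posed in the reverse direction.
    """
    forward = forward_answer.strip().lower()
    backward = backward_answer.strip().lower()
    if forward not in INVERSE_ANSWER:
        raise ValueError(f"unknown answer: {forward_answer!r}")
    return INVERSE_ANSWER[forward] == backward


# An object that moved left in the forward direction must move right in reverse.
consistent = is_time_reversible("left", "right")
inconsistent = is_time_reversible("left", "left")
```

A check of this kind could feed a temporal metric, for example by granting a reward only when the forward and backward answers agree in this sense.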

[0022] As used in this disclosure, the term "verifiable reward" refers to a reward derived directly from correctness against the ground truth, rather than from a learned reward model, and is used to provide objective and verifiable supervisory signals during reinforcement learning.
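
A minimal sketch of a verifiable reward follows, assuming that responses are short text answers and that correctness means an exact match after simple normalization. The normalization rule and the function name are assumptions made for illustration, not details from the disclosure.

```python
def verifiable_reward(predicted_response, truth_response):
    """Return 1.0 if the prediction matches the ground truth, otherwise 0.0.

    The comparison ignores case and collapses runs of whitespace. No learned
    reward model is involved, so the result can be checked by inspection.
    """

    def normalize(text):
        return " ".join(text.lower().split())

    return 1.0 if normalize(predicted_response) == normalize(truth_response) else 0.0


reward_match = verifiable_reward("  Behind the  chair", "behind the chair")
reward_miss = verifiable_reward("in front of the chair", "behind the chair")
```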

[0023] As used in this disclosure, the term "overlap mask" refers to a binary mask that identifies the pixels in one view that are actually visible in another view under a given camera configuration. It is used to exclude correspondences in occluded or non-overlapping regions when calculating planar correspondence rewards.
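
The role of the overlap mask can be sketched as follows. This is one possible form of a masked planar correspondence reward, assuming that correspondences are given as pixel coordinates and that a match counts as correct when it falls within a pixel tolerance. The tolerance, the hit-rate form of the reward, and the function name are assumptions.

```python
import math


def masked_correspondence_reward(predicted, target, overlap_mask, tolerance=2.0):
    """Fraction of visible points whose predicted match lies within `tolerance` pixels.

    predicted, target: lists of (x, y) pixel coordinates in the second view.
    overlap_mask: list of 0/1 flags; 1 means the point is visible in both views.
    Returns 0.0 when no point is visible.
    """
    if not (len(predicted) == len(target) == len(overlap_mask)):
        raise ValueError("inputs must have the same length")

    visible = 0
    hits = 0
    for (px, py), (tx, ty), keep in zip(predicted, target, overlap_mask):
        if not keep:
            continue  # occluded or non-overlapping point: excluded from the reward
        visible += 1
        if math.hypot(px - tx, py - ty) <= tolerance:
            hits += 1
    return hits / visible if visible else 0.0


# The third point is occluded in the second view, so only two points are scored.
reward = masked_correspondence_reward(
    predicted=[(10, 10), (50, 52), (200, 200)],
    target=[(10, 11), (50, 60), (0, 0)],
    overlap_mask=[1, 1, 0],
)
```

Without the mask, the third point would be scored against a target that does not exist in the second view and would penalize the model for something it cannot observe.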

[0024] As used in this disclosure, the term "Kendall rank correlation coefficient" refers to a statistic used to measure the consistency between a predicted depth ranking and a true depth ranking, calculated based on the number of concordant and discordant pairs.
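
The coefficient can be computed directly from pair counts. The sketch below implements the tau-a variant, (concordant - discordant) / (n(n-1)/2), assuming that each object has one predicted depth value and one true depth value. The disclosure does not state which variant is used or how ties are handled, so those choices are assumptions.

```python
from itertools import combinations


def kendall_tau(predicted_depths, true_depths):
    """Kendall tau-a between predicted and true depth values of the same objects.

    Both lists hold one depth value per object, in the same object order.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(predicted_depths) != len(true_depths):
        raise ValueError("both lists must describe the same objects")
    n = len(true_depths)
    if n < 2:
        raise ValueError("need at least two objects to compare orderings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A positive product means the pair is ordered the same way in both lists.
        agreement = (predicted_depths[i] - predicted_depths[j]) * (
            true_depths[i] - true_depths[j]
        )
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Three objects: the prediction swaps the two nearest ones.
# Two pairs are concordant and one is discordant, so tau = (2 - 1) / 3.
tau = kendall_tau([2.0, 1.0, 3.0], [1.0, 2.0, 3.0])
```

A value of 1 means the predicted front-to-back order matches the true order exactly, and a value of -1 means it is fully reversed.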

[0025] Visual language models have demonstrated exceptional capabilities in general multimodal tasks; however, they suffer from a fundamental "flattening" problem when it comes to reasoning about the physical world. This spatial bottleneck stems from a deep dimensional mismatch: visual language models are trained to interpret two-dimensional projections, while true spatial reasoning requires recovering the underlying three-dimensional geometry and temporal continuity. Spatial reasoning ability is a fundamental component of visual language models for achieving real-world general artificial intelligence. Although humans can easily recognize spatial relationships in continuous visual environments, current visual language models still struggle with basic spatial queries that require interpreting a dynamic world beyond two-dimensional projections. This limitation severely restricts the application of visual language models in fields such as autonomous driving, robot navigation, and world modeling.

[0026] To mitigate this issue, many studies have synthesized large-scale spatial question-answering datasets for supervised fine-tuning or explored the introduction of additional spatial labels. However, these methods are often limited by their reliance on explicit 3D data and perform poorly when transferred to 4D scenes. More critically, although these methods sample images from 4D scenes, they rely heavily on single-view-based question-answering learning. This static learning approach based on 2D patterns can induce hallucinations in dynamic spatial reasoning, where models tend to infer spatial relationships heuristically, often making premature judgments rather than reasoning based on depth information or temporal consistency.

[0027] Recently, some researchers have begun exploring the application of reinforcement learning with verifiable rewards to enhance the spatial reasoning capabilities of visual language models. Reinforcement learning with verifiable rewards has shown superior generalization ability compared to supervised fine-tuning. However, formulating a unified four-dimensional objective in space and time remains challenging, as it is difficult to effectively recover the dimensional information folded during the process of camera projection from reality to observation.

[0028] In the field of spatial intelligence, traditional approaches improve spatial intelligence in three ways: architectural methods introduce geometric biases or 3D representations, or utilize external perception tools; data expansion methods scale up spatial supervision; and reasoning-oriented frameworks enhance structured spatial reasoning. However, these approaches fail to provide a feasible learning path for acquiring spatial intelligence through both depth and temporal analysis simultaneously.

[0029] In the field of multimodal reinforcement learning, traditional methods have moved beyond simple outcome rewards and are evolving towards a more structured and robust reinforcement learning paradigm. Reinforcement learning for spatial intelligence primarily focuses on how to stably improve spatial reasoning through verifiable rewards. Some methods utilize single-view consistency or dense spatial rewards to incorporate spatial relationships into reinforcement learning objectives. Research indicates that free-form reasoning is susceptible to shortcut learning, which has led to the introduction of intermediate verifiable stages such as description or region localization. The importance of perception-oriented rewards is also emphasized. Meanwhile, progressive training methods proceed step by step from perception to reasoning, while the exploration of explicit 3D representations aims to achieve better generalization. Overall, traditional methods demonstrate that relying solely on long-chain reasoning or sparse rewards is insufficient, prompting the development of structured solutions employing explicit representations and multiple verifiable rewards.

[0030] In view of this, embodiments of the present disclosure propose an improved scheme for training a machine learning model. In this scheme, training samples are obtained, including sample visual information, sample queries related to the sample visual information, and truth responses to the sample queries. A machine learning model to be trained (e.g., a visual language model) is used to determine a predicted response to the sample queries based on the sample visual information. At least one of a first metric, a second metric, and a third metric is determined based on at least the truth response and the predicted response. The first metric indicates the difference between the truth response and the predicted response in a planar spatial dimension, the second metric indicates the difference between the truth response and the predicted response in a depth spatial dimension, and the third metric indicates the difference between the truth response and the predicted response in a temporal dimension. Subsequently, the parameters of the machine learning model are updated based on at least one of the first, second, and third metric to obtain a trained machine learning model.

[0031] In this way, a decompositional reinforcement learning method is adopted, which decomposes spatial reasoning into three complementary dimensions: the planar spatial dimension, the depth spatial dimension, and the temporal dimension. This method overcomes the problem that traditional visual language models are prone to hallucinations in dynamic spatial reasoning and tend to infer spatial relationships heuristically rather than based on depth information or temporal consistency. It can effectively recover dimensional information folded by camera projection, which is conducive to improving the accuracy and consistency of machine learning models in three-dimensional and four-dimensional spatial reasoning tasks.
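
The scheme updates the model based on at least one of the three metrics, so any subset of them may be available for a given sample. The sketch below shows one possible way to aggregate whichever metrics are present into a single scalar signal. The weighted average, the default weights, and the function name are assumptions, since this passage does not specify how the metrics are combined.

```python
def combine_metrics(planar=None, depth=None, temporal=None, weights=(1.0, 1.0, 1.0)):
    """Weighted average over whichever of the three metrics are provided.

    planar, depth, temporal: optional floats for the first, second, and third metric.
    weights: hypothetical per-dimension weights, in the same order.
    """
    pairs = [
        (metric, weight)
        for metric, weight in zip((planar, depth, temporal), weights)
        if metric is not None
    ]
    if not pairs:
        raise ValueError("at least one metric is required")
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("weights of the provided metrics must sum to a positive value")
    return sum(metric * weight for metric, weight in pairs) / total_weight


# A sample with planar and depth supervision but no temporal supervision.
signal = combine_metrics(planar=0.5, depth=1.0)
```

Renormalizing by the weights of the metrics that are present keeps the signal on the same scale whether a sample supervises one, two, or all three dimensions.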

[0032] The following description, in conjunction with the accompanying drawings, details embodiments of this disclosure.

[0033] Figure 1 shows a schematic diagram of an example environment 100 for model training and application according to some embodiments of the present disclosure. As shown in Figure 1, the example environment 100 involves three distinct phases of the machine learning model 105: a pre-training phase 102, a training phase 104, and an application phase 106. A testing phase, not shown in the figure, may also occur after the pre-training phase 102 or the training phase 104.

[0034] Machine learning model 105 can be based on any suitable model architecture, including but not limited to Transformer models, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), and so on. In some examples, machine learning model 105 can be based on a language model (LM), for example. Machine learning model 105 can include language models, Large Language Models (LLMs), Vision-Language Models (VLMs), Multimodal Large Language Models (MLLMs), and so on. Language models, by learning from large corpora, possess question-answering capabilities. Machine learning model 105 can also be based on other suitable models. By way of example only, machine learning model 105 can also be based on a diffusion model (DM). It should be noted that the machine learning model 105 can also be a machine learning model obtained by combining multiple machine learning models. For example, the machine learning model can be a machine learning model obtained by combining VLM and DM, where VLM can be used to perform semantic understanding on the received input, and its output semantic understanding can be used as guidance information for DM.

[0035] Example environment 100 involves model pre-training system 110, model training system 120, and model application system 130. In the pre-training phase 102, model pre-training system 110 is configured to perform pre-training of machine learning model 105 using pre-training dataset 112. At the start of pre-training, the individual components of machine learning model 105 may have initial parameter values. The pre-training process involves updating the parameter values of machine learning model 105 to desired values based on data in pre-training dataset 112. The pre-training task is used to assist in updating the parameters of machine learning model 105.

[0036] During the pre-training phase 102, the machine learning model 105 can learn strong generalization capabilities using a pre-training dataset 112 that includes a large amount of data. After pre-training is complete, the parameter values of the machine learning model 105 have been updated to include the pre-trained parameter values. In some embodiments, the pre-trained machine learning model 105 possesses basic image processing capabilities and can process the input image based on received processing instructions and the input image to generate a corresponding output image.

[0037] A pre-trained machine learning model 105 can be provided to the training phase 104, where it is trained by the model training system 120 for different downstream tasks. In the training phase 104, the parameter values of the machine learning model 105 are further adjusted using the training dataset 122. During training, corresponding training algorithms are also used to update and adjust the parameters of the machine learning model 105. Since the model has already learned a great deal from the pre-training data in the pre-training phase, a downstream task model that meets expectations can be obtained using only a small amount of training data in the training phase 104. In some embodiments, the pre-training dataset 112 may include multiple pre-training samples for general scenarios, and the training dataset 122 may include multiple training samples for specific scenarios and specific tasks. In some embodiments, the trained machine learning model 105 possesses image spatial editing capabilities, enabling it to perform spatial transformation operations on objects in the input image and / or adjust the acquisition viewpoint of the virtual camera corresponding to the input image with relatively high accuracy.

[0038] In some embodiments, a testing phase may be included after the training phase 104, where the performance of the machine learning model 105 can be further tested using a test dataset. The dataset used in the testing phase is of the same type as that used in the training phase.

[0039] In the application phase 106, the obtained machine learning model 105 has trained parameter values and can be provided to the model application system 130 for use. In the application phase 106, the machine learning model 105 can be used to process the corresponding model input 132 in the real-world scenario and provide the corresponding model output 134. As an example only, the model input 132 may include processing instructions and input images, and the model output 134 may include the corresponding output images.

[0040] In Figure 1, the model pre-training system 110, the model training system 120, and the model application system 130 can be deployed on any suitable electronic device. This electronic device can be any type of computing-capable device, including terminal devices or server devices. The terminal device can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof.

[0041] The server-side device can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms. Server-side devices may include, for example, computing systems / servers, such as mainframes, edge computing nodes, computing devices in cloud environments, etc. It is understood that the model pre-training system 110, the model training system 120, and the model application system 130 can be deployed on the same electronic device or on different electronic devices; this disclosure does not limit this.

[0042] It should be understood that the components and arrangements shown in the example environment 100 of Figure 1 are merely examples, and a computing system suitable for implementing the exemplary implementations described in this disclosure may include one or more different components, other components, and / or different arrangements. For example, although shown as separate, the model pre-training system 110, the model training system 120, and the model application system 130 may be integrated in the same system or device. For example, at least the model pre-training system 110 and the model training system 120 may be integrated in the model training system or device. Implementations of this disclosure are not limited in this respect.

[0043] The following description of the example will continue with reference to the accompanying drawings.

[0044] Figure 2A shows a schematic diagram of an example architecture 200A for training a machine learning model 105 according to some embodiments of the present disclosure. As shown in Figure 2A, the example architecture 200A illustrates a progressive training framework for the machine learning model 105. This progressive training framework comprises two training phases. In the first phase, supervised fine-tuning of the machine learning model 105 on diverse spatial tasks builds its basic spatial awareness capabilities. The second phase introduces a decompositional reinforcement learning method with verifiable rewards (accuracy, XY coordinates, Z-axis, T-axis) to improve correspondence recognition, depth perception, and temporal reasoning capabilities.

[0045] During the pre-training phase (i.e., the first phase), the model pre-training system 110 can acquire a pre-training dataset 112. The pre-training dataset 112 may include pre-training samples, which may include pre-training visual information, pre-training queries related to the pre-training visual information, and pre-training responses to the pre-training queries. As an example, the pre-training dataset 112 can be represented as $\mathcal{D} = \{s_i\}_{i=1}^{N}$, where $N$ is a positive integer and $s_i$ represents a pre-training sample. Each pre-training sample $s_i$ can include pre-training visual information $V_i$, a pre-training query $q_i$, and the corresponding pre-training response $y_i$. The pre-training visual information $V_i$ can include a single image, multiple images, or a video. For example, $V_i$ can include multiple images arranged in chronological order, that is, a sequence of visual observations that encodes the dynamic evolution of a 3D scene. In some cases, the pre-training response $y_i$ can also be called a truth response.
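
The sample structure described above (visual information, a related query, and a truth response) can be sketched as a simple record type. The class name, the field names, and the use of file names to stand in for frames are hypothetical and serve only to illustrate the layout of the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PretrainingSample:
    """One pre-training sample: visual information, a query, and a truth response."""

    visual_frames: List[str]  # hypothetical frame identifiers, in chronological order
    query: str                # question about the visual information
    truth_response: str       # ground-truth answer to the query


# A dataset is a collection of N such samples (here N = 2).
dataset = [
    PretrainingSample(
        visual_frames=["frame_000.png"],
        query="Which object is closer to the camera?",
        truth_response="the chair",
    ),
    PretrainingSample(
        visual_frames=["frame_000.png", "frame_001.png", "frame_002.png"],
        query="In which direction does the cup move?",
        truth_response="left",
    ),
]
```

A single-element frame list covers the single-image case, while a longer list covers multiple images or sampled video frames.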

[0046] In some examples, the model pre-training system 110 can use the machine learning model 105 to obtain a predicted response based on pre-trained visual information and a pre-trained query. Then, the model pre-training system 110 can update the parameters of the machine learning model 105 based on the difference between the predicted response and the true response to obtain a pre-trained machine learning model 105.

[0047] In some examples, the model pre-training system 110 can enable the machine learning model 105 to perform diverse spatial tasks based on pre-training samples to obtain predicted responses. The model pre-training system 110 can perform supervised fine-tuning training on the machine learning model 105 based on the difference between the predicted and ground truth responses. In this way, the spatial localization and basic perception capabilities of the machine learning model 105 can be trained in preparation for subsequent reinforcement learning.

[0048] As an example, for a pre-training sample $s_i = (V_i, q_i, y_i)$, the model pre-training system 110 can perform feature encoding on the pre-training visual information $V_i$ to obtain a pre-trained feature representation. For example, the model pre-training system 110 can encode the pre-training visual information $V_i$ into the latent space to obtain an implicit spatiotemporal representation $z_i$ (i.e., the pre-trained feature representation). This implicit spatiotemporal representation can capture the geometric properties (e.g., distance, orientation, occlusion, topology, etc.) and temporal dynamics (e.g., motion, interaction, and state transitions) of the pre-training visual information. The model pre-training system 110 can utilize the machine learning model 105 to perform reasoning based on the pre-trained feature representation $z_i$ and the pre-training query $q_i$ to obtain a predicted response $\hat{y}_i$, for example, a text response or a voice response. Then, the model pre-training system 110 can use the difference between the predicted response $\hat{y}_i$ and the truth response $y_i$ to update the parameters of the machine learning model 105.

[0049] In some examples, the pre-training samples may include both short-format samples and long-format samples. The main difference between them lies in the content of the ground truth response. The ground truth response of a short-format sample can include the ground truth result corresponding to the pre-training query, such as the size of an object in an image or video, the distance between objects, the relationship between objects, the location of an object, text appearing in the image or video (e.g., captions), or the order of objects. The ground truth response of a long-format sample can include both the ground truth result corresponding to the pre-training query and the ground truth logic derivation steps. The ground truth logic derivation steps describe the process of deriving the ground truth result from the pre-training visual information and are described in detail below.

[0050] In the pre-training phase 102, the model pre-training system 110 can use the machine learning model 105 to obtain prediction results and prediction logic derivation steps based on the pre-training visual information (including the pre-training visual information in both short-format and long-format samples). The prediction logic derivation steps describe the process of deriving the prediction results from the pre-training visual information. The model pre-training system 110 can update the machine learning model 105 based on the difference between the prediction results and the ground truth results. In this way, general multimodal understanding and specialized spatial perception are enhanced together. The refined responses emphasize core localization skills, such as spatial localization, correspondence, and spatial correlation, thereby helping the model match visual observations to spatial semantics accurately.

[0051] In some examples, the model pre-training system 110 can use visual labels extracted from images as conditional context. The model pre-training system 110 can then tune the parameters of the machine learning model 105 using an autoregressive optimization objective function as shown below:

$$\mathcal{L} = -\sum_{t=1}^{T} w_t \log p_\theta\left(y_t \mid y_{<t}, V, Q\right) \qquad (1)$$

where $V$ denotes the pre-training visual information, $Q$ denotes the pre-training query, $y_t$ denotes the $t$-th token of the ground truth response $Y$, $w_t$ denotes the weight of the $t$-th token, $p_\theta(y_t \mid y_{<t}, V, Q)$ denotes the probability that the machine learning model 105 generates the $t$-th token, $y_{<t}$ denotes the generated prefix sequence, and $T$ denotes the length of the ground truth response $Y$. In some examples, the pre-training phase 102 may be performed for one training cycle to inject spatial prior information without compromising generalization performance. The pre-training phase 102 may also be performed for multiple training cycles. The embodiments of this disclosure are not limited in this regard.
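As a non-limiting illustration, a token-weighted autoregressive objective of the kind described in paragraph [0051] can be sketched as follows. The function name `weighted_nll` and its inputs are hypothetical, and the sketch assumes that the objective is a weighted sum of per-token negative log-probabilities without length normalization, which the source does not confirm.

```python
import math


def weighted_nll(token_probs, weights):
    """Weighted negative log-likelihood of one ground truth response.

    token_probs[t]: probability the model assigns to the t-th ground truth
        token, given the visual input, the query, and the preceding tokens.
    weights[t]: weight of the t-th token.

    ASSUMPTION: a plain weighted sum with no division by the response length.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights))


# Two-token response; the second token is weighted twice as heavily.
example_loss = weighted_nll([0.5, 0.25], [1.0, 2.0])
```

A larger weight makes errors on the corresponding token more costly, which is one way to emphasize tokens that carry spatial content.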

[0052] Figure 2B shows a schematic diagram of an example architecture 200B for training the machine learning model 105 according to some embodiments of the present disclosure. As shown in Figure 2B, the example architecture 200B illustrates the training framework for the second stage (i.e., the training phase 104). For ease of discussion, the example architecture 200B will be described from the perspective of the model training system 120, but this is merely exemplary.

[0053] As shown in Figure 2B, during the training phase 104, the model training system 120 can acquire training samples. The training samples may include sample visual information 202, sample queries 204 related to the sample visual information 202, and ground truth responses 206 to the sample queries 204. As an example, the training dataset 122 can be represented as $\mathcal{D} = \{s_i\}_{i=1}^{N}$, where $N$ is a positive integer and $s_i$ denotes a training sample. Each training sample $s_i$ can include sample visual information $V_i$, a sample query $Q_i$, and the corresponding ground truth response $Y_i$. The sample visual information $V_i$ can include a single image, multiple images, a video, and so on. For example, the sample visual information $V_i$ may include multiple images arranged in chronological order or, for example, a sequence of visual observations that encodes the dynamic evolution of a 3D scene. In some cases, the pre-training dataset 112 and the training dataset 122 may be the same dataset or different datasets.

[0054] As shown in Figure 2B, during the training phase 104, the model training system 120 can use the machine learning model 114 to be trained to determine a predicted response 208 to the sample query 204 based on the sample visual information 202 and the sample query 204. The model training system 120 can determine at least one of a first metric 210, a second metric 212, and a third metric 214 based at least on the ground truth response 206 and the predicted response 208. The first metric 210 indicates the difference between the ground truth response 206 and the predicted response 208 in the planar spatial dimension, the second metric 212 indicates the difference between them in the depth spatial dimension, and the third metric 214 indicates the difference between them in the temporal dimension.

[0055] In some examples, the model training system 120 can perform feature encoding on the sample visual information 202 to obtain a sample feature representation. Then, the model training system 120 can use the machine learning model 105 to determine the predicted response 208 based on the sample feature representation. It should be noted that the feature encoding process of the model training system 120 on the sample visual information 202 is similar to the feature encoding process of the model pre-training system 110 on the pre-training visual information; details can be found in the preceding description and will not be repeated here.

[0056] In some examples, the training samples include a first training sample, which includes first visual information, a first sample query, and a first ground truth response. The first visual information may include a first image and a second image, where the first image includes a first object and the second image includes multiple candidate objects. An object here may be a pixel, a point, a physical object, or a virtual object in an image (e.g., the first or second image). The first sample query requests the object corresponding to the first object from among the multiple candidate objects. The first ground truth response indicates the projected object formed by projecting the first object onto the second image.

[0057] The model training system 120 can use the machine learning model 105 to determine a first predicted response based on the first image and the second image. The first predicted response can indicate the predicted object corresponding to the first object among the multiple candidate objects. The model training system 120 can determine the error between the predicted object and the projected object based on the first ground truth response and the first predicted response. The model training system 120 can then determine the first metric 210 based on the error. In some cases, updating the parameters of the machine learning model 105 based on the first metric 210 can also be referred to as an "XY reward mechanism".

[0058] Figure 3 shows a schematic diagram of an example 300 of a first training sample according to some embodiments of the present disclosure. As shown in Figure 3, the first training sample may include image 310 (i.e., the first image) and image 320 (i.e., the second image). Image 310 may include point 312 (i.e., the first object). Image 320 includes multiple points 322, 324, 326, and 328. The model training system 120 may use the machine learning model 105 to select the predicted point corresponding to point 312 from the multiple points 322, 324, 326, and 328.

[0059] As an example, the first training sample may include images $I_1$ and $I_2$ (e.g., images 310 and 320), the depth maps $D_1$ and $D_2$ corresponding to the images, the camera intrinsic parameters $K_1$ and $K_2$, and the transformation matrices $T_1$ and $T_2$ from the camera coordinate system to the world coordinate system. The model training system 120 provides a reference pixel $p_1$ in image $I_1$ and a discrete candidate point set in view $I_2$, which can be represented, for example, as $\mathcal{C} = \{c_1, \dots, c_m\}$ and can include, for example, the candidate point locations. The model training system 120 can use the machine learning model 105 to select, from the multiple candidate locations, the predicted point corresponding to the reference pixel $p_1$, which can be represented as $\hat{p} \in \mathcal{C}$.

[0060] For the reference pixel $p_1$ in image $I_1$, the model training system 120 can back-project the reference pixel onto the camera coordinate system based on the reference pixel $p_1$ and its depth value $z_1$, using the formula shown below:

$$P_1 = z_1 K_1^{-1} \tilde{p}_1 \qquad (2)$$

where $P_1$ denotes the three-dimensional point corresponding to the reference pixel $p_1$ in the camera coordinate system of image $I_1$, $K_1$ denotes the camera intrinsic parameters of image $I_1$, and $\tilde{p}_1$ denotes the homogeneous pixel coordinates of $p_1$ on the image plane of $I_1$.

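[0061] The model training system 120 can convert the three-dimensional point $P_1$ to the world coordinate system and project it onto image $I_2$ using the following formula:

$$X = T_1 P_1, \qquad P_2 = T_2^{-1} X, \qquad \tilde{p}_2 \sim K_2 P_2 \qquad (3)$$

where $\sim$ denotes a scale-invariant equivalence relation, $T_1$ and $T_2$ denote the transformations from the camera coordinate systems of images $I_1$ and $I_2$ to the world coordinate system (applied to points in homogeneous form), $K_2$ denotes the camera intrinsic parameters of image $I_2$, $X$ denotes the point in the world coordinate system corresponding to the three-dimensional point $P_1$, $P_2$ denotes the point in the camera coordinate system of image $I_2$ corresponding to $X$, and $\tilde{p}_2 = (\tilde{u}, \tilde{v}, \tilde{w})^\top$ denotes the homogeneous pixel coordinates on the image plane of $I_2$. The pixel coordinates of the projected pixel (i.e., the projected object) in image $I_2$ can therefore be obtained as:

$$p_2 = \left(\tilde{u} / \tilde{w},\ \tilde{v} / \tilde{w}\right) \qquad (4)$$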

[0062] The model training system 120 can record the depth value $z_2$ of the three-dimensional point in the camera coordinate system of image $I_2$. The model training system 120 can then determine, using the formula shown below, whether the reference pixel $p_1$ forms a valid, comparable correspondence in image $I_2$ after 3D reprojection:

$$m = \mathbb{1}\left[z_2 > 0 \ \wedge\ p_2 \in \Omega_2\right] \qquad (5)$$

where $\mathbb{1}[\cdot]$ denotes the validity indicator function, $p_2$ denotes the projected pixel, and $\Omega_2$ denotes the pixel range of image $I_2$. If the depth value $z_2$ is greater than zero and the projected pixel $p_2$ falls within the range of image $I_2$, the reference pixel $p_1$ forms a valid point in image $I_2$. Otherwise, the reference pixel $p_1$ cannot form a valid point in image $I_2$, and zero weight is assigned to it.
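As a non-limiting illustration, the back-projection, reprojection, and validity check described in paragraphs [0060] to [0062] can be sketched as follows. The sketch assumes a pinhole camera with intrinsics given as a hypothetical tuple `(fx, fy, cx, cy)` and camera-to-world poses given as a rotation matrix and a translation vector; all function names are illustrative and do not come from the source.

```python
def back_project(pixel, depth, intrinsics):
    """Lift pixel (u, v) with a known depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def apply_pose(pose, point):
    """Apply a rigid transform (R, t) to a 3D point: R @ point + t."""
    rot, trans = pose
    return tuple(
        sum(rot[i][k] * point[k] for k in range(3)) + trans[i] for i in range(3)
    )


def invert_pose(pose):
    """Invert a rigid transform: (R, t) -> (R^T, -R^T t)."""
    rot, trans = pose
    rot_t = [[rot[k][i] for k in range(3)] for i in range(3)]
    neg = tuple(-sum(rot_t[i][k] * trans[k] for k in range(3)) for i in range(3))
    return rot_t, neg


def reproject(pixel, depth, intr1, cam1_to_world, intr2, cam2_to_world, size2):
    """Reproject a reference pixel of view 1 into view 2.

    Returns (projected_pixel, depth_in_view2, valid). The result is valid only
    if the point lies in front of camera 2 and lands inside the image bounds.
    """
    p_cam1 = back_project(pixel, depth, intr1)          # camera 1 frame
    p_world = apply_pose(cam1_to_world, p_cam1)         # world frame
    x, y, z = apply_pose(invert_pose(cam2_to_world), p_world)  # camera 2 frame
    if z <= 0:
        return None, z, False
    fx, fy, cx, cy = intr2
    u, v = fx * x / z + cx, fy * y / z + cy
    width, height = size2
    valid = 0 <= u < width and 0 <= v < height
    return (u, v), z, valid
```

With identical intrinsics and identity poses for both views, a reference pixel reprojects onto itself, which is a convenient sanity check for the coordinate conventions.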

[0063] The model training system 120 can determine the error (also referred to as the distance) between the predicted point and the projected pixel using formula (6), which includes a term representing the size of the image. The model training system 120 can then determine the first metric 210 based on this error.

[0064] In some embodiments, the model training system 120 may determine a cutoff term for a first predicted response based on whether the error exceeds a threshold, the cutoff term indicating the availability of a reward value. The model training system 120 may then determine a reward value for the first predicted response based on the error and the cutoff term. Subsequently, the model training system 120 may determine a first metric based on the reward value.

[0065] As an example, the model training system 120 can determine a distance-aware soft reward (i.e., the reward value) based on formula (7), which includes a control tolerance (i.e., the threshold) and a cutoff term (also known as a hard cutoff condition). If the distance exceeds three times the control tolerance, the predicted point is far from the projected pixel, and the cutoff term is set to zero. If the distance does not exceed three times the control tolerance, the distance between the predicted point and the projected pixel is acceptable, and the cutoff term is set to 1. The hard cutoff condition prevents rewards for guesses that deviate significantly from the target and stabilizes reinforcement learning by suppressing spurious gradients caused by severely erroneous correspondences.
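As a non-limiting illustration, a distance-aware soft reward with a hard cutoff at three times the control tolerance can be sketched as follows. The Gaussian decay used here is an assumption chosen for illustration only; the source states that the reward depends on the error and the cutoff term but the exact form of formula (7) is not reproduced in this text. The names `soft_reward` and `sigma` are hypothetical.

```python
import math


def soft_reward(distance, sigma):
    """Distance-aware reward gated by a hard cutoff at 3 * sigma.

    ASSUMPTION: the decay exp(-d^2 / (2 * sigma^2)) is illustrative; only the
    three-times-tolerance cutoff rule is taken from the description.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    cutoff = 1.0 if distance <= 3.0 * sigma else 0.0
    return cutoff * math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
```

Under this assumed form, a prediction exactly on the projected pixel earns the full reward, the reward decays smoothly with distance, and any prediction farther than three tolerances away earns nothing.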

[0066] In some examples, the model training system 120 can determine the overlap mask corresponding to the projected object based on whether the depth coordinates of the projected object fall within the visible depth range of the second image. The model training system 120 can then determine the first metric based on the error and the overlap mask. Reprojection alone may still reward points that are geometrically plausible but occluded in image $I_2$. To ensure physical plausibility, a depth consistency check is performed on image $I_2$ and an overlap mask is constructed. For each valid reprojection result, the model training system 120 determines whether the projected pixel is visible based on the following formula:

$$v = \mathbb{1}\left[\,\left|D_2(p_2) - z_2\right| \le \delta\,\right] \qquad (8)$$

where $\delta$ denotes the depth threshold, $D_2(p_2)$ denotes the actual depth of the projected pixel $p_2$ read from the depth map of image $I_2$, and $z_2$ denotes the projected depth value. The model training system 120 checks whether the absolute value of the difference between the actual depth and the projected depth value is within the depth threshold. If the absolute value is less than or equal to the depth threshold, the projected pixel is visible; if the absolute value is greater than the depth threshold, the projected pixel is not visible. The model training system 120 can determine the overlap mask based on the visibility determination results of the projected pixels. The overlap mask identifies the pixels of image $I_1$ that are actually visible in image $I_2$ under the given camera configuration, and it is derived through the same reprojection process used to calculate the projected pixels. The model training system 120 can then determine the first metric 210 using formula (9). This overlap gating mechanism turns the reward into a visibility-aware signal: even when a predicted location is near the projected pixel, any point outside the depth-consistent co-visible region receives zero reward, which suppresses correspondence matching in occluded or non-overlapping regions. Overall, the XY reward mechanism explicitly abandons appearance-based heuristics and instead promotes cross-view alignment that satisfies both reprojection consistency and visibility requirements. In multi-view spatial reasoning tasks, a model should not merely "guess" correspondences based on two-dimensional appearance features. A correct correspondence must meet geometric requirements: the predicted point in the target view must be consistent with the reprojection of the reference point under the camera intrinsics, pose parameters, and depth information. By enforcing reprojection consistency and overlap validity, the XY reward mechanism therefore directly supervises the establishment of two-dimensional correspondences and provides dense, physically grounded guidance for the reinforcement learning process.
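As a non-limiting illustration, the depth consistency check and the overlap gating described in paragraph [0066] can be sketched as follows. The visibility rule follows the description; combining the gates with the reward by zeroing it out is an assumption, since formula (9) is not reproduced in this text. The function names are hypothetical.

```python
def is_visible(projected_depth, observed_depth, depth_threshold):
    """Depth consistency check for a projected pixel.

    The pixel counts as visible when the depth read from the target view's
    depth map agrees with the reprojected depth to within the threshold.
    """
    return abs(observed_depth - projected_depth) <= depth_threshold


def xy_reward(reward_value, valid, visible):
    """Gate a reward value by reprojection validity and visibility.

    ASSUMPTION: gating is modeled as zeroing the reward when either check
    fails; the source only states that the first metric is determined from
    the error and the overlap mask.
    """
    return reward_value if (valid and visible) else 0.0
```

For example, a projected point with reprojected depth 2.0 is treated as occluded if the depth map reports 1.2 at that pixel and the threshold is 0.1, so even a prediction that lands exactly on it receives zero reward.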

[0067] In some examples, the training samples may include a second training sample, which includes second visual information, a second sample query, and a second ground truth response. The second visual information may be a third image, which includes multiple second objects. The second sample query requests a determination of the ranking of the multiple second objects in the depth direction of the third image, and the second ground truth response indicates the ground truth ranking of the multiple second objects in the depth direction of the third image. The model training system 120 may use the machine learning model 105 to determine a second predicted response based on the third image, which indicates a predicted ranking of the multiple second objects in the depth direction of the third image. Subsequently, the model training system 120 may determine the second metric 212 based on the difference between the predicted ranking and the ground truth ranking.

[0068] Figure 4 shows a schematic diagram of an example 400 of a second training sample according to some embodiments of the present disclosure. As shown in Figure 4, the second training sample may include image 410, which may include chair 412, table 414, and wardrobe 416. The second sample query may request a determination of the order of chair 412, table 414, and wardrobe 416 in the depth direction of image 410. The model training system 120 may determine a second predicted response based on image 410, which may indicate the predicted order of chair 412, table 414, and wardrobe 416 in the depth direction of image 410.

[0069] In some examples, the model training system 120 divides the multiple second objects into multiple object combinations, each of which includes at least two second objects. The model training system 120 then determines first-class object combinations and second-class object combinations among the multiple object combinations based on the predicted ranking and the ground truth ranking. The at least two second objects in a first-class object combination have the same relative order in the predicted ranking and the ground truth ranking, while the at least two second objects in a second-class object combination do not. The model training system 120 then determines the second metric based on a first number of first-class object combinations and a second number of second-class object combinations. In some cases, updating the parameters of the machine learning model 105 based on the second metric can be termed, for example, a "Z-reward mechanism."

[0070] As an example, the model training system 120 can determine a second predicted response based on the third image, the second predicted response including a predicted depth sequence. The predicted depth sequence can indicate the predicted depth information of the n second objects in the third image, such as predicted 3D bounding boxes. The second ground truth response can include a ground truth depth sequence. The ground truth depth sequence can indicate the ground truth depth information of the n second objects, such as ground truth 3D bounding boxes.

[0071] The model training system 120 can match the predicted 3D bounding boxes with the ground truth 3D bounding boxes using, for example, the Hungarian assignment algorithm with a cost function based on the 3D generalized intersection over union (GIoU). The model training system 120 can then use Kendall's rank correlation coefficient (Kendall's $\tau$) to assess whether the ranking of the predicted 3D bounding boxes matches the ranking of the ground truth 3D bounding boxes. Specifically, for each pair of distinct objects $(i, j)$ among the n objects (i.e., an object combination), the model training system 120 can determine whether the condition shown in formula (10) is met. In formula (10), $\hat{b}_i$ and $\hat{b}_j$ denote the predicted 3D bounding boxes of objects $i$ and $j$, and $b_i$ and $b_j$ denote the ground truth 3D bounding boxes of objects $i$ and $j$. If the condition shown in formula (10) is met, the model training system 120 determines that the pair $(i, j)$ is a consistent pair (i.e., a first-class object combination). If the condition is not met, the model training system 120 determines that the pair $(i, j)$ is an inconsistent pair (i.e., a second-class object combination).

[0072] The model training system 120 can determine the Kendall coefficient $\tau$ based on the formula shown below:

$$\tau = \frac{N_c - N_d}{N_c + N_d} \qquad (11)$$

where $N_c$ denotes the number of consistent pairs (i.e., the first number) and $N_d$ denotes the number of inconsistent pairs (i.e., the second number). The Kendall coefficient $\tau$ measures the consistency between the predicted depth ranking and the ground truth depth ranking. The model training system 120 can use the following formula to normalize the Kendall coefficient $\tau$ to the interval $[0, 1]$ and obtain the second metric 212 (i.e., the depth ranking reward value):

$$R_Z = \frac{\tau + 1}{2} \qquad (12)$$

where $R_Z$ denotes the second metric. After the model's basic localization capability was established in the supervised fine-tuning phase, further optimizing localization during reinforcement learning was found to provide no additional improvement for spatial visual question answering. The core challenge lies not in object detection itself, but in inferring the relative depth relationships between objects in three-dimensional space from two-dimensional observations. The Z-reward mechanism is used to strengthen the ability of the machine learning model 105 to recover the relative depth structure of the scene, thereby improving the accuracy of depth ranking.
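As a non-limiting illustration, the pairwise consistency count, the Kendall coefficient, and its normalization to the interval [0, 1] can be sketched as follows. The sketch assumes that each matched object has already been reduced to a single scalar depth (for example, taken from its 3D bounding box), which is a simplification; the function name and the example depths are hypothetical.

```python
from itertools import combinations


def depth_ranking_reward(pred_depths, true_depths):
    """Normalized Kendall coefficient between predicted and true depth orders.

    pred_depths[i] and true_depths[i] are scalar depths of the same matched
    object. A pair counts as consistent when both orderings agree; any other
    pair (including ties) counts as inconsistent.
    """
    if len(pred_depths) != len(true_depths):
        raise ValueError("depth lists must have the same length")
    if len(true_depths) < 2:
        raise ValueError("at least two objects are needed to form a pair")
    consistent = inconsistent = 0
    for i, j in combinations(range(len(true_depths)), 2):
        agreement = (pred_depths[i] - pred_depths[j]) * (true_depths[i] - true_depths[j])
        if agreement > 0:
            consistent += 1
        else:
            inconsistent += 1
    tau = (consistent - inconsistent) / (consistent + inconsistent)
    return (tau + 1.0) / 2.0


# Chair, table, wardrobe with hypothetical depths; the prediction swaps the
# last two objects, so two of the three pairs are consistent.
reward = depth_ranking_reward([1.0, 3.0, 2.0], [1.0, 2.0, 3.0])
```

A fully correct ordering yields 1, a fully reversed ordering yields 0, and partially correct orderings fall in between, which gives the policy a graded signal instead of an all-or-nothing one.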

[0073] In some examples, the training samples include a third training sample, which includes a first image sequence, a second image sequence, a positive sample query corresponding to the first image sequence, an inverse sample query corresponding to the second image sequence, a third ground truth response for the positive sample query, and a fourth ground truth response for the inverse sample query. The first image sequence includes multiple fourth images arranged in chronological order, and the second image sequence includes the multiple fourth images arranged in reverse chronological order. The model training system 120 can use the machine learning model 105 to determine a third predicted response for the positive sample query based on the first image sequence, and a fourth predicted response for the inverse sample query based on the second image sequence. The model training system 120 can determine a first accuracy of the third predicted response relative to the third ground truth response and a second accuracy of the fourth predicted response relative to the fourth ground truth response. The model training system 120 can then determine the third metric based on the first accuracy and the second accuracy. In some cases, updating the parameters of the machine learning model 105 based on the third metric can be called a "T-reward mechanism". To explicitly strengthen the temporal dimension, the T-reward mechanism can be used to enforce cycle consistency: the reasoning from the initial view to the final view must be logically reversible and must remain consistent when the query is posed in the reverse direction.

[0074] Figure 5 shows a schematic diagram of an example 500 of a third training sample according to some embodiments of the present disclosure. As shown in Figure 5, the first image sequence may include images 510 and 520 arranged in that order, and the second image sequence may include images 520 and 510 arranged in that order. Both the positive sample query and the inverse sample query ask whether the camera's direction of motion is "left turn" or "right turn". The third ground truth response can be represented, for example, as $y_{fwd}$, and the fourth ground truth response can be represented, for example, as $y_{rev}$. The model training system 120 can use the machine learning model 105 to determine a third predicted response based on the first image sequence, which can be represented, for example, as $\hat{y}_{fwd}$. The model training system 120 can use the machine learning model 105 to determine a fourth predicted response based on the second image sequence, which can be represented, for example, as $\hat{y}_{rev}$. The model training system 120 can then determine the third metric 214 based on formula (13). The T-reward mechanism requires the machine learning model 105 to maintain logical invertibility between the forward and reverse motion sequences, supplementing the accuracy reward and thus promoting a robust understanding of ego-motion and temporal causality.
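As a non-limiting illustration, a cycle-consistency reward over a forward and a reversed image sequence can be sketched as follows. The exact form of formula (13) is not reproduced in this text, so the sketch assumes the simplest combination of the two accuracies: the reward is granted only when both the forward answer and the reverse answer are correct. The function name and the answer labels are hypothetical.

```python
def t_reward(forward_pred, forward_truth, reverse_pred, reverse_truth):
    """Reward that requires correct answers in both temporal directions.

    ASSUMPTION: the two accuracies are combined as a logical AND (product of
    indicators); the source only states that the third metric is determined
    from the first accuracy and the second accuracy.
    """
    forward_ok = forward_pred == forward_truth
    reverse_ok = reverse_pred == reverse_truth
    return 1.0 if (forward_ok and reverse_ok) else 0.0
```

Under this assumption, a model that always answers "right turn" regardless of frame order cannot collect the reward, because the reversed sequence has the opposite ground truth.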

[0075] Referring back to Figure 2B, during the training phase 104, the model training system 120 can update the parameters of the machine learning model 105 based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model 105. In some examples, the model training system 120 can determine a loss function based on the first metric, the second metric, the third metric, and their respective weights. The model training system 120 can then update the parameters of the machine learning model 105 based on the loss function to obtain the trained machine learning model 105. In this way, the model training system 120 can employ reinforcement learning (RL) techniques to further improve the spatial reasoning ability of the machine learning model 105.

[0076] In some examples, the ground truth response includes a ground truth result for the sample query and ground truth logic derivation steps corresponding to the ground truth result, which describe the process of deriving the ground truth result from the sample visual information. The model training system 120 can use the machine learning model 105 to determine a predicted response based on the sample visual information, the predicted response including a prediction result and prediction logic derivation steps, which describe the process of deriving the prediction result from the sample visual information. The model training system 120 can determine a third accuracy of the prediction result relative to the ground truth result. Subsequently, the model training system 120 can determine the matching degree of the prediction logic derivation steps relative to the ground truth logic derivation steps.

[0077] As an example, with continued reference to Figure 5, the ground truth result can indicate whether the camera moves left or right. The ground truth logic derivation steps can include steps 1, 2, and 3. Step 1: Anchor point positioning: In the first frame image 510, the main anchor objects are the wardrobe 512 in the background, the table 514 to the left of the wardrobe 512, and the bed 516 to the right of the wardrobe 512. The initial position of the camera makes the left wall 518 of the room closer to the viewer than the right wall 522, thus creating a clear depth and perspective effect.

[0078] Step 2: Transition: Upon moving to the second frame image 520, the camera position subtly changes. The wardrobe 512, originally located on the right side of image 510, is now in the center of image 520; the table, originally in the center of image 510, moves to the left side of image 520; and the bed 516, originally on the right side of image 510, moves closer to the center of image 520. This adjustment in the relative positions of the wardrobe 512, table 514, and bed 516 indicates that the camera has moved to the right relative to the scene. The perspective of the table 514 remains stable, confirming that the camera movement is a lateral displacement rather than a rotational motion.

[0079] Step 3: Verification Analysis: The motion from the first frame image 510 to the second frame image 520 is a rightward translation. This displacement causes the wardrobe 512 and table 514 to appear to shift to the left in the frame, which is consistent with the expectation that the camera is moving to the right. If the camera were moving to the left, the wardrobe 512, table 514, and bed 516 would appear to shift to the right while the jar would shift to the left, but this phenomenon was not actually observed. Therefore, the parallax displacement evidence between the wardrobe 512, table 514, and bed 516 confirms the rightward movement of the camera.

[0080] The model training system 120 can determine an accuracy reward based on the third accuracy:

$$R_{acc} = \mathbb{1}\left[\hat{y} = y\right] \qquad (14)$$

where $R_{acc}$ denotes the accuracy reward, $y$ denotes the ground truth result, $\hat{y}$ denotes the prediction result, and $\mathbb{1}[\cdot]$ denotes the indicator function.

[0081] The model training system 120 can determine a format reward based on the matching degree between the prediction logic derivation steps and the ground truth logic derivation steps. The model training system 120 can then use the format constraint to determine a fourth metric. In this way, the output of the machine learning model 105 can be strictly constrained to conform to the format constraint, which ensures the structural integrity of the model's output. Performance can therefore be improved by iteratively optimizing the model's reasoning process under the "anchor-transition-verification" framework.

[0082] In some examples, the model training system 120 can update the parameters of the machine learning model 105 based on at least one of the third accuracy and the matching degree to obtain the trained machine learning model 105. For example, the model training system 120 can determine the total reward function $R$ based on the formula shown below:

$$R = \lambda_{acc} R_{acc} + \lambda_{1} R_{XY} + \lambda_{2} R_{Z} + \lambda_{3} R_{T} \qquad (15)$$

where $\lambda_{acc}$, $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ denote the weights corresponding to the accuracy reward $R_{acc}$, the first metric $R_{XY}$, the second metric $R_{Z}$, and the third metric $R_{T}$, respectively. By decomposing the reward space into explicit planar, depth, and temporal dimensions, the reinforcement learning framework shifts from simple pattern matching toward a deeper, physics-based understanding of four-dimensional scenes.
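As a non-limiting illustration, combining the accuracy reward with the three dimension-specific metrics can be sketched as follows. The sketch assumes a weighted sum, and the default weight values are placeholders chosen for illustration; neither the weight values nor the function names come from the source.

```python
def accuracy_reward(prediction, truth):
    """Indicator reward: 1 when the prediction equals the ground truth result."""
    return 1.0 if prediction == truth else 0.0


def total_reward(r_acc, r_xy, r_z, r_t, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the accuracy reward and the XY, Z, and T metrics.

    ASSUMPTION: a plain weighted sum; the default weights are hypothetical.
    """
    w_acc, w_xy, w_z, w_t = weights
    return w_acc * r_acc + w_xy * r_xy + w_z * r_z + w_t * r_t


# Correct final answer, partial credit on the planar and depth metrics,
# and no credit on the temporal metric.
example_total = total_reward(
    accuracy_reward("right", "right"), 0.5, 0.5, 0.0, weights=(1.0, 0.5, 0.5, 1.0)
)
```

Keeping the four terms separate makes it possible to see which dimension is limiting a given response, and to rebalance the weights if one reward dominates training.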

[0083] In summary, a decompositional reinforcement learning method is adopted, which decomposes spatial reasoning into three complementary dimensions: the planar spatial dimension, the depth spatial dimension, and the temporal dimension. This method addresses the tendency of traditional visual language models to hallucinate in dynamic spatial reasoning and to infer spatial relationships heuristically rather than from depth information or temporal consistency. It can effectively recover the dimensional information folded by camera projection, which helps improve the accuracy and consistency of machine learning models in three-dimensional and four-dimensional spatial reasoning tasks.

[0084] Figure 6 shows a flowchart of a process 600 for training a machine learning model according to some embodiments of the present disclosure. Process 600 may be implemented at the model training system 120 or the model pre-training system 110. Process 600 is described below with reference to Figure 1.

[0085] At block 610, the model training system 120 acquires training samples, which include sample visual information, sample queries related to the sample visual information, and ground truth responses to the sample queries.

[0086] At block 620, the model training system 120 uses the machine learning model to be trained to determine a predicted response to a sample query based on the sample visual information.

[0087] At block 630, the model training system 120 determines at least one of a first metric, a second metric, and a third metric based at least on the ground truth response and the predicted response. The first metric indicates the difference between the ground truth response and the predicted response in the planar spatial dimension, the second metric indicates the difference between them in the depth spatial dimension, and the third metric indicates the difference between them in the temporal dimension.

[0088] At block 640, the model training system 120 updates the parameters of the machine learning model based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

[0089] In some examples, the training samples include a first training sample, which includes a first image and a second image. The first image includes a first object, and the second image includes a plurality of candidate objects. Determining a predicted response includes: using a machine learning model to determine a first predicted response based on the first image and the second image. The first predicted response indicates a predicted object among the plurality of candidate objects that corresponds to the first object.

[0090] In some examples, the first training sample also includes a first ground-truth response, the first ground-truth response indicating the projected object formed by projecting the first object onto the second image, and wherein determining the first metric includes: determining the error between the predicted object and the projected object based on the first ground-truth response and the first predicted response; and determining the first metric based on the error.

[0091] In some examples, determining the first metric based on the error includes: determining a cutoff term for the first predicted response based on whether the error exceeds a threshold, the cutoff term indicating the validity of the first predicted response; determining a reward value for the first predicted response based on the error and the cutoff term; and determining the first metric based on the reward value.
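One possible reading of this thresholded reward is sketched below. It assumes the predicted and projected objects are represented by 2D points in normalized image coordinates, that the error is their Euclidean distance, and that the reward decays exponentially with the error; the names `planar_reward`, `threshold`, and `scale` and their default values are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def planar_reward(
    predicted_xy: Point,
    projected_xy: Point,
    threshold: float = 0.1,
    scale: float = 0.05,
) -> float:
    """Reward for a planar prediction, zeroed when the error is too large."""
    # Error between the predicted object and the projected object (assumed
    # here to be the Euclidean distance between two image points).
    error = math.dist(predicted_xy, projected_xy)
    # Cutoff: 0.0 marks the prediction invalid when the error exceeds the
    # threshold, 1.0 marks it valid otherwise.
    cutoff = 0.0 if error > threshold else 1.0
    # Reward value from the error and the cutoff (assumed exponential decay).
    return cutoff * math.exp(-error / scale)
```

With these assumed defaults, an exact match scores 1.0 and any prediction farther than 0.1 from the projected point scores 0.0.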

[0092] In some examples, determining the first metric based on the error includes: determining the overlap mask corresponding to the projected object based on whether the depth coordinates of the projected object fall within the visible depth range of the second image; and determining the first metric based on the error and the overlap mask.
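The overlap mask can be pictured as a visibility flag on the projected object. The sketch below assumes a batch of samples, each with a planar error and the depth coordinate of its projected object, and averages the error only over samples whose depth lies inside an assumed visible range `[near, far]`; the function names and range values are hypothetical.

```python
from typing import Optional, Sequence


def overlap_mask(depth_z: float, near: float, far: float) -> float:
    """1.0 if the projected object lies in the visible depth range, else 0.0."""
    return 1.0 if near <= depth_z <= far else 0.0


def masked_planar_error(
    errors: Sequence[float],
    depths: Sequence[float],
    near: float = 0.1,
    far: float = 100.0,
) -> Optional[float]:
    """Mean planar error over visible projections; None if none are visible."""
    masks = [overlap_mask(z, near, far) for z in depths]
    visible = sum(masks)
    if visible == 0:
        return None  # no valid projection, so no planar signal for this batch
    return sum(e * m for e, m in zip(errors, masks)) / visible
```

Masking in this way keeps projections that fall behind the camera or beyond the visible range from contributing a misleading error.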

[0093] In some examples, the training samples include a second training sample, the second training sample includes a third image, the third image includes a plurality of second objects, and determining the predicted response includes: using the machine learning model to determine, based on the third image, a second predicted response, the second predicted response indicating the predicted ranking of the plurality of second objects in the depth direction of the third image.

[0094] In some examples, the second training sample includes a second ground-truth response, which indicates the ground-truth ranking of the plurality of second objects in the depth direction of the third image, and wherein determining the second metric includes determining the second metric based on the difference between the predicted ranking and the ground-truth ranking.

[0095] In some examples, determining the second metric based on the difference between the predicted ranking and the ground-truth ranking includes: dividing the plurality of second objects into a plurality of object combinations, each of the plurality of object combinations including at least two second objects; determining a first type of object combination and a second type of object combination among the plurality of object combinations based on the predicted ranking and the ground-truth ranking, wherein the at least two second objects in a first-type object combination have the same relative order in the predicted ranking and the ground-truth ranking, and the at least two second objects in a second-type object combination have different relative orders in the predicted ranking and the ground-truth ranking; and determining the second metric based on a first number of first-type object combinations and a second number of second-type object combinations.
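Counting object combinations that agree or disagree in order resembles a Kendall-tau-style rank correlation. The sketch below assumes the combinations are all pairs of objects, that both rankings list the same objects without ties, and that the metric is the normalized difference between the two counts; the disclosure does not fix this formula, and the name `depth_ranking_metric` is hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence


def depth_ranking_metric(
    predicted: Sequence[Hashable], truth: Sequence[Hashable]
) -> float:
    """(concordant - discordant) / total pairs, in the range [-1.0, 1.0]."""
    pred_pos = {obj: i for i, obj in enumerate(predicted)}
    true_pos = {obj: i for i, obj in enumerate(truth)}
    concordant = 0  # pairs ordered the same way in both rankings
    discordant = 0  # pairs ordered differently in the two rankings
    for a, b in combinations(truth, 2):
        same_order = (pred_pos[a] < pred_pos[b]) == (true_pos[a] < true_pos[b])
        if same_order:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

For example, predicting the order `["a", "c", "b"]` against the ground truth `["a", "b", "c"]` gives two agreeing pairs and one disagreeing pair, so the metric is 1/3.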

[0096] In some examples, the training samples include a third training sample, which includes a first image sequence, a second image sequence, a positive sample query corresponding to the first image sequence, and an inverse sample query corresponding to the second image sequence. The first image sequence includes a plurality of fourth images arranged in chronological order, and the second image sequence includes the plurality of fourth images arranged in reverse chronological order. Determining the predicted response includes: using the machine learning model to determine, based on the first image sequence, a third predicted response for the positive sample query; and using the machine learning model to determine, based on the second image sequence, a fourth predicted response for the inverse sample query.

[0097] In some examples, the third training sample includes a third ground-truth response for the positive sample query and a fourth ground-truth response for the inverse sample query, and wherein determining the third metric includes: determining a first accuracy of the third predicted response relative to the third ground-truth response; determining a second accuracy of the fourth predicted response relative to the fourth ground-truth response; and determining the third metric based on the first accuracy and the second accuracy.
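A minimal sketch of the temporal metric follows. It assumes each accuracy is an exact-match score on short text answers and offers two assumed ways of combining them: a mean, or a stricter product that rewards the model only when both the forward and the reversed sequence are answered correctly. The function names and the `strict` flag are hypothetical.

```python
def exact_match(predicted: str, truth: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return 1.0 if predicted.strip().lower() == truth.strip().lower() else 0.0


def temporal_metric(
    forward_pred: str,
    forward_truth: str,
    reverse_pred: str,
    reverse_truth: str,
    strict: bool = False,
) -> float:
    """Combine forward and reverse accuracies into one temporal score."""
    first_accuracy = exact_match(forward_pred, forward_truth)
    second_accuracy = exact_match(reverse_pred, reverse_truth)
    if strict:
        # Product: no credit unless both playback directions are correct.
        return first_accuracy * second_accuracy
    # Mean: partial credit when only one direction is answered correctly.
    return (first_accuracy + second_accuracy) / 2.0
```

The strict variant penalizes a model that gives the same answer regardless of playback direction, since such a model can be right on at most one of the two queries when their ground-truth answers differ.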

[0098] In some examples, the ground-truth response includes a ground-truth result for a sample query and a ground-truth logic derivation step corresponding to the ground-truth result. The ground-truth logic derivation step describes the process of deriving the ground-truth result from the sample visual information. Determining the predicted response includes: using the machine learning model to determine, based on the sample visual information, a predicted response that includes a predicted result and a predicted logic derivation step, which describes the process of deriving the predicted result from the sample visual information.

[0099] In some examples, process 600 further includes: determining a third accuracy of the predicted result relative to the ground-truth result; and determining the matching degree of the predicted logic derivation step relative to the ground-truth logic derivation step; and wherein updating the parameters of the machine learning model includes: updating the parameters of the machine learning model based on at least one of the third accuracy and the matching degree to obtain a trained machine learning model.
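The disclosure does not say how the matching degree is computed. As one hedged possibility, the sketch below scores the reasoning text with the token-level similarity ratio from Python's standard `difflib.SequenceMatcher` and mixes it with an exact-match accuracy on the final result; the function names and the mixing `weight` are hypothetical.

```python
from difflib import SequenceMatcher


def result_accuracy(predicted_result: str, true_result: str) -> float:
    """1.0 if the final answers match after trimming and lowercasing."""
    same = predicted_result.strip().lower() == true_result.strip().lower()
    return 1.0 if same else 0.0


def step_matching_degree(predicted_steps: str, true_steps: str) -> float:
    """Token-level similarity of two derivations, in the range [0.0, 1.0]."""
    matcher = SequenceMatcher(None, predicted_steps.split(), true_steps.split())
    return matcher.ratio()


def reasoning_reward(
    predicted_result: str,
    true_result: str,
    predicted_steps: str,
    true_steps: str,
    weight: float = 0.5,
) -> float:
    """Weighted mix of answer accuracy and derivation matching degree."""
    accuracy = result_accuracy(predicted_result, true_result)
    matching = step_matching_degree(predicted_steps, true_steps)
    return weight * accuracy + (1.0 - weight) * matching
```

A surface-level text similarity is only a stand-in here; a learned or rule-based comparison of the derivation steps would fit the same interface.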

[0100] In some examples, determining the predicted response includes: performing feature encoding on the sample visual information to obtain a sample feature representation; and using the machine learning model to determine the predicted response based on the sample feature representation.

[0101] In some examples, process 600 further includes: obtaining pre-training samples, which include pre-training visual information, pre-training queries related to the pre-training visual information, and pre-training responses to the pre-training queries; and pre-training a machine learning model based on the pre-training samples to obtain the machine learning model to be trained.

[0102] Figure 7 shows a block diagram of an apparatus 700 for training a machine learning model 105 according to some embodiments of the present disclosure. The apparatus 700 may be implemented as or included in a model pre-training system 110 or a model training system 120.

[0103] The apparatus 700 includes: an acquisition module 710 configured to acquire training samples, the training samples including sample visual information, sample queries related to the sample visual information, and ground-truth responses to the sample queries; a first determination module 720 configured to determine a predicted response to the sample query based on the sample visual information using a machine learning model to be trained; a second determination module 730 configured to determine at least one of a first metric, a second metric, and a third metric based at least on the ground-truth response and the predicted response, the first metric indicating the difference between the ground-truth response and the predicted response in a planar spatial dimension, the second metric indicating the difference between the ground-truth response and the predicted response in a depth spatial dimension, and the third metric indicating the difference between the ground-truth response and the predicted response in a time dimension; and an update module 740 configured to update the parameters of the machine learning model based on at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

[0104] In some examples, the training samples include a first training sample, which includes a first image and a second image. The first image includes a first object, and the second image includes a plurality of candidate objects. The first determination module 720 is further configured to: determine a first predicted response based on the first image and the second image using the machine learning model. The first predicted response indicates a predicted object among the plurality of candidate objects that corresponds to the first object.

[0105] In some examples, the first training sample also includes a first ground-truth response, the first ground-truth response indicating the projected object formed by projecting the first object onto the second image, and wherein the second determination module 730 is further configured to: determine the error between the predicted object and the projected object based on the first ground-truth response and the first predicted response; and determine the first metric based on the error.

[0106] In some examples, the second determination module 730 is further configured to: determine a cutoff term for the first predicted response based on whether the error exceeds a threshold, the cutoff term indicating the validity of the first predicted response; determine a reward value for the first predicted response based on the error and the cutoff term; and determine the first metric based on the reward value.

[0107] In some examples, the second determination module 730 is further configured to: determine an overlap mask corresponding to the projected object based on whether the depth coordinates of the projected object fall within the visible depth range of the second image; and determine the first metric based on the error and the overlap mask.

[0108] In some examples, the training samples include a second training sample, the second training sample includes a third image, the third image includes a plurality of second objects, and wherein the first determination module 720 is further configured to: determine a second predicted response based on the third image using the machine learning model, the second predicted response indicating a predicted ranking of the plurality of second objects in the depth direction of the third image.

[0109] In some examples, the second training sample includes a second ground-truth response, which indicates the ground-truth ranking of the plurality of second objects in the depth direction of the third image, and wherein the second determination module 730 is further configured to determine the second metric based on the difference between the predicted ranking and the ground-truth ranking.

[0110] In some examples, the second determination module 730 is further configured to: divide the plurality of second objects into a plurality of object combinations, each of the plurality of object combinations including at least two second objects; determine a first type of object combination and a second type of object combination among the plurality of object combinations based on the predicted ranking and the ground-truth ranking, wherein the at least two second objects in a first-type object combination have the same relative order in the predicted ranking and the ground-truth ranking, and the at least two second objects in a second-type object combination have different relative orders in the predicted ranking and the ground-truth ranking; and determine the second metric based on a first number of first-type object combinations and a second number of second-type object combinations.

[0111] In some examples, the training samples include a third training sample, which includes a first image sequence, a second image sequence, a positive sample query corresponding to the first image sequence, and an inverse sample query corresponding to the second image sequence. The first image sequence includes a plurality of fourth images arranged in chronological order, and the second image sequence includes the plurality of fourth images arranged in reverse chronological order. The first determination module 720 is further configured to: determine a third predicted response for the positive sample query based on the first image sequence using the machine learning model; and determine a fourth predicted response for the inverse sample query based on the second image sequence using the machine learning model.

[0112] In some examples, the third training sample includes a third ground-truth response to the positive sample query and a fourth ground-truth response to the inverse sample query, and wherein the second determination module 730 is further configured to: determine a first accuracy of the third predicted response relative to the third ground-truth response; determine a second accuracy of the fourth predicted response relative to the fourth ground-truth response; and determine the third metric based on the first accuracy and the second accuracy.

[0113] In some examples, the ground-truth response includes a ground-truth result for a sample query and a ground-truth logic derivation step corresponding to the ground-truth result. The ground-truth logic derivation step describes the process of deriving the ground-truth result from the sample visual information. The first determination module 720 is further configured to: use the machine learning model to determine, based on the sample visual information, a predicted response that includes a predicted result and a predicted logic derivation step, which describes the process of deriving the predicted result from the sample visual information.

[0114] In some examples, the apparatus 700 further includes: a third determination module configured to determine a third accuracy of the predicted result relative to the ground-truth result, and to determine the matching degree of the predicted logic derivation step relative to the ground-truth logic derivation step; and wherein the update module 740 is further configured to: update the parameters of the machine learning model based on at least one of the third accuracy and the matching degree to obtain a trained machine learning model.

[0115] In some examples, the first determination module 720 is further configured to: perform feature encoding on the sample visual information to obtain a sample feature representation; and use the machine learning model to determine the predicted response based on the sample feature representation.

[0116] In some examples, the apparatus 700 further includes: a pre-training module configured to acquire pre-training samples, the pre-training samples including pre-training visual information, pre-training queries related to the pre-training visual information, and pre-training responses to the pre-training queries; and to perform pre-training on a machine learning model based on the pre-training samples to obtain a machine learning model to be trained.

[0117] The modules included in apparatus 700 can be implemented in various ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more modules can be implemented using software and / or firmware, such as machine-executable instructions stored on a storage medium. In addition to or as an alternative to machine-executable instructions, some or all of the modules in apparatus 700 can be implemented at least partially by one or more hardware logic components. By way of example and not limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chips (SoCs), complex programmable logic devices (CPLDs), and so on.

[0118] Figure 8 shows a block diagram of an electronic device 800 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 800 shown in Figure 8 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein.

[0119] As shown in Figure 8, electronic device 800 is in the form of a general-purpose electronic device. Components of electronic device 800 may include, but are not limited to, one or more processors or processing units 810, memory 820, storage device 830, one or more communication units 840, one or more input devices 850, and one or more output devices 860. Processing unit 810 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 820. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of electronic device 800.

[0120] Electronic device 800 typically includes multiple computer storage media. Such media can be any available media accessible to electronic device 800, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 820 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 830 can be removable or non-removable media and can include machine-readable media, such as flash drives, disks, or any other media capable of storing information and / or data and accessible within electronic device 800.

[0121] Electronic device 800 may further include additional removable / non-removable, volatile / non-volatile storage media. Although not shown in Figure 8, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces. Memory 820 may include computer program product 825 having one or more program modules configured to perform various methods or actions of various embodiments of this disclosure.

[0122] The communication unit 840 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 800 can be implemented using a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the electronic device 800 can operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.

[0123] Input device 850 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 860 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, electronic device 800 can also communicate via communication unit 840 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with electronic device 800, or with any device (e.g., a network card or modem) that enables electronic device 800 to communicate with one or more other electronic devices. Such communication can be performed via an input / output (I / O) interface (not shown).

[0124] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions that are executed by a processor to implement the methods described above.

[0125] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0126] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0127] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0128] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0129] Various implementations of this disclosure have been described above. The foregoing description is exemplary and not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles of the implementations, their practical applications, or their improvements over technologies found in the market, or to enable others skilled in the art to understand the implementations disclosed herein.

Claims

1. A method for training a machine learning model, comprising: obtaining training samples, the training samples including sample visual information, sample queries related to the sample visual information, and ground-truth responses to the sample queries; determining, using the machine learning model to be trained, a predicted response to the sample query based on the sample visual information; determining, based at least on the ground-truth response and the predicted response, at least one of a first metric, a second metric, and a third metric, wherein the first metric indicates the difference between the ground-truth response and the predicted response in a planar spatial dimension, the second metric indicates the difference between the ground-truth response and the predicted response in a depth spatial dimension, and the third metric indicates the difference between the ground-truth response and the predicted response in a time dimension; and updating the parameters of the machine learning model based on the at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

2. The method of claim 1, wherein the training samples include a first training sample, the first training sample includes a first image and a second image, the first image includes a first object, the second image includes a plurality of candidate objects, and wherein determining the predicted response includes: determining, using the machine learning model, a first predicted response based on the first image and the second image, the first predicted response indicating a predicted object among the plurality of candidate objects that corresponds to the first object.

3. The method of claim 2, wherein the first training sample further comprises a first ground-truth response, the first ground-truth response indicating a projected object formed by projecting the first object onto the second image, and wherein determining the first metric comprises: determining an error between the predicted object and the projected object based on the first ground-truth response and the first predicted response; and determining the first metric based on the error.

4. The method of claim 3, wherein determining the first metric based on the error comprises: determining a cutoff term for the first predicted response based on whether the error exceeds a threshold, the cutoff term indicating the validity of the first predicted response; determining a reward value for the first predicted response based on the error and the cutoff term; and determining the first metric based on the reward value.

5. The method of claim 3, wherein determining the first metric based on the error comprises: determining an overlap mask corresponding to the projected object based on whether the depth coordinates of the projected object fall within the visible depth range of the second image; and determining the first metric based on the error and the overlap mask.

6. The method of claim 1, wherein the training samples include a second training sample, the second training sample includes a third image, the third image includes a plurality of second objects, and wherein determining the predicted response includes: determining, using the machine learning model, a second predicted response based on the third image, the second predicted response indicating a predicted ranking of the plurality of second objects in the depth direction of the third image.

7. The method of claim 6, wherein the second training sample includes a second ground-truth response, the second ground-truth response indicating a ground-truth ranking of the plurality of second objects in the depth direction of the third image, and wherein determining the second metric includes: determining the second metric based on the difference between the predicted ranking and the ground-truth ranking.

8. The method of claim 7, wherein determining the second metric based on the difference between the predicted ranking and the ground-truth ranking comprises: dividing the plurality of second objects into a plurality of object combinations, each object combination including at least two second objects; determining, based on the predicted ranking and the ground-truth ranking, a first type of object combination and a second type of object combination among the plurality of object combinations, wherein the at least two second objects in a first-type object combination have the same relative order in the predicted ranking and the ground-truth ranking, and the at least two second objects in a second-type object combination have different relative orders in the predicted ranking and the ground-truth ranking; and determining the second metric based on a first number of first-type object combinations and a second number of second-type object combinations.

9. The method of claim 1, wherein the training samples include a third training sample, the third training sample including a first image sequence, a second image sequence, a positive sample query corresponding to the first image sequence, and an inverse sample query corresponding to the second image sequence, the first image sequence including a plurality of fourth images arranged in chronological order, the second image sequence including the plurality of fourth images arranged in reverse chronological order, and wherein determining the predicted response includes: determining, using the machine learning model, a third predicted response for the positive sample query based on the first image sequence; and determining, using the machine learning model, a fourth predicted response for the inverse sample query based on the second image sequence.

10. The method of claim 9, wherein the third training sample comprises a third ground-truth response to the positive sample query and a fourth ground-truth response to the inverse sample query, and wherein determining the third metric comprises: determining a first accuracy of the third predicted response relative to the third ground-truth response; determining a second accuracy of the fourth predicted response relative to the fourth ground-truth response; and determining the third metric based on the first accuracy and the second accuracy.

11. The method of claim 1, wherein the ground-truth response includes a ground-truth result for the sample query and a ground-truth logic derivation step corresponding to the ground-truth result, the ground-truth logic derivation step describing the process of deriving the ground-truth result from the sample visual information, and wherein determining the predicted response includes: determining, using the machine learning model, the predicted response based on the sample visual information, the predicted response including a predicted result and a predicted logic derivation step, the predicted logic derivation step describing the process of deriving the predicted result from the sample visual information.

12. The method of claim 11, further comprising: determining a third accuracy of the predicted result relative to the ground-truth result; and determining a matching degree of the predicted logic derivation step relative to the ground-truth logic derivation step; and wherein updating the parameters of the machine learning model includes: updating the parameters of the machine learning model based on at least one of the above, the third accuracy, and the matching degree to obtain the trained machine learning model.

13. The method of claim 1, wherein determining the predicted response comprises: performing feature encoding on the sample visual information to obtain a sample feature representation; and determining the predicted response using the machine learning model based on the sample feature representation.

14. The method of claim 1, further comprising: obtaining pre-training samples, the pre-training samples including pre-training visual information, pre-training queries related to the pre-training visual information, and pre-training responses to the pre-training queries; and pre-training a machine learning model based on the pre-training samples to obtain the machine learning model to be trained.

15. An apparatus for training a machine learning model, comprising: an acquisition module configured to acquire training samples, the training samples including sample visual information, sample queries related to the sample visual information, and ground-truth responses to the sample queries; a first determination module configured to use a machine learning model to be trained to determine a predicted response to the sample query based on the sample visual information; a second determination module configured to determine at least one of a first metric, a second metric, and a third metric based at least on the ground-truth response and the predicted response, wherein the first metric indicates the difference between the ground-truth response and the predicted response in a planar spatial dimension, the second metric indicates the difference between the ground-truth response and the predicted response in a depth spatial dimension, and the third metric indicates the difference between the ground-truth response and the predicted response in a time dimension; and an update module configured to update the parameters of the machine learning model based on the at least one of the first metric, the second metric, and the third metric to obtain a trained machine learning model.

16. An electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the electronic device to perform the method according to any one of claims 1 to 14.

17. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 14.

18. A computer program product tangibly stored in a computer storage medium and comprising computer-executable instructions that, when executed by a device, cause the device to perform the method according to any one of claims 1 to 14.