An organ autonomous scanning method based on a VLA model, a related method and device

In this VLA model-based autonomous scanning method, a trained VLA model drives an ultrasound robot, addressing the reliance of traditional ultrasound scanning on doctors' experience. The method enables efficient organ scanning, reduces the burden on doctors, and improves the utilization efficiency of medical resources.

CN122201722A · Pending · Publication Date: 2026-06-12 · 武汉库柏特科技股份有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
武汉库柏特科技股份有限公司
Filing Date
2026-03-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Traditional ultrasound scanning relies on doctors' experience, is time-consuming, and consumes medical resources, making it difficult to complete organ scans efficiently.

Method used

An autonomous scanning method based on a VLA model is constructed, in which the trained VLA model drives the ultrasound robot. A multimodal large model is first fine-tuned under supervision on ultrasound images, text descriptions, and quality scores; the resulting model serves as the initial model of the VLA model, which is then trained to complete scanning autonomously.

Benefits of technology

The method reduces the workload of ultrasound doctors, improves the utilization efficiency of medical resources, and enables autonomous scanning to be completed efficiently.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to an organ autonomous scanning method based on a VLA model, a related method and a device. The method comprises the following steps: constructing a first data set composed of an ultrasound image, a text description and a quality score, performing supervised fine-tuning training on a multi-modal large model to obtain a multi-modal basic model injected with medical ultrasound expert knowledge, taking an ultrasound image sequence, an end pose increment sequence and a contact force increment sequence of a current historical moment as a time observation sequence, taking an end pose increment and a contact force increment of a next historical moment as a supervision label, constructing a VLA model training sample pair as a second data set, taking the multi-modal basic model as an initial model of the VLA model, training based on the second data set to obtain a trained VLA model, and driving an ultrasound robot to complete autonomous scanning of an organ based on the time observation sequence of the current moment and the trained VLA model. The method can reduce the work pressure of an ultrasound doctor and effectively improve the utilization efficiency of medical resources.

Description

Technical Field

[0001] This invention relates to a method and apparatus for autonomous organ scanning based on a VLA model.

Background Technology

[0002] Ultrasound scanning is a routine medical examination technique that uses ultrasound waves to perform non-invasive imaging of internal organs. Ultrasound scans are greatly affected by individual differences; factors such as rib arch obstruction, body shape and fat layer thickness, respiratory fluctuations, and changes in probe contact can all directly alter the image appearance. In traditional ultrasound scan procedures, doctors need to repeatedly adjust between different entrance windows and scanning directions to gradually cover multiple areas and locate and confirm key structures. This process is often time-consuming and highly dependent on the doctor's experience and skill.

Summary of the Invention

[0003] To alleviate the workload of ultrasound physicians and effectively improve the utilization efficiency of medical resources, this invention provides a method, related methods, and apparatus for autonomous organ scanning based on a VLA model.

[0004] In a first aspect, the present invention provides a method for autonomous organ scanning based on a VLA model, the method comprising:

[0005] Based on the time observation sequence at the current moment, the trained VLA model is used to drive the ultrasound robot to complete the autonomous scanning of organs.

[0006] The trained VLA model is obtained by training in the following manner:

[0007] Obtain the first dataset: historical ultrasound images of organs are acquired together with the text description and quality score corresponding to each historical ultrasound image, and the first dataset consisting of ultrasound images, text descriptions, and quality scores is constructed.

[0008] Based on the first dataset, supervised fine-tuning training is performed on the multimodal large model to obtain a multimodal basic model infused with medical ultrasound expert knowledge;

[0009] Obtain the second dataset, which includes VLA model training sample pairs; the VLA model training sample pairs include time observation sequences and supervision labels, obtained through the following method:

[0010] Based on the ultrasound images, end pose and contact force datasets of the ultrasound robot at different historical moments obtained by the doctor operating the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose incremental dataset and contact force incremental dataset.

[0011] For any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively. The ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label to construct the VLA model training sample pair.

[0012] The multimodal base model is used as the initial model for the VLA model. Based on the second dataset, the VLA model is trained to obtain the trained VLA model.

[0013] In one or more optional embodiments of this application, based on the time observation sequence at the current moment, a trained VLA model is used to drive an ultrasound robot to autonomously scan organs, including:

[0014] The current time observation sequence is input into the trained VLA model, and the trained VLA model outputs the end-effector pose increment and contact force increment at the next time step; the current time observation sequence includes the current ultrasound image sequence, the end-effector pose increment sequence, and the contact force increment sequence.

[0015] Based on the end-effector pose increment and contact force increment at the next moment, the ultrasound robot is driven to perform autonomous scanning at the next moment, while the ultrasound image, end-effector pose increment, and contact force increment at that moment are collected to obtain the time observation sequence at the next moment.

[0016] Based on the time observation sequence of the next moment, the trained VLA model is used to iteratively output the new end pose increment and contact force increment of the next moment and drive the ultrasound robot to perform autonomous scanning of the next moment until the autonomous scanning of the organ is completed.
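The predict, act, and observe loop of paragraphs [0014] to [0016] can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `policy`, `execute`, and `is_done` are hypothetical callables standing in for the trained VLA model, the robot interface, and a completion check that the text does not specify; the fixed-length window and the `max_steps` guard are likewise assumptions.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Placeholder types chosen for illustration only.
Image = List[float]
Step = Tuple[Image, List[float], float]   # (image, pose increment, force increment)
Action = Tuple[List[float], float]        # (pose increment, force increment)


def run_scan(
    policy: Callable[[List[Step]], Action],
    execute: Callable[[Action], Step],
    initial_window: List[Step],
    is_done: Callable[[Step], bool],
    max_steps: int = 1000,
) -> List[Action]:
    """Iterate predict -> act -> observe until the scan is reported complete."""
    # The window keeps a fixed number of recent steps (assumed length).
    window: Deque[Step] = deque(initial_window, maxlen=len(initial_window))
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(list(window))   # increments predicted for the next moment
        new_step = execute(action)      # robot moves; new image and increments are recorded
        actions.append(action)
        window.append(new_step)         # the oldest step drops out of the window
        if is_done(new_step):
            break
    return actions
```

Passing the model and the robot as callables keeps the loop independent of any particular model runtime or robot driver, so it can be exercised with stubs before hardware is attached.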

[0017] In one or more optional embodiments of this application, based on the first dataset, a multimodal large model is subjected to supervised fine-tuning training to obtain a multimodal basic model infused with medical ultrasound expert knowledge, including:

[0018] The ultrasound images and corresponding question commands from the first dataset are input into the multimodal large model. Supervised fine-tuning training is performed using the following total loss function. When the predicted text output by the multimodal large model is consistent with the text description in the first dataset, and the output predicted score is consistent with the quality score in the first dataset, a multimodal basic model infused with medical ultrasound expert knowledge is obtained.

[0019] $$L_{total} = \lambda_{1} L_{text} + \lambda_{2} L_{score},$$

[0020] In the formula, $L_{total}$ represents the total loss of the multimodal large model, $L_{text}$ represents the loss based on the text description, $L_{score}$ represents the loss based on the quality score, and $\lambda_{1}$ and $\lambda_{2}$ represent the weighting coefficients of the text description loss and the quality score loss, respectively.

[0021] The questioning instructions include requesting the multimodal large model to provide a textual description of the ultrasound image and give a quality score.

[0022] In one or more optional embodiments of this application, the loss based on text description is calculated using the following formula:

[0023] $$L_{text} = -\sum_{t=1}^{T} \log P_{\theta}\left(w_{t} \mid w_{<t},\, I1^{(i)},\, c^{(i)}\right),$$

[0024] In the formula, $L_{text}$ represents the loss based on the text description; $w_{1}, \ldots, w_{T}$ represents the token sequence corresponding to the text description; $w_{<t}$ represents the historical tokens preceding the $t$-th token; $I1^{(i)}$ represents the $i$-th ultrasound image; $c^{(i)}$ represents the question instruction corresponding to the $i$-th ultrasound image; and $P_{\theta}$ represents the conditional probability that the model predicts for the next token.

[0025] In a second aspect, the present invention provides a method for training a VLA model for autonomous organ scanning, comprising:

[0026] Obtain the first dataset: historical ultrasound images of organs are acquired together with the text description and quality score corresponding to each historical ultrasound image, and the first dataset consisting of ultrasound images, text descriptions, and quality scores is constructed.

[0027] Based on the first dataset, supervised fine-tuning training is performed on the multimodal large model to obtain a multimodal basic model infused with medical ultrasound expert knowledge;

[0028] Obtain the second dataset, which includes VLA model training sample pairs; the VLA model training sample pairs include time observation sequences and supervision labels, obtained through the following method:

[0029] Based on the ultrasound images, end pose and contact force datasets of the ultrasound robot at different historical moments obtained by the doctor operating the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose incremental dataset and contact force incremental dataset.

[0030] For any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively. The ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label to construct the VLA model training sample pair.

[0031] The multimodal base model is used as the initial model for the VLA model. Based on the second dataset, the VLA model is trained to obtain the trained VLA model.

[0032] In a third aspect, the present invention provides an organ autonomous scanning device based on a VLA model, comprising:

[0033] The VLA model acquisition module is used to acquire a trained VLA model obtained using the VLA model training method for organ autonomous scanning described in the second aspect.

[0034] The autonomous scanning module is used to drive the ultrasound robot to perform autonomous scanning of organs based on the time observation sequence at the current moment and the trained VLA model.

[0035] In a fourth aspect, the present invention provides a VLA model training device for autonomous organ scanning, comprising:

[0036] The first dataset module is used to obtain the first dataset, which is obtained by acquiring historical ultrasound images of organs and corresponding text descriptions and quality scores for each historical ultrasound image, and constructing the first dataset consisting of ultrasound images, text descriptions, and quality scores.

[0037] The multimodal basic model module is used to perform supervised fine-tuning training on the large multimodal model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge.

[0038] The second dataset module is used to obtain a second dataset, which includes VLA model training sample pairs. The VLA model training sample pairs include time observation sequences and supervision labels, obtained through the following method:

[0039] Based on the ultrasound images, end pose and contact force datasets of the ultrasound robot at different historical moments obtained by the doctor operating the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose incremental dataset and contact force incremental dataset.

[0040] For any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively. The ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label to construct the VLA model training sample pair.

[0041] The VLA model training module is used to use the multimodal base model as the initial model of the VLA model, and to train the VLA model based on the second dataset to obtain the trained VLA model.

[0042] In a fifth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program / instruction stored thereon, which, when executed by a processor, implements the organ autonomous scanning method based on the VLA model as described in the first aspect above, and / or the VLA model training method for organ autonomous scanning as described in the second aspect.

[0043] In a sixth aspect, embodiments of the present invention provide a computer program product, including a computer program / instruction that, when executed by a processor, implements the organ autonomous scanning method based on the VLA model as described in the first aspect above, and / or the VLA model training method for organ autonomous scanning as described in the second aspect.

[0044] In a seventh aspect, embodiments of the present invention provide a computer device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the organ autonomous scanning method based on the VLA model as described in the first aspect above, and / or the VLA model training method for organ autonomous scanning as described in the second aspect.

[0045] The beneficial effects of the above-described technical solutions provided in the embodiments of the present invention include at least the following:

[0046] This invention provides an organ autonomous scanning method based on a VLA model. A first dataset consisting of ultrasound images, text descriptions, and quality scores is constructed, and a multimodal large model undergoes supervised fine-tuning on it to obtain a multimodal basic model infused with medical ultrasound expert knowledge. This model can understand anatomical structures and recognize cross-sections in ultrasound images. Using datasets of ultrasound images, end-effector poses, and contact forces at different historical moments, the end-effector poses and contact forces of the ultrasound robot at adjacent historical moments are converted into incremental form, and a time observation sequence is constructed from the ultrasound image sequence, the end-effector pose increment sequence, and the contact force increment sequence at the current historical moment. The end-effector pose increment and contact force increment at the next historical moment are used as supervision labels, yielding VLA model training sample pairs (time observation sequences and supervision labels) as a second dataset. The multimodal basic model is used as the initial model for the VLA model, and the VLA model is trained on the second dataset to obtain a trained VLA model. The trained VLA model learns the mapping from the time observation sequence at the current moment to the expert action (end-effector pose increment and contact force increment) at the next moment. Based on the time observation sequence at the current moment, the trained VLA model continuously predicts the end-effector pose increment and contact force increment at the next moment, i.e., the expert action at the next moment, thereby driving the ultrasound robot to continuously perform scanning operations until autonomous scanning is completed. The method of this invention does not require ultrasound doctors to perform continuous manual scanning operations, which saves scanning time, reduces the workload of ultrasound doctors, and effectively improves the utilization efficiency of medical resources.

Attached Figure Description

[0047] The accompanying drawings are provided to further illustrate the technical solutions of this application and constitute a part of the specification. They are used together with the embodiments of this application to explain the technical solutions of this application, and do not constitute a limitation on the technical solutions of this application. In the accompanying drawings:

[0048] Figure 1 is a schematic diagram of the VLA model training method for autonomous organ scanning provided in an embodiment of this application;

[0049] Figure 2 is a flowchart of the autonomous scanning algorithm based on the VLA model provided in an embodiment of this application;

[0050] Figure 3 is a schematic diagram of the organ autonomous scanning method based on the VLA model provided in an embodiment of this application;

[0051] Figure 4 is a schematic diagram of an ultrasound robot device provided in an embodiment of this application;

[0052] Figure 5 is a schematic diagram of a VLA model training device for autonomous organ scanning provided in an embodiment of this application;

[0053] Figure 6 is a schematic diagram of an organ autonomous scanning device based on the VLA model provided in an embodiment of this application.

Detailed Implementation

[0054] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. However, this description is exemplary and not restrictive, and those skilled in the art can implement this application in various forms without being limited to the embodiments set forth herein. These embodiments are provided to enable a more thorough understanding of this disclosure and to fully convey the scope of this disclosure to those skilled in the art.

[0055] The embodiments, features, and elements disclosed in this application can also be combined with any conventional features or elements to form a unique inventive scheme as defined by the claims. Any feature or element of any embodiment can also be combined with features or elements from other inventive schemes to form another unique inventive scheme as defined by the claims. Therefore, it should be understood that any feature shown and / or discussed in this application can be implemented individually or in any suitable combination.

[0056] To facilitate understanding of the technical solutions of the embodiments of this application, some terms or concepts involved in the embodiments of this application will be briefly described first.

[0057] 1. VLA Model: Vision-Language-Action model, an artificial intelligence model that maps visual perception and language instructions end-to-end into physical-world actions or control signals. It aims to enable machines (such as robots and autonomous vehicles) to "execute" by "seeing" and "understanding" as humans do, achieving intelligent and direct interaction with the physical environment.

[0058] 2. Multimodal large models: These are large-scale artificial intelligence models that can simultaneously understand and process multiple types of information (such as text, images, audio, video, etc.). By learning from massive amounts of cross-modal data, they establish a general understanding of the world and can complete a variety of cross-modal tasks, such as describing images, summarizing videos, and generating images from text.

[0059] 3. SFT: Supervised Fine-Tuning is a method that uses high-quality, manually labeled data to further train a pre-trained large model.

[0060] 4. LoRA parameters: Low-Rank Adaptation parameters, which are parameters of a pair of trainable low-rank matrices injected into the original neural network layer in the Low-Rank Adaptation fine-tuning technique.
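The low-rank matrix pair in definition 4 can be illustrated with a small numeric sketch. The names `lora_effective_weight`, `w`, `b`, `a`, and `scale` are hypothetical and chosen only for this example; the code shows the general LoRA idea (a frozen weight plus a trainable low-rank product), not the specific layers or ranks used in this application.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product for small illustrative matrices."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A).

    w: frozen weight of the original layer, shape (d_out, d_in)
    b: trainable matrix, shape (d_out, r)
    a: trainable matrix, shape (r, d_in), with rank r much smaller than d_out and d_in
    """
    delta = matmul(b, a)  # low-rank update with the same shape as w
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Rank-1 example: a 2x2 layer adapted with 2 + 2 trainable values instead of 4.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
adapted = lora_effective_weight(w, b, a)  # [[4.0, 4.0], [6.0, 9.0]]
```

If `b` is all zeros, the adapted weight equals the frozen weight, which is why LoRA fine-tuning commonly starts from a zero-initialized `b` so that training begins from the unmodified model.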

[0061] The inventors referenced existing methods for organ ultrasound scanning. These methods primarily rely on manual operation by doctors to perform organ ultrasound scanning. However, with the increasing demand for physical examinations, screenings, emergency assessments, and follow-up examinations, these time-consuming operations, which depend on doctors' experience and skills, have increased the burden on doctors and consumed limited medical resources. To address this issue, the inventors attempted to develop a method for autonomous organ scanning using an ultrasound robot or ultrasound acquisition device, based on widely used large-scale artificial intelligence models. They unexpectedly discovered that a first dataset composed of ultrasound images, text descriptions, and quality scores, used for supervised fine-tuning of a multimodal large-scale model, could yield a multimodal foundation model infused with medical ultrasound expert knowledge. Based on ultrasound images, end-effector poses, and contact force datasets from different historical moments, a VLA model training sample pair was constructed, including time observation sequences and supervisory labels (using the current historical moment's ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence as the time observation sequence, and the next historical moment's end-effector pose increment and contact force increment as the supervisory labels). This VLA model training sample pair was used as a second dataset. Using the aforementioned multimodal foundation model as the initial model for the VLA model, the VLA model was trained on the second dataset, resulting in a trained VLA model. This trained VLA model can drive an ultrasound robot to perform autonomous scanning of target organs, reducing the workload of ultrasound physicians and effectively improving the utilization efficiency of medical resources. 
Based on this, the present invention provides a method, related methods and apparatus for autonomous organ scanning based on VLA model.

[0062] Embodiment 1

[0063] Embodiment 1 of this invention provides a VLA model training method for autonomous organ scanning. As shown in Figure 1, the method includes the following steps:

[0064] S101: Obtain the first dataset, which is obtained by acquiring historical ultrasound images of organs and corresponding text descriptions and quality scores for each historical ultrasound image, and constructing the first dataset consisting of ultrasound images, text descriptions, and quality scores.

[0065] S102: Based on the first dataset, supervised fine-tuning training is performed on the multimodal large model to obtain a multimodal basic model infused with medical ultrasound expert knowledge.

[0066] S103: Obtain the second dataset, which includes VLA model training sample pairs; the VLA model training sample pairs include time observation sequences and supervision labels, obtained through the following method:

[0067] Based on the ultrasound images, end pose and contact force datasets of the ultrasound robot at different historical moments obtained by the doctor operating the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose incremental dataset and contact force incremental dataset.

[0068] For any historical moment, the ultrasound images, end-effector pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence of the current historical moment, respectively. The ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence. The end-effector pose increment and contact force increment of the next historical moment are taken as the supervision label to construct the VLA model training sample pair.

[0069] S104: Using the multimodal base model as the initial model for the VLA model, the VLA model is trained based on the second dataset to obtain the trained VLA model.

[0070] In this embodiment, step S101 uses an ultrasound robot or ultrasound acquisition device to collect a historical ultrasound image dataset I1 of the organ. The text description and quality score corresponding to each ultrasound image are obtained through description and scoring by a professional doctor, forming a text description dataset D1 and a quality score dataset Q1 corresponding to the ultrasound image dataset I1. Together, they constitute the first dataset (I1, D1, Q1) composed of ultrasound images, text descriptions, and quality scores. The text description is used to characterize the scanning site, visible anatomical structures, and cross-sectional features corresponding to the ultrasound image, while the quality score is used to characterize indicators such as image clarity, structural integrity, and usability.
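One way to hold the first dataset (I1, D1, Q1) in memory is as a list of aligned records, sketched below. The class name `UltrasoundSample`, the field names, and the use of file paths for images are assumptions made for illustration; the application does not prescribe a storage format or a score range.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class UltrasoundSample:
    image_path: str       # hypothetical: where the ultrasound image is stored
    description: str      # scanning site, visible anatomy, cross-sectional features
    quality_score: float  # expert score for clarity, integrity, usability


def build_first_dataset(
    image_paths: Sequence[str],
    descriptions: Sequence[str],
    scores: Sequence[float],
) -> List[UltrasoundSample]:
    """Zip I1, D1 and Q1 into per-image records, checking that they align."""
    if not (len(image_paths) == len(descriptions) == len(scores)):
        raise ValueError("I1, D1 and Q1 must contain the same number of entries")
    return [
        UltrasoundSample(path, text, float(score))
        for path, text, score in zip(image_paths, descriptions, scores)
    ]
```

Checking the lengths up front guards against an image being silently paired with another image's description or score.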

[0071] Ultrasound images are generally characterized by high noise levels and unstable texture details. Furthermore, the boundaries and morphological features of key anatomical structures are often not prominent enough, making it difficult to extract effective semantic information directly from raw ultrasound images. Multimodal large models are primarily trained on natural images and general visual semantics. Without the support of ultrasound domain knowledge, it is difficult for them to form a stable and transferable representational understanding of ultrasound images. Directly training a VLA model on ultrasound images can easily lead to comprehension biases and learning instability, so that action prediction fails to converge effectively and the requirements of autonomous scanning are not met. Therefore, in this embodiment, based on the first dataset (I1, D1, Q1), expert knowledge in the medical ultrasound domain is injected into the multimodal large model, enabling it to understand ultrasound images and providing a reliable visual semantic representation foundation for subsequent expert action learning and prediction.

[0072] In this embodiment, the multimodal basic model infused with medical ultrasound expert knowledge in step S102 can be obtained as follows:

[0073] The ultrasound images and corresponding question commands from the first dataset are input into the multimodal large model. Supervised fine-tuning training is performed using the following total loss function. When the predicted text output by the multimodal large model is consistent with the text description in the first dataset, and the output predicted score is consistent with the quality score in the first dataset, a multimodal basic model infused with medical ultrasound expert knowledge is obtained.

[0074] $$L_{total} = \lambda_{1} L_{text} + \lambda_{2} L_{score},$$

[0075] In the formula, $L_{total}$ represents the total loss of the multimodal large model, $L_{text}$ represents the loss based on the text description, $L_{score}$ represents the loss based on the quality score, and $\lambda_{1}$ and $\lambda_{2}$ represent the weighting coefficients of the text description loss and the quality score loss, respectively.

[0076] The questioning instructions include requesting the multimodal large model to provide a textual description of the ultrasound image and give a quality score.

[0077] In this embodiment, the first dataset can be specifically represented as:

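[0078] Ultrasound image dataset $I1 = \{I1^{(1)}, I1^{(2)}, \ldots, I1^{(N)}\}$, text description dataset $D1 = \{d1^{(1)}, d1^{(2)}, \ldots, d1^{(N)}\}$, and quality score dataset $Q1 = \{q1^{(1)}, q1^{(2)}, \ldots, q1^{(N)}\}$, where 1, 2, …, N represent historical moments.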

[0079] Sample pairs $(I1^{(i)}, d1^{(i)}, q1^{(i)})$ are constructed from the first dataset to form an expert knowledge injection training set. Each sample pair, together with its corresponding question instruction, is input into the multimodal large model, and supervised fine-tuning (SFT) is used to adapt the model to the domain. Each sample pair corresponds to a text instruction, such as: "You are an expert in ultrasound; please describe this ultrasound image, including the scanned area, current anatomical structure, and cross-sectional features, and provide a quality score." During SFT, the multimodal large model outputs a predicted text and a predicted score. The predicted text is compared with the text description, and the predicted score is compared with the quality score, to obtain the loss based on text description and the loss based on quality score. The two losses are optimized jointly through the total loss function, which enables the model to generate interpretable text while possessing stable quality judgment capabilities. The total loss function is minimized; that is, when the predicted text output by the multimodal large model is consistent with the text description in the first dataset, and the output predicted score is consistent with the quality score in the first dataset, a multimodal basic model infused with medical ultrasound expert knowledge is obtained. The loss based on text description can be calculated using the following autoregressive cross-entropy loss formula:

[0080] $$L_{text} = -\sum_{t=1}^{T} \log P_{\theta}\left(w_{t} \mid w_{<t},\, I1^{(i)},\, c^{(i)}\right),$$

[0081] In the formula, $L_{text}$ represents the loss based on the text description; $w_{1}, \ldots, w_{T}$ represents the token sequence corresponding to the text description; $w_{<t}$ represents the historical tokens preceding the $t$-th token; $I1^{(i)}$ represents the $i$-th ultrasound image; $c^{(i)}$ represents the question instruction corresponding to the $i$-th ultrasound image; and $P_{\theta}$ represents the conditional probability that the model predicts for the next token.
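For a single image and its reference description, an autoregressive cross-entropy loss reduces to the negative log-likelihood of the reference tokens. The sketch below assumes the per-token probabilities have already been read out of the model; the function name `text_loss` is hypothetical, and summing rather than averaging over tokens is an assumption, since the text does not state the normalization.

```python
import math
from typing import Sequence


def text_loss(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the reference description.

    token_probs[t] is the probability the model assigns to the t-th reference
    token, conditioned on the image, the question instruction, and all
    preceding reference tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# A two-token description predicted with probabilities 0.5 and 0.25:
loss = text_loss([0.5, 0.25])  # equals log(8), about 2.079
```

The loss is zero only when every reference token receives probability 1, and it grows as the model spreads probability onto other tokens.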

[0082] The loss based on quality scores can be calculated using the following formula for mean squared error loss:

[0083] $$L_{score} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{q}^{(i)} - q1^{(i)}\right)^{2},$$

[0084] In the formula, $L_{score}$ represents the loss based on the quality score; $\hat{q}^{(i)}$ represents the predicted score corresponding to the $i$-th ultrasound image; and $q1^{(i)}$ represents the quality score corresponding to the $i$-th ultrasound image. By calculating the difference between the predicted score and the quality score, the consistency between the model output score and the expert score is constrained.
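The mean squared error on quality scores and its combination with the text loss into the weighted total loss can be sketched as below. The function names and the default weights of 1.0 are assumptions for illustration; the application does not give values for the weighting coefficients.

```python
from typing import Sequence


def score_loss(predicted: Sequence[float], expert: Sequence[float]) -> float:
    """Mean squared error between predicted scores and expert quality scores."""
    if len(predicted) != len(expert) or len(predicted) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(predicted, expert)) / len(predicted)


def total_sft_loss(
    text_loss_value: float,
    score_loss_value: float,
    w_text: float = 1.0,   # assumed default; weight of the text description loss
    w_score: float = 1.0,  # assumed default; weight of the quality score loss
) -> float:
    """Weighted sum of the text description loss and the quality score loss."""
    return w_text * text_loss_value + w_score * score_loss_value


mse = score_loss([4.0, 2.0], [3.0, 4.0])                       # (1 + 4) / 2 = 2.5
total = total_sft_loss(1.0, mse, w_text=0.5, w_score=2.0)      # 0.5 + 5.0 = 5.5
```

Raising `w_score` relative to `w_text` pushes the fine-tuning toward agreement with expert scores at the expense of description fidelity, and vice versa.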

[0085] In clinical or standardized data acquisition scenarios, an ultrasound physician remotely controls an ultrasound robot to perform ultrasound scans while the data generated during the scan are recorded. In this embodiment, in step S103, the physician operates the ultrasound robot to obtain a time-series (different historical moments) ultrasound image dataset I, an ultrasound robot end-effector pose dataset T, and a contact force dataset F. The end-effector poses and contact forces at adjacent historical moments are converted into incremental form, giving the end-effector pose increment dataset ΔT and the contact force increment dataset ΔF. A sliding time window is used to construct VLA model training sample pairs: for any historical moment t, the ultrasound image sequence $\{I_{t-n+1}, \ldots, I_{t}\}$ of the current historical moment t and previous moments, the corresponding end-effector pose increment sequence $\{\Delta T_{t-n+1 \to t-n+2}, \ldots, \Delta T_{t-1 \to t}\}$, and the contact force increment sequence $\{\Delta F_{t-n+1 \to t-n+2}, \ldots, \Delta F_{t-1 \to t}\}$ together form the time observation sequence; the doctor's actual operational increment at the next moment, $(\Delta T_{t \to t+1}, \Delta F_{t \to t+1})$, serves as the supervision label. This forms the VLA model training sample pairs $(x_{t}, y_{t})$, which constitute the second dataset. Here $x_{t}$ is the time observation sequence input, i.e., the ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence at the current historical moment, and $y_{t}$ is the expert action output for the next historical moment corresponding to this time observation sequence, i.e., the end-effector pose increment and contact force increment at the next historical moment.
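The increment conversion and sliding-window pairing described in this paragraph can be sketched as follows. Two simplifications are assumed and are not stated in the application: poses and forces are treated as plain numeric vectors whose increment is an element-wise difference (a real pose increment may need proper handling of orientation), and moments are indexed from 0. The function names are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

Vector = List[float]
Observation = Tuple[List[Any], List[Vector], List[Vector]]  # x_t
Label = Tuple[Vector, Vector]                               # y_t


def to_increments(values: Sequence[Vector]) -> List[Vector]:
    """out[k] is the element-wise change from moment k to moment k + 1."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(values, values[1:])]


def build_pairs(
    images: Sequence[Any],
    poses: Sequence[Vector],
    forces: Sequence[Vector],
    n: int,
) -> List[Tuple[Observation, Label]]:
    """Build (x_t, y_t) pairs with a window of n images per observation."""
    if n < 1 or not (len(images) == len(poses) == len(forces)):
        raise ValueError("n must be >= 1 and all recordings must have equal length")
    d_pose = to_increments(poses)
    d_force = to_increments(forces)
    pairs: List[Tuple[Observation, Label]] = []
    # t needs n images of history and one following moment for the label.
    for t in range(n - 1, len(images) - 1):
        x: Observation = (
            list(images[t - n + 1 : t + 1]),  # I_{t-n+1} ... I_t
            d_pose[t - n + 1 : t],            # increments up to (t-1 -> t)
            d_force[t - n + 1 : t],
        )
        y: Label = (d_pose[t], d_force[t])    # increments (t -> t+1)
        pairs.append((x, y))
    return pairs


pairs = build_pairs(
    images=["i0", "i1", "i2", "i3"],
    poses=[[0.0], [1.0], [3.0], [6.0]],
    forces=[[5.0], [5.5], [5.0], [4.0]],
    n=2,
)
```

With four recorded moments and a window of two images, the sketch yields two pairs; each observation carries one fewer increment than images, matching the index ranges in the paragraph above.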

[0086] In this embodiment, the VLA model trained in step S104 can be obtained in the following way:

[0087] Based on the second dataset, LoRA parameters are inserted into the vision-language fusion layer and the action prediction layer of the VLA model, and the weights of the initial model that were learned from the text descriptions and quality scores are frozen. Only the LoRA parameters and the action prediction layer parameters are updated, and the VLA model is trained to obtain the trained VLA model.
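By way of non-limiting illustration, the idea of keeping a weight matrix frozen while learning a low-rank update can be sketched in plain Python as follows. The class name `LoRALinear`, the scaling factor `alpha / rank`, the constant initialization of `A`, and the 4×4 example layer are assumptions of this sketch; the zero initialization of `B` follows common LoRA practice so that the adapted layer initially behaves exactly like the frozen one.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0) -> None:
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                            # frozen, never modified
        self.scale = alpha / rank
        # ASSUMPTION: constant init for A keeps the sketch deterministic.
        self.A = [[0.01] * d_in for _ in range(rank)]   # trainable
        self.B = [[0.0] * rank for _ in range(d_out)]   # trainable, zero init

    def effective_weight(self) -> Matrix:
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameter_count(self) -> int:
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Hypothetical 4x4 layer with a rank-1 adapter.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
y_before = layer.forward(x)   # equals the frozen layer's output while B is zero
layer.B[0][0] = 1.0           # stands in for one training update of the adapter
y_after = layer.forward(x)
print(y_before, y_after, layer.trainable_parameter_count())
```

Here the adapter has 8 trainable values against 16 in the full matrix; the saving grows with layer size because the adapter scales with rank × (inputs + outputs) rather than inputs × outputs.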

[0088] The specific training process for the VLA model is as follows. The multimodal base model infused with medical ultrasound expert knowledge is used as the initial model for the VLA model. During training, low-rank adaptation (LoRA) parameters are inserted into the vision-language fusion layer and the action prediction layer of the VLA model. By introducing a pair of low-rank matrices, LoRA approximates the parameter updates required for these network layers, so that the new task can be learned with very few additional parameters. The weights of the initial model are frozen, which fully preserves the ultrasound image representation capability learned by the multimodal base model; only the LoRA parameters and the action prediction layer parameters are updated. This allows the model to efficiently adjust how it integrates cross-modal information at the vision-language fusion layer through the LoRA mechanism, while the action prediction layer is updated synchronously to directly optimize the action output, achieving end-to-end joint fine-tuning from multimodal input to action output. With the time observation sequence $x_t$ as input, the model predicts the action at the next moment, and the error between this prediction and the actual expert action label $y_t$ is used as the supervision signal. The VLA model is trained using a total loss function that is a weighted sum of the end-effector pose increment loss and the contact force increment loss, as follows:

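[0089] [Formula not reproduced in this text: total loss of the VLA model, a weighted sum of the end-effector pose increment loss and the contact force increment loss.]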

[0090] In the formula, the terms denote, in order: the total loss of the VLA model; the end-effector pose increment and contact force increment predicted by the model for the next time step; the actual end-effector pose increment and contact force increment at the next moment; and the weighting coefficients for the end-effector pose increment and the contact force increment, respectively.
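For illustration only, one form of the total loss that is consistent with the definitions above can be written as follows. The squared-error penalty and every symbol name used here ($\mathcal{L}_{\mathrm{VLA}}$, $\Delta\hat{T}$, $\Delta\hat{F}$, $\lambda_T$, $\lambda_F$) are assumptions of this sketch; the text states only that the total loss weights a pose increment loss and a contact force increment loss.

```latex
\[
\mathcal{L}_{\mathrm{VLA}}
  = \lambda_{T}\,\bigl\lVert \Delta\hat{T}_{t\to t+1} - \Delta T_{t\to t+1} \bigr\rVert_2^2
  + \lambda_{F}\,\bigl\lVert \Delta\hat{F}_{t\to t+1} - \Delta F_{t\to t+1} \bigr\rVert_2^2
\]
```

Under this assumed form, $\Delta\hat{T}_{t\to t+1}$ and $\Delta\hat{F}_{t\to t+1}$ are the model's predictions, $\Delta T_{t\to t+1}$ and $\Delta F_{t\to t+1}$ are the expert labels, and $\lambda_T$, $\lambda_F$ balance the two terms.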

[0091] By minimizing the above loss function, backpropagation is performed to update the LoRA parameters and the action prediction layer parameters, so that the model learns the end-to-end mapping from the time observation sequence to the expert's action at the next time step, thus yielding the trained VLA model.
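By way of non-limiting illustration, the following toy sketch shows gradient descent on a weighted pose-plus-force loss in which only the trainable parameters change while the frozen ones stay fixed. The scalar model, the squared-error loss, the class name `ToyActionHead`, the learning rate, and all numeric values are assumptions of this sketch and stand in for the far larger VLA model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Action = Tuple[float, float]  # (pose increment, force increment)


@dataclass
class ToyActionHead:
    """pred = (frozen + delta) * feature, separately for pose and force."""

    frozen_pose: float          # stands in for frozen base-model weights
    frozen_force: float
    delta_pose: float = 0.0     # stands in for trainable LoRA / action-layer parameters
    delta_force: float = 0.0

    def predict(self, feature: float) -> Action:
        return (
            (self.frozen_pose + self.delta_pose) * feature,
            (self.frozen_force + self.delta_force) * feature,
        )


def total_loss(pred: Action, label: Action, lam_pose: float, lam_force: float) -> float:
    # ASSUMPTION: squared-error terms; the source does not state the norm.
    return lam_pose * (pred[0] - label[0]) ** 2 + lam_force * (pred[1] - label[1]) ** 2


def train_step(head: ToyActionHead, feature: float, label: Action,
               lam_pose: float, lam_force: float, lr: float) -> float:
    """One gradient step on the trainable deltas; returns the pre-update loss."""
    pred = head.predict(feature)
    # d(loss)/d(delta) = 2 * lambda * (pred - label) * feature
    grad_pose = 2.0 * lam_pose * (pred[0] - label[0]) * feature
    grad_force = 2.0 * lam_force * (pred[1] - label[1]) * feature
    head.delta_pose -= lr * grad_pose
    head.delta_force -= lr * grad_force
    return total_loss(pred, label, lam_pose, lam_force)


head = ToyActionHead(frozen_pose=0.5, frozen_force=0.2)
losses: List[float] = [
    train_step(head, feature=1.0, label=(1.0, 0.5), lam_pose=1.0, lam_force=0.5, lr=0.1)
    for _ in range(50)
]
print(losses[0], losses[-1], head)
```

The recorded losses fall from step to step while `frozen_pose` and `frozen_force` keep their initial values, which is the behaviour the freezing strategy is meant to produce.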

[0092] As shown in Figure 2, a multimodal large model is first trained by supervised fine-tuning (SFT) on ultrasound images, text descriptions, and quality scores to obtain a multimodal basic model infused with medical ultrasound expert knowledge. Then, from the organ's historical ultrasound images, end-effector pose increments, and contact force increments obtained while the physician remotely operates the ultrasound robot, VLA model training sample pairs are constructed (the ultrasound image sequence, pose increment sequence, and force increment sequence in Figure 2). The multimodal basic model infused with medical ultrasound expert knowledge serves as the initial model of the VLA model (the backbone model in Figure 2), which learns the mapping "time observation sequence → expert action at the next moment" and thereby predicts the expert action at the next moment, $(\Delta T_{t \to t+1}, \Delta F_{t \to t+1})$.

[0093] Example 2

[0094] Embodiment 2 of this invention provides a method for autonomous organ scanning based on a VLA model. As shown in Figure 3, the method includes the following steps:

[0095] S201: Obtain the trained VLA model using the VLA model training method for autonomous organ scanning described in Example 1 above.

[0096] S202: Based on the time observation sequence at the current moment, the trained VLA model is used to drive the ultrasound robot to complete the autonomous scanning of the organ.

[0097] In this embodiment of the invention, for the specific training process of the trained VLA model in step S201 above, refer to the detailed description of the VLA model training method for autonomous organ scanning in Embodiment 1 above; it is not repeated here.

[0098] This invention provides a method for autonomous organ scanning based on a VLA model. First, a multimodal large-scale model is trained by injecting medical ultrasound knowledge, enabling it to understand the anatomical structures and recognize cross-sections in ultrasound images. Then, this multimodal base model is used as the initial model for the VLA model. Based on VLA training samples composed of time observation sequences and supervision labels, the VLA model is trained, allowing it to predict the end-effector pose increment and contact force increment at the next moment based on the current time observation sequence—that is, the expert action at the next moment. This prediction drives the ultrasound robot to continuously perform scanning operations until autonomous scanning is completed. This method eliminates the need for continuous manual scanning by ultrasound physicians, saving scanning time, reducing their workload, and effectively improving the utilization efficiency of medical resources.

[0099] In this embodiment, a vision-language-action (VLA) model is used to guide an ultrasound robot device to autonomously scan human organs. As shown in Figure 4, the ultrasound robot device includes a robotic arm and a computer connected to each other. The previously trained VLA model is deployed on the computer to drive the movement trajectory of the robotic arm and adjust the probe contact force, completing the autonomous scanning of human organs.

[0100] In this embodiment, the time observation sequence at the current moment in step S202 consists of the ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence at the current moment, each built from the data of the current moment and the preceding moments. The initial sequence can be obtained by rule-based search or by first executing a fixed straight-line path. The trained VLA model is deployed on the computer connected to the robotic arm, and the time observation sequence at the current moment is input into it. The trained VLA model continuously outputs the expert action for the next moment, driving the ultrasound robot (the end of the robotic arm) to complete the autonomous scanning of the organ.

[0101] In this embodiment, the autonomous scanning of organs in step S202 can be implemented through the following steps:

[0102] S2021: Input the current time observation sequence into the trained VLA model, and the trained VLA model outputs the end pose increment and contact force increment at the next time step.

[0103] S2022: Based on the end-effector pose increment and contact force increment at the next moment, drive the ultrasound robot to perform autonomous scanning at the next moment, and simultaneously collect the ultrasound image, end-effector pose increment, and contact force increment at the next moment to obtain the time observation sequence at the next moment.

[0104] S2023: Input the time observation sequence at the next moment into the trained VLA model, which outputs a new end-effector pose increment and contact force increment; drive the ultrasound robot accordingly, and repeat this process iteratively until the autonomous scanning of the organ is completed.
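By way of non-limiting illustration, steps S2021 to S2023 can be sketched as the closed loop below. The function `autonomous_scan`, the stub policy standing in for the trained VLA model, the stub robot, the completion test, and the `max_steps` safety cap are all hypothetical and are not part of the embodiment.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Observation = Tuple[str, float, float]  # (image id, pose increment, force increment)
Action = Tuple[float, float]            # (pose increment, force increment)


def autonomous_scan(
    policy: Callable[[List[Observation]], Action],
    step_robot: Callable[[Action], Observation],
    warmup: Sequence[Observation],
    is_complete: Callable[[List[Action]], bool],
    max_steps: int,
) -> List[Action]:
    """Run the S2021-S2023 loop and return the executed actions."""
    if not warmup:
        raise ValueError("an initial observation sequence is required")
    # Fixed-length window: appending a new observation drops the oldest one.
    window: Deque[Observation] = deque(warmup, maxlen=len(warmup))
    executed: List[Action] = []
    for _ in range(max_steps):             # safety cap, an assumption of this sketch
        if is_complete(executed):
            break
        action = policy(list(window))      # S2021: predict the next increments
        observation = step_robot(action)   # S2022: execute and collect new data
        window.append(observation)         # observation sequence at the next moment
        executed.append(action)            # S2023: iterate
    return executed


def stub_policy(window: List[Observation]) -> Action:
    # Hypothetical stand-in for the trained VLA model: constant advance, no force change.
    return (1.0, 0.0)


def make_stub_robot() -> Callable[[Action], Observation]:
    count = {"frames": 0}

    def step(action: Action) -> Observation:
        count["frames"] += 1
        return (f"frame_{count['frames']}", action[0], action[1])

    return step


warmup = [("warm_0", 0.0, 0.0), ("warm_1", 1.0, 0.0), ("warm_2", 1.0, 0.0)]
actions = autonomous_scan(
    stub_policy,
    make_stub_robot(),
    warmup,
    is_complete=lambda done: sum(a[0] for a in done) >= 5.0,  # made-up stop rule
    max_steps=100,
)
print(actions)
```

The fixed-length window keeps the model input the same size at every step, mirroring the sliding window used when the training samples were built.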

[0105] Steps S2021 to S2023 above, the autonomous organ scanning process, can be understood as follows. First, an observation sequence over a period of time is obtained by rule-based search or by executing a fixed straight-line path. This sequence is then input into the trained VLA model to obtain the expert action at the next moment, and the scan is executed. During execution, new time observation sequences are continuously obtained, and expert action prediction and scan execution are performed iteratively. Based on the end-effector pose trajectory over all moments, the cumulative scan coverage Cov is calculated by mapping the scanning path onto a discrete representation (such as a mesh or voxels) of the target anatomical region. When the coverage Cov reaches or exceeds the preset target coverage threshold $Cov_{th}$, the scan is deemed complete and the iteration stops.
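By way of non-limiting illustration, the coverage test can be sketched on a two-dimensional grid as follows. The function name `coverage`, the rule that the probe covers only the grid cell containing its position, the cell size, and the threshold value 0.6 are assumptions of this sketch; the embodiment leaves the discretization and the threshold open.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]


def coverage(
    trajectory: Iterable[Tuple[float, float]],
    target_cells: Set[Cell],
    cell_size: float,
) -> float:
    """Fraction of target cells visited by the end-effector trajectory.

    ASSUMPTION: the probe covers only the grid cell that contains its position.
    """
    if not target_cells:
        raise ValueError("target region must contain at least one cell")
    visited = {(int(x // cell_size), int(y // cell_size)) for x, y in trajectory}
    return len(visited & target_cells) / len(target_cells)


# Hypothetical target region: a 3 x 2 block of 10 mm cells.
target = {(i, j) for i in range(3) for j in range(2)}
path = [(1.0, 1.0), (12.0, 3.0), (25.0, 8.0), (25.0, 15.0), (40.0, 5.0)]
cov = coverage(path, target, cell_size=10.0)
cov_th = 0.6                      # made-up threshold
scan_complete = cov >= cov_th
print(cov, scan_complete)
```

In this example the last point lies outside the target region and does not count, so four of the six target cells are covered.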

[0106] Example 3

[0107] Based on the same inventive concept, embodiments of the present invention also provide a VLA model training device for autonomous organ scanning. As shown in Figure 5, the device includes:

[0108] The first dataset module 101 is used to obtain a first dataset, which is obtained by acquiring historical ultrasound images of organs and corresponding text descriptions and quality scores for each historical ultrasound image, and constructing a first dataset composed of ultrasound images, text descriptions, and quality scores.

[0109] The multimodal basic model module 102 is used to perform supervised fine-tuning training on the multimodal large model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge.

[0110] The second dataset module 103 is used to acquire a second dataset, which includes VLA model training sample pairs; the VLA model training sample pairs include time observation sequences and supervision labels, and are obtained in the following manner:

[0111] Based on the ultrasound images, end pose and contact force datasets of the ultrasound robot at different historical moments obtained by the doctor operating the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose incremental dataset and contact force incremental dataset.

[0112] For any historical moment, the ultrasound images, end-effector pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence of the current historical moment, respectively. The ultrasound image sequence, end-effector pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence. The end-effector pose increment and contact force increment of the next historical moment are taken as the supervision label to construct the VLA model training sample pair.

[0113] The VLA model training module 104 is used to take the multimodal base model as the initial model of the VLA model, and to train the VLA model based on the second dataset to obtain a trained VLA model.

[0114] Example 4

[0115] Based on the same inventive concept, embodiments of the present invention also provide an organ autonomous scanning device based on a VLA model. As shown in Figure 6, the device includes:

[0116] The VLA model acquisition module 201 is used to acquire a trained VLA model obtained using the VLA model training method for autonomous organ scanning described in Embodiment 1 above.

[0117] The autonomous scanning module 202 is used to drive the ultrasound robot to perform autonomous scanning of organs based on the time observation sequence at the current moment and the trained VLA model.

[0118] Example 5

[0119] Based on the same inventive concept, embodiments of the present invention provide a computer-readable storage medium storing a computer program / instruction thereon, which, when executed by a processor, implements the VLA model training method for autonomous organ scanning as described in Embodiment 1 above, and / or the VLA model-based autonomous organ scanning method described in Embodiment 2.

[0120] Example 6

[0121] Based on the same inventive concept, embodiments of the present invention provide a computer program product, including a computer program / instruction, which, when executed by a processor, implements the VLA model training method for autonomous organ scanning as described in Embodiment 1 above, and / or the VLA model-based autonomous organ scanning method described in Embodiment 2.

[0122] Example 7

[0123] Based on the same inventive concept, embodiments of the present invention provide a computer device, including a memory, a processor, and a computer program stored in the memory. The processor executes the computer program to implement the VLA model training method for autonomous organ scanning as described in Embodiment 1 above, and / or the VLA model-based autonomous organ scanning method described in Embodiment 2.

[0124] In this embodiment, when describing representative embodiments, the specification may have presented the method and / or process as a specific sequence of steps. However, the method or process should not be limited to the specific order of steps described herein, to the extent that the method or process does not depend on the specific order of steps described herein. Other sequences of steps are also possible, as understood by those skilled in the art. Therefore, the specific order of steps set forth in the specification should not be construed as a limitation of the claims. Furthermore, the claims for the method and / or process should not be limited to the steps performed in the written order, and those skilled in the art will readily understand that these orders can be varied and still remain within the spirit and scope of the embodiments of the invention.

[0125] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.

[0126] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. Those skilled in the art will understand that each block of the flowchart illustrations and / or block diagrams, as well as combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0127] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0128] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

Claims

1. A method for autonomous organ scanning based on a VLA model, characterized in that the method includes: based on the time observation sequence at the current moment, using a trained VLA model to drive an ultrasound robot to complete the autonomous scanning of an organ; wherein the trained VLA model is obtained by training in the following manner: obtaining a first dataset by acquiring historical ultrasound images of organs together with the text description and quality score corresponding to each historical ultrasound image, the first dataset consisting of ultrasound images, text descriptions, and quality scores; performing supervised fine-tuning training on a multimodal large model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge; obtaining a second dataset that includes VLA model training sample pairs, each comprising a time observation sequence and a supervision label, obtained as follows: based on the ultrasound image, end pose, and contact force datasets of the ultrasound robot at different historical moments, obtained while the doctor operates the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose increment dataset and contact force increment dataset; for any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively; the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label, to construct the VLA model training sample pair; and using the multimodal basic model as the initial model of the VLA model and training the VLA model based on the second dataset to obtain the trained VLA model.

2. The organ autonomous scanning method based on the VLA model as described in claim 1, characterized in that driving the ultrasound robot to complete the autonomous scanning of the organ, based on the time observation sequence at the current moment and using the trained VLA model, includes: inputting the time observation sequence at the current moment into the trained VLA model, the trained VLA model outputting the end pose increment and contact force increment at the next moment, wherein the time observation sequence at the current moment includes the ultrasound image sequence, the end pose increment sequence, and the contact force increment sequence at the current moment; based on the end pose increment and contact force increment at the next moment, driving the ultrasound robot to perform autonomous scanning at the next moment, and simultaneously collecting the ultrasound image, end pose increment, and contact force increment at the next moment to obtain the time observation sequence at the next moment; and, based on the time observation sequence at the next moment, using the trained VLA model to iteratively output a new end pose increment and contact force increment and drive the ultrasound robot to perform autonomous scanning, until the autonomous scanning of the organ is completed.

3. The organ autonomous scanning method based on the VLA model as described in claim 1 or 2, characterized in that performing supervised fine-tuning training on the multimodal large model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge includes: inputting the ultrasound images and the corresponding question instructions from the first dataset into the multimodal large model, and performing supervised fine-tuning training using the following total loss function; when the predicted text output by the multimodal large model is consistent with the text description in the first dataset, and the output predicted score is consistent with the quality score in the first dataset, the multimodal basic model infused with medical ultrasound expert knowledge is obtained. In the formula, the terms denote, in order: the total loss of the multimodal large model; the loss based on the text description; the loss based on the quality score; and the weighting coefficients for the loss based on the text description and the loss based on the quality score, respectively. The question instructions include requesting the multimodal large model to provide a text description of the ultrasound image and to give a quality score.

4. The organ autonomous scanning method based on the VLA model as described in claim 3, characterized in that the loss based on the text description is calculated using the following formula. In the formula, the terms denote, in order: the loss based on the text description; the sequence of lexical units corresponding to the text description; the historical lexical units preceding the t-th lexical unit; the i-th ultrasound image; the question instruction corresponding to the i-th ultrasound image; and the conditional probability that the model predicts for the next lexical unit.

5. A method for training a VLA model for autonomous organ scanning, characterized in that the method includes: obtaining a first dataset by acquiring historical ultrasound images of organs together with the text description and quality score corresponding to each historical ultrasound image, the first dataset consisting of ultrasound images, text descriptions, and quality scores; performing supervised fine-tuning training on a multimodal large model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge; obtaining a second dataset that includes VLA model training sample pairs, each comprising a time observation sequence and a supervision label, obtained as follows: based on the ultrasound image, end pose, and contact force datasets of the ultrasound robot at different historical moments, obtained while the doctor operates the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose increment dataset and contact force increment dataset; for any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively; the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label, to construct the VLA model training sample pair; and using the multimodal basic model as the initial model of the VLA model and training the VLA model based on the second dataset to obtain the trained VLA model.

6. An organ autonomous scanning device based on a VLA model, characterized in that the device includes: a VLA model acquisition module, used to acquire a trained VLA model obtained using the VLA model training method for autonomous organ scanning as described in claim 5; and an autonomous scanning module, used to drive the ultrasound robot to perform autonomous scanning of organs based on the time observation sequence at the current moment and the trained VLA model.

7. A VLA model training device for autonomous organ scanning, characterized in that the device includes: a first dataset module, used to obtain a first dataset by acquiring historical ultrasound images of organs together with the text description and quality score corresponding to each historical ultrasound image, the first dataset consisting of ultrasound images, text descriptions, and quality scores; a multimodal basic model module, used to perform supervised fine-tuning training on a multimodal large model based on the first dataset to obtain a multimodal basic model infused with medical ultrasound expert knowledge; a second dataset module, used to obtain a second dataset that includes VLA model training sample pairs, each comprising a time observation sequence and a supervision label, obtained as follows: based on the ultrasound image, end pose, and contact force datasets of the ultrasound robot at different historical moments, obtained while the doctor operates the ultrasound robot, the end pose and contact force of the ultrasound robot at adjacent historical moments are converted into incremental form to obtain the corresponding end pose increment dataset and contact force increment dataset; for any historical moment, the ultrasound images, end pose increments, and contact force increments of the current historical moment and previous moments are taken as the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment, respectively; the ultrasound image sequence, end pose increment sequence, and contact force increment sequence of the current historical moment are taken as the time observation sequence, and the end pose increment and contact force increment of the next historical moment are taken as the supervision label, to construct the VLA model training sample pair; and a VLA model training module, used to take the multimodal basic model as the initial model of the VLA model and to train the VLA model based on the second dataset to obtain the trained VLA model.

8. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when executed by a processor, the computer program / instructions implement the organ autonomous scanning method based on the VLA model as described in any one of claims 1-4, and / or the VLA model training method for organ autonomous scanning as described in claim 5.

9. A computer program product comprising a computer program / instructions, characterized in that, when executed by a processor, the computer program / instructions implement the organ autonomous scanning method based on the VLA model as described in any one of claims 1-4, and / or the VLA model training method for organ autonomous scanning as described in claim 5.

10. A computer device, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the organ autonomous scanning method based on the VLA model according to any one of claims 1-4, and / or the VLA model training method for organ autonomous scanning according to claim 5.