Large model-based object detection method and apparatus, and electronic device

By using multimodal data fusion and large model network processing, the problems of large errors and poor robustness in existing target detection methods are solved, achieving high-precision and efficient target detection.

WO2026138835A1 · PCT designated stage · Publication Date: 2026-07-02 · BEIJING XIAOYU INTELLISYS CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BEIJING XIAOYU INTELLISYS CO LTD
Filing Date: 2025-12-23
Publication Date: 2026-07-02

Application Information

IPC: G06V10/25

Abstract

The present application relates to the technical fields of object detection and welding, and provides a large model-based object detection method and apparatus, and an electronic device. The method comprises: acquiring multimodal object data, wherein the multimodal object data comprises first image data of an object and first point cloud data of the object; inputting the multimodal object data into a large model, wherein the large model comprises a first encoder; performing at least one of coordinate transformation and dimension transformation on object data of a first modality by means of the first encoder to obtain target transformed data of the first modality; fusing the target transformed data of the first modality and object data of a second modality by means of the first encoder to obtain fused data; and performing object detection on the basis of the fused data by means of a network other than the first encoder in the large model.

Description

Large-model-based target detection methods, devices, and electronic equipment

[0001] Cross-references to related applications

[0002] This disclosure is based on and claims priority to Chinese Patent Application No. 202411897852.5, filed on December 23, 2024, the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the fields of target detection and welding technology, and in particular to a target detection method, device, electronic device and storage medium based on a large model.

Background Technology

[0004] Most target detection methods in related technologies are only applicable to certain specific scenarios. Errors in actual applications (such as tooling errors) have a significant impact on detection accuracy, resulting in problems such as poor robustness, poor generalization, and low accuracy.

Summary of the Invention

[0005] This application aims to at least partially address one of the technical problems in the related art.

[0006] Therefore, the first objective of this application is to propose a target detection method based on a large model.

[0007] The second objective of this application is to propose a target detection device based on a large model.

[0008] The third objective of this application is to propose an electronic device.

[0009] The fourth objective of this application is to provide a computer-readable storage medium.

[0010] The fifth objective of this application is to provide a computer program product.

[0011] To achieve the above objectives, a first aspect of this application proposes a target detection method based on a large model, comprising: acquiring multimodal object data, wherein the multimodal object data includes first image data of the object and first point cloud data of the object; inputting the multimodal object data into a large model, wherein the large model includes a first encoder; performing at least one of coordinate transformation and dimension transformation on the first modality object data through the first encoder to obtain target transformation data of the first modality; fusing the target transformation data of the first modality with object data of a second modality through the first encoder to obtain fused data; and performing target detection based on the fused data through the remaining networks in the large model other than the first encoder.

[0012] To achieve the above objectives, a second aspect of this application proposes a target detection device based on a large model, comprising: an acquisition module for acquiring multimodal object data, wherein the multimodal object data includes first image data of the object and first point cloud data of the object; a processing module for inputting the multimodal object data into a large model, wherein the large model includes a first encoder; a transformation module for performing at least one of coordinate transformation and dimension transformation on the first modality object data through the first encoder to obtain target transformation data of the first modality; a fusion module for fusing the target transformation data of the first modality with object data of the second modality through the first encoder to obtain fused data; and a detection module for performing target detection based on the fused data through the remaining networks in the large model other than the first encoder.

[0013] To achieve the above objectives, a third aspect of this application provides an electronic device, including: a processor and a memory communicatively connected to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method described in the first aspect of the application above.

[0014] To achieve the above objectives, a fourth aspect of this application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, are used to implement the method described in the first aspect of the present application.

[0015] To achieve the above objectives, a fifth aspect of this application provides a computer program product including a computer program that, when executed by a processor, implements the method described in the first aspect of the above-described embodiment.

[0016] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Attached Figure Description

[0017] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0018] Figure 1 is a flowchart illustrating a target detection method based on a large model provided in an embodiment of this application;

[0019] Figure 2 is a flowchart illustrating another target detection method based on a large model provided in an embodiment of this application;

[0020] Figure 3 is a flowchart illustrating another target detection method based on a large model provided in an embodiment of this application;

[0021] Figure 4 is a schematic diagram of a large model provided in an embodiment of this application;

[0022] Figure 5 is a schematic diagram of a target detection method based on a large model provided in an embodiment of this application;

[0023] Figure 6 is a schematic flowchart of a model training method provided in an embodiment of this application;

[0024] Figure 7 is a schematic diagram of the structure of a target detection device based on a large model provided in an embodiment of this application.

Detailed Implementation

[0025] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.

[0026] The following description, with reference to the accompanying drawings, describes a target detection method, apparatus, electronic device, storage medium, and computer program product based on a large model according to embodiments of this application.

[0027] Figure 1 is a flowchart illustrating a target detection method based on a large model provided in an embodiment of this application.

[0028] Most target detection methods in related technologies are only applicable to certain specific scenarios. Errors in actual applications (such as tooling errors) have a significant impact on detection accuracy, resulting in problems such as poor robustness, poor generalization, and low accuracy.

[0029] To address this issue, this application provides a target detection method based on a large model. This method can use multimodal data for target detection, improving the robustness, generalization, and accuracy of target detection. The method utilizes a first encoder to perform at least one of the following processes on object data of a certain modality: coordinate transformation and dimension transformation, to obtain target transformation data of that modality. This enables multimodal data fusion, and the remaining networks of the large model perform target detection based on the fused data. This simplifies user operation, improves target detection efficiency, and optimizes the user experience in target detection scenarios. It is suitable for weld inspection scenarios.

[0030] As shown in Figure 1, the method includes the following steps:

[0031] S101, acquire multimodal object data, wherein the multimodal object data includes the object's first image data and the object's first point cloud data.

[0032] It should be noted that the objects are not overly limited. For example, in a weld inspection scenario, the object can include the workpiece; in an autonomous driving scenario, the object can include roads; and in a security monitoring scenario, the object can include factories, offices, etc. The first image data is also not overly limited; for example, it can include RGB images, depth images, thermal images, and multispectral images. In addition to the first image data and the first point cloud data, multimodal object data can also include object data from other modalities, such as radar data.

[0033] In some embodiments, acquiring multimodal object data may include acquiring first image data and first point cloud data via a 3D (three-dimensional) camera.

[0034] In some embodiments, acquiring multimodal object data may include acquiring first image data via an RGB camera or acquiring first point cloud data via a laser scanning device.

[0035] Most target detection methods in related technologies are only applicable to certain specific scenarios. Errors in actual applications (such as tooling errors) have a significant impact on detection accuracy, resulting in problems such as poor robustness, poor generalization, and low accuracy.

[0036] Taking weld inspection as an example, related technologies can extract weld information by finding planes or predefined standard arc surfaces in the workpiece point cloud and calculating the intersection of the two planes. This method requires predefined surface formulas or normal vector thresholds when calculating the surface, and is only applicable to specific scenarios, such as corner welds in steel structures. Furthermore, if the workpiece surface has tooling errors such as protrusions, depressions, or dirt, the accuracy of the weld inspection results will be low, meaning the method has poor generalization ability. Additionally, this method operates directly in the 3D point cloud, which is also inefficient, often taking tens of seconds.

[0037] In this application, multimodal data can be used for target detection, that is, at least image data and point cloud data are used for target detection. Image data carries semantic information and point cloud data carries spatial information. Target detection can be performed by comprehensively considering both semantic and spatial information, which is applicable to various target detection scenarios. In actual applications, errors have a smaller impact on detection accuracy, thus improving the robustness, generalization and accuracy of target detection.

[0038] Especially in weld inspection scenarios, this application can comprehensively consider semantic and spatial information for weld inspection, and is applicable to a variety of weld inspection scenarios. The error in actual application has little impact on the detection accuracy. For example, if there are tooling errors such as protrusions, depressions, or dirt on the workpiece surface, the detection accuracy of this application is still high, which improves the robustness, generalization and accuracy of target detection.

[0039] In addition, this application can take spatial information into account in weld inspection, which helps to distinguish welds in different spatial positions (such as horizontal, vertical, upright, and overhead) as well as multiple welds that adjoin one another.

[0040] S102, input multimodal object data into a large model, wherein the large model includes a first encoder.

[0041] It should be noted that a large model is a machine learning model with a very large number of parameters and high complexity. Training and storing such a model requires significant computational and storage resources, and often distributed computing and dedicated hardware acceleration. Large models have stronger generalization and expressive capabilities. Any large model in the related art can be used; no specific limitation is imposed here. For example, a Transformer model, which is a neural network model based on a self-attention mechanism, can be used.

[0042] In some embodiments, the first encoder is a patch embedding network.

[0043] Existing target detection methods are complex to operate and have low efficiency, resulting in a poor user experience in target detection scenarios. This application utilizes a large model for target detection, simplifying user operations, improving detection efficiency, and optimizing the user experience. Furthermore, with continuous data accumulation, the large model can be iteratively optimized, contributing to improved target detection accuracy.

[0044] Especially in weld inspection scenarios, this application only requires 500ms to identify a single weld, which improves weld inspection efficiency and optimizes the user experience in weld inspection scenarios.

[0045] S103, the object data of the first modality is processed by the first encoder through at least one of coordinate transformation and dimension transformation to obtain the target transformation data of the first modality.

[0046] It should be noted that multimodal data includes object data in both the first and second modalities. No specific restrictions are placed on the methods of coordinate and dimension transformation. The first and second modalities are different modalities. The coordinate system and dimensions of the target transformed data in the first modality are consistent with those of the object data in the second modality; that is, the coordinate system of the target transformed data in the first modality is consistent with the coordinate system of the object data in the second modality (i.e., the second coordinate system below), and the dimensions of the target transformed data in the first modality are consistent with those of the object data in the second modality.

[0047] In some embodiments, the object data of the first modality may include the first point cloud data in step S101, and the object data of the second modality may include the first image data in step S101, in which case the first modality includes the point cloud modality; or, the object data of the first modality may include the first image data in step S101, and the object data of the second modality may include the first point cloud data in step S101. This application does not limit the types of data included in the object data of the first modality and the object data of the second modality.

[0048] In the embodiments of this application, the object data of the first modality is processed by the first encoder through at least one of coordinate transformation and dimension transformation to obtain the target transformation data of the first modality, including the following possible implementation methods:

[0049] Method 1: In response to the fact that the first coordinate system of the object data of the first modality is consistent with the second coordinate system of the object data of the second modality, and the dimensions of the object data of the first modality are inconsistent with the dimensions of the object data of the second modality, the object data of the first modality is transformed by the first encoder to obtain the third transformed data, which is used as the target transformed data of the first modality, wherein the dimensions of the third transformed data are consistent with the dimensions of the object data of the second modality.

[0050] Therefore, when the coordinate systems of the object data of the two modalities are consistent, but their dimensions are inconsistent, the first encoder can be used to perform dimension transformation on the object data of the first modality to obtain the target transformation data of the first modality. At this time, only dimension transformation is required.

[0051] Method 2: In response to the first coordinate system of the object data of the first modality being inconsistent with the second coordinate system of the object data of the second modality, and the dimensions of the object data of the first modality being consistent with the dimensions of the object data of the second modality, the object data of the first modality is transformed by the first encoder to obtain the first transformed data in the second coordinate system, which is used as the target transformed data of the first modality.

[0052] Therefore, when the coordinate systems of the object data of the two modalities are inconsistent, but their dimensions are consistent, the first encoder can be used to perform coordinate transformation on the object data of the first modality to obtain the target transformation data of the first modality. At this time, only coordinate transformation is required.

[0053] Method 3: The first point cloud data is transformed by the first encoder to obtain the second point cloud data in the camera coordinate system. The second point cloud data is then transformed by the first encoder to obtain the third point cloud data, which serves as the target transformation data for the point cloud modality. The dimension of the third point cloud data is consistent with the dimension of the first image data.

[0054] It can be understood that the coordinate system of the image data is the camera coordinate system, whereas the first point cloud data is in a different coordinate system and also has different dimensions from the image data.

[0055] Therefore, the first encoder can be used to convert the point cloud data to the camera coordinate system, and then the converted point cloud data can be dimensionally transformed. That is, the first encoder is used to perform coordinate transformation and dimension transformation on the point cloud data to obtain the target transformation data of the point cloud modality, so that the coordinate system and dimension of the final point cloud data are consistent with the image data, so as to facilitate the subsequent fusion of point cloud data and image data.
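The choice among Methods 1 to 3 can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `required_transforms` and the string labels for coordinate systems are hypothetical, and the source does not say how the first encoder makes this decision internally.

```python
def required_transforms(first_frame, second_frame, first_shape, second_shape):
    """Return the transformations the first-modality data would need.

    first_frame / second_frame: hypothetical labels such as "world" or "camera".
    first_shape / second_shape: tensor shapes, e.g. (b, M, 3) and (b, c, H, W).
    """
    steps = []
    if first_frame != second_frame:
        # Coordinate systems differ: coordinate transformation (Methods 2 and 3).
        steps.append("coordinate")
    if tuple(first_shape) != tuple(second_shape):
        # Dimensions differ: dimension transformation (Methods 1 and 3).
        steps.append("dimension")
    return steps


# Point cloud in world coordinates versus an RGB image: both steps (Method 3).
steps = required_transforms("world", "camera", (1, 1000, 3), (1, 3, 480, 640))
```

The case in which neither step is needed is not described in the source, which always applies at least one transformation; the sketch simply returns an empty list there.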

[0056] In some embodiments, the coordinate system of the first point cloud data is the world coordinate system. The first encoder performs coordinate transformation on the first point cloud data to obtain the second point cloud data in the camera coordinate system. This can be achieved using the following formula: $P_{camera} = R P_{world} + T$

[0057] where $P_{camera}$ is the second point cloud data, a 3x1 vector representing the coordinates of a 3D point in the camera coordinate system; $P_{world}$ is the first point cloud data, a 3x1 vector representing the coordinates of the 3D point in the world coordinate system; $R$ is a 3x3 rotation matrix representing the rotation from the world coordinate system to the camera coordinate system; and $T$ is a 3x1 translation vector representing the translation from the world coordinate system to the camera coordinate system.
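As a minimal sketch of the world-to-camera formula in [0056], the snippet below applies the rotation and translation to a whole batch of points at once with NumPy. The function name `world_to_camera` and the example values of R and T are assumptions chosen for illustration; in practice R and T would come from camera calibration, which the source does not describe.

```python
import numpy as np


def world_to_camera(points_world, R, T):
    """Apply P_camera = R @ P_world + T to every point.

    points_world: array of shape [b, M, 3]; R: [3, 3]; T: [3].
    """
    # Points are stored as row vectors, so (R @ p) becomes p @ R.T.
    return points_world @ R.T + T


# Hypothetical extrinsics: rotate 90 degrees about z, then shift 1 unit along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])

points_world = np.array([[[1.0, 0.0, 2.0],
                          [0.0, 1.0, 2.0]]])  # shape [b=1, M=2, 3]
points_camera = world_to_camera(points_world, R, T)
```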

[0058] In some embodiments, the dimensions of the first image data are [b,c,H,W], where b is the batch size, i.e., the number of images processed at one time, c is the number of channels in the image, H is the height of the entire image, and W is the width of the entire image. For example, taking an RGB image as an example, an RGB image includes three channels: R (red), G (green), and B (blue), in which case c = 3.

[0059] The second point cloud data has dimensions [b, M, 3], where b is the batch size and M is the number of 3D points. 3D points can be represented by three-dimensional coordinates (such as x, y, z coordinates). That is, the second point cloud data includes the three-dimensional coordinates of M 3D points. For example, the second point cloud data includes three channels: x, y, and z.

[0060] The second point cloud data is transformed in dimension by the first encoder to obtain the third point cloud data. This may include filling the second point cloud data into a tensor of dimension [b,c,H,W] by the first encoder, and using the filled tensor of dimension [b,c,H,W] as the third point cloud data.

[0061] For example, the data from different channels of the second point cloud data are filled into different channels within a tensor of dimension [b,c,H,W].

[0062] For example, the second point cloud data is filled into a tensor of dimension [b,c,H,W] by the first encoder. This includes filling the j-th channel of the second point cloud data into the j-th channel of the tensor of dimension [b,c,H,W] by the first encoder. Here, j is a positive integer not greater than M.

[0063] For example, the second point cloud data is filled into a tensor of dimension [b,c,H,W] by the first encoder. This includes projecting the j-th 3D point onto the image plane by the first encoder to obtain the pixel position of the j-th 3D point, and filling the three-dimensional coordinates of the j-th 3D point into the pixel position of the j-th 3D point in the tensor of dimension [b,c,H,W] by the first encoder.

[0064] For example, the second point cloud data includes the x, y, and z coordinates of 3D point 1, and the pixel position of 3D point 1 is (H1, W1). The first encoder can fill the x coordinate of 3D point 1 into the data point of the first channel with the pixel position (H1, W1) in a tensor of dimension [b, c, H, W].

[0065] The first encoder can fill the y coordinate of 3D point 1 into the data point of the second channel at the pixel position (H1, W1) in the tensor of dimension [b, c, H, W].

[0066] The first encoder can fill the z coordinate of 3D point 1 into the data point of the third channel at the pixel position (H1, W1) in the tensor of dimension [b, c, H, W].

[0067] If no second point cloud data is available to fill a data point in the tensor of dimension [b, c, H, W], that data point can be filled with 0; that is, missing parts are filled with 0.
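The filling procedure in [0060] to [0067] can be sketched as follows. This is a sketch under assumptions, not the first encoder's actual implementation: the source only says that each 3D point is projected onto the image plane, so the pinhole projection with an intrinsic matrix `K`, the rounding to the nearest pixel, and the function name `fill_points_into_tensor` are all hypothetical choices, and c = 3 is assumed.

```python
import numpy as np


def fill_points_into_tensor(points_cam, K, H, W):
    """Scatter [b, M, 3] camera-frame points into a zero-filled [b, 3, H, W] tensor.

    Channels 0, 1, 2 hold x, y, z. K is an assumed 3x3 pinhole intrinsic matrix.
    """
    b = points_cam.shape[0]
    out = np.zeros((b, 3, H, W), dtype=points_cam.dtype)  # missing parts stay 0
    for n in range(b):
        pts = points_cam[n]
        pts = pts[pts[:, 2] > 0]                  # keep points in front of the camera
        uvw = pts @ K.T                           # homogeneous pixel coordinates
        col = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        row = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (col >= 0) & (col < W) & (row >= 0) & (row < H)
        # If several points land on the same pixel, only one of them is kept.
        out[n][:, row[inside], col[inside]] = pts[inside].T
    return out


# Hypothetical intrinsics: focal length 1, principal point at pixel (1, 1).
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
points_cam = np.array([[[0.0, 0.0, 2.0],     # projects to row 1, column 1
                        [1.0, -1.0, 1.0],    # projects to row 0, column 2
                        [5.0, 0.0, 1.0],     # falls outside the 3x3 image
                        [0.0, 0.0, -1.0]]])  # behind the camera
tensor = fill_points_into_tensor(points_cam, K, H=3, W=3)
```

Keeping a single point per pixel is a simplification; the source does not say how collisions between points that project to the same pixel are resolved.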

[0068] S104, the target transformation data of the first modality and the object data of the second modality are fused by the first encoder to obtain fused data.

[0069] It should be noted that no specific restrictions are placed on the methods of data fusion.

[0070] For example, multimodal object data includes object data of modality A, object data of modality B, and object data of modality C.

[0071] In one case, the object data of modality A can be processed by the first encoder using at least one of coordinate transformation and dimension transformation to obtain target transformation data of modality A. Similarly, the object data of modality B can be processed by the first encoder using at least one of coordinate transformation and dimension transformation to obtain target transformation data of modality B. Finally, the target transformation data of modality A, the target transformation data of modality B, and the object data of modality C can be fused using the first encoder to obtain fused data. It is understood that in this embodiment, modalities A and B are first modalities, and modality C is the second modality. The coordinate system and dimensions of the target transformation data of modality A are consistent with those of the object data of modality C, and the coordinate system and dimensions of the target transformation data of modality B are consistent with those of the object data of modality C.

[0072] Alternatively, the object data of modality A can be processed by the first encoder using at least one of coordinate transformation and dimension transformation to obtain target transformation data of modality A. Similarly, the object data of modality C can be processed by the first encoder using at least one of coordinate transformation and dimension transformation to obtain target transformation data of modality C. Finally, the target transformation data of modality A, the target transformation data of modality C, and the object data of modality B can be fused using the first encoder to obtain fused data. It is understood that in this embodiment, modalities A and C are first modalities, and modality B is the second modality. The coordinate system and dimensions of the target transformation data of modality A are consistent with those of the object data of modality B, and the coordinate system and dimensions of the target transformation data of modality C are consistent with those of the object data of modality B.

[0073] In the embodiments of this application, the target transformation data of the first modality and the object data of the second modality are fused by the first encoder, including the following possible implementation methods:

[0074] Method 1: Extract features from the first image data using the first encoder to obtain the first feature; extract features from the third point cloud data using the first encoder to obtain the second feature; and then stitch the first feature and the second feature together using the first encoder to obtain fused data.

[0075] Therefore, the first encoder can extract features from the first image data and the third point cloud data respectively to obtain the first feature and the second feature, and then stitch the first feature and the second feature together to achieve the fusion of image data and point cloud data.

[0076] It should be noted that the first feature is not overly limited; for example, it can include image features, semantic features, etc. It should be clarified that image features describe the basic visual elements in an image, and may include color features, texture features, shape features, and spatial relationship features, etc. Semantic features describe the image content, and may include the category, action, posture, and emotion of instances in the image, etc. The terms "instance," "entity," and "target" are interchangeable.

[0077] It should be noted that there are no strict limitations on the second feature; for example, it may include spatial features.

[0078] In some embodiments, feature extraction of the first image data by the first encoder to obtain the first feature includes segmenting the first image data by the first encoder to obtain the second image data of each of N image blocks, where N is a positive integer, and extracting features from the second image data of the i-th image block by the first encoder to obtain the first feature of the i-th image block, where i is a positive integer not greater than N.

[0079] In some embodiments, feature extraction of the third point cloud data by the first encoder to obtain the second feature includes segmenting the third point cloud data by the first encoder to obtain the fourth point cloud data of each of the N image blocks, and extracting the feature of the fourth point cloud data of the i-th image block by the first encoder to obtain the second feature of the i-th image block.

[0080] For example, the dimensions of the first image data and the third point cloud data are both [b,c,H,W], and the dimensions of the second image data and the fourth point cloud data are both [b,h,w,c×p1×p2]. Here, h is the number of image blocks segmented in the height direction, w is the number of image blocks segmented in the width direction, p1 is the height of the image block, and p2 is the width of the image block.
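The reshaping from [b, c, H, W] to [b, h, w, c×p1×p2] described in [0080] can be sketched with NumPy as below. It assumes non-overlapping image blocks and that H and W are exact multiples of p1 and p2; the function name `split_into_blocks` is hypothetical, and the ordering of values inside each block vector is a choice the source does not specify.

```python
import numpy as np


def split_into_blocks(x, p1, p2):
    """Rearrange [b, c, H, W] into [b, h, w, c*p1*p2] with h = H // p1, w = W // p2."""
    b, c, H, W = x.shape
    if H % p1 != 0 or W % p2 != 0:
        raise ValueError("H and W must be divisible by the block size")
    h, w = H // p1, W // p2
    x = x.reshape(b, c, h, p1, w, p2)      # split height and width into blocks
    x = x.transpose(0, 2, 4, 1, 3, 5)      # -> [b, h, w, c, p1, p2]
    return x.reshape(b, h, w, c * p1 * p2)  # flatten each block into one vector


# The same function applies to the first image data and the third point cloud data,
# because both have dimension [b, c, H, W].
x = np.arange(2 * 3 * 4 * 6).reshape(2, 3, 4, 6)
blocks = split_into_blocks(x, p1=2, p2=3)  # shape [2, 2, 2, 18]
```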

[0081] In some embodiments, the target detection method based on a large model further includes normalizing a first feature to update the first feature, and normalizing a second feature to update the second feature.

[0082] In some embodiments, the first feature and the second feature are stitched together by the first encoder to obtain fused data, including stitching together the first feature of the i-th image block and the second feature of the i-th image block by the first encoder to obtain fused data of the i-th image block.
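Method 1 of the fusion step (per-block feature extraction, normalization, and stitching) might look like the sketch below. The source does not specify the feature extractor or the normalization, so the linear projections `W_img` and `W_pc`, the per-vector mean and standard deviation normalization, and stitching along the last axis are assumptions made only for illustration.

```python
import numpy as np


def normalize(features, eps=1e-6):
    """Normalize each feature vector to zero mean and roughly unit variance (assumed scheme)."""
    mean = features.mean(axis=-1, keepdims=True)
    std = features.std(axis=-1, keepdims=True)
    return (features - mean) / (std + eps)


def fuse_blocks(image_blocks, cloud_blocks, W_img, W_pc):
    """Extract, normalize, and stitch the features of each image block.

    image_blocks, cloud_blocks: [b, h, w, patch_dim]; W_img, W_pc: [patch_dim, d].
    Returns fused data of shape [b, h, w, 2 * d].
    """
    first_feature = normalize(image_blocks @ W_img)    # first feature per block
    second_feature = normalize(cloud_blocks @ W_pc)    # second feature per block
    # Block i of the image is stitched with block i of the point cloud.
    return np.concatenate([first_feature, second_feature], axis=-1)


rng = np.random.default_rng(0)
b, h, w, patch_dim, d = 1, 2, 2, 12, 8
image_blocks = rng.normal(size=(b, h, w, patch_dim))
cloud_blocks = rng.normal(size=(b, h, w, patch_dim))
W_img = rng.normal(size=(patch_dim, d))  # hypothetical, untrained weights
W_pc = rng.normal(size=(patch_dim, d))
fused = fuse_blocks(image_blocks, cloud_blocks, W_img, W_pc)
```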

[0083] Method 2: The target transformation data of the first modality and the object data of the second modality are concatenated by the first encoder to obtain concatenated data, which is used as fused data.

[0084] Method 3: The target transformation data of the first modality and the object data of the second modality are concatenated by the first encoder to obtain concatenated data. Feature extraction is performed on the concatenated data by the first encoder to obtain fused data.
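Methods 2 and 3 can be sketched in the same way. The source does not state along which axis the data is concatenated or which feature extractor is applied afterwards; concatenation along the channel axis and a per-pixel linear map over channels are assumptions here, and the function names and weights are hypothetical.

```python
import numpy as np


def concatenate_modalities(image, cloud_tensor):
    """Method 2: join two aligned [b, c, H, W] tensors along the channel axis (assumed)."""
    if image.shape[0] != cloud_tensor.shape[0] or image.shape[2:] != cloud_tensor.shape[2:]:
        raise ValueError("batch size and spatial size must match before concatenation")
    return np.concatenate([image, cloud_tensor], axis=1)


def extract_features(concatenated, weights):
    """Method 3: a stand-in feature extractor, a linear map over channels at each pixel."""
    return np.einsum("oc,bchw->bohw", weights, concatenated)


image = np.ones((1, 3, 4, 4))          # first image data
cloud_tensor = np.zeros((1, 3, 4, 4))  # third point cloud data, same dimension
concatenated = concatenate_modalities(image, cloud_tensor)  # Method 2 output, [1, 6, 4, 4]
weights = np.ones((8, 6))              # hypothetical weights: 6 channels -> 8 features
fused = extract_features(concatenated, weights)             # Method 3 output, [1, 8, 4, 4]
```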

[0085] S105, perform target detection based on the fused data through the remaining networks in the large model other than the first encoder.

[0086] It should be noted that the target detection results are not subject to many limitations. For example, they can include the segmentation mask, classification results, and bounding boxes corresponding to the first image data. The segmentation mask refers to the segmentation mask of instances in the image. The terms segmentation mask, mask, etc., are interchangeable. There are no major limitations on the segmentation mask; it can include arrays, multi-valued images (such as binary images), grayscale images, etc. The classification results are also not subject to many limitations; they can include the category of instances in the image, the category of pixels in the image, and the probability that a pixel belongs to the target category. The bounding box refers to the bounding box of instances in the image.

[0087] In some embodiments, taking a weld seam detection scenario as an example, the target detection result is the weld seam detection result, which includes the segmentation mask of the weld seam in the image, whether a pixel in the image is a weld seam, the probability that a pixel belongs to a weld seam, the bounding box of the weld seam in the image, the weld seam trajectory, etc.

[0088] In some embodiments, the target detection method based on a large model further includes route planning for the welding robot based on weld seam detection results. For example, route planning for the welding robot can be performed based on the weld seam trajectory. Thus, in weld seam detection scenarios, route planning for the welding robot can also be performed based on weld seam detection results to guide the robot to perform precise welding, which is suitable for intelligent robotic welding scenarios.

[0089] In summary, the target detection method based on a large model according to the embodiments of this application acquires multimodal object data, which includes first image data and first point cloud data of the object. The multimodal object data is input into a large model, which includes a first encoder. The first encoder performs at least one of coordinate transformation and dimension transformation on the first modality object data to obtain first modality target transformation data. The first encoder then fuses the first modality target transformation data with second modality object data to obtain fused data. Target detection is then performed based on the fused data using the remaining networks in the large model other than the first encoder. Therefore, multimodal data can be used for target detection, improving the robustness, generalization, and accuracy of target detection. The first encoder can be used to perform at least one of coordinate transformation and dimension transformation on object data of a certain modality to obtain target transformation data of that modality, thereby achieving multimodal data fusion. Target detection is then performed based on the fused data using the remaining networks of the large model, simplifying user operation, improving target detection efficiency, and optimizing the user experience in target detection scenarios. This method is suitable for weld inspection scenarios.

[0090] This application provides another target detection method based on a large model.

[0091] As shown in Figure 2, the method may include the following steps:

[0092] S201, acquire multimodal object data, wherein the multimodal object data includes the object's first image data and the object's first point cloud data.

[0093] S202, inputting multimodal object data into a large model, wherein the large model includes a first encoder.

[0094] For details regarding steps S201-S202, please refer to the above embodiments, which will not be repeated here.

[0095] S203, in response to the inconsistency between the first coordinate system of the object data in the first mode and the second coordinate system of the object data in the second mode, the object data in the first mode is transformed by the first encoder to obtain the first transformed data in the second coordinate system.

[0096] S204, in response to the inconsistency between the dimension of the first transformed data and the dimension of the object data of the second modality, the first transformed data is transformed by the first encoder to obtain the second transformed data, which is used as the target transformed data of the first modality, wherein the dimension of the second transformed data is consistent with the dimension of the object data of the second modality.

[0097] For example, multimodal object data includes object data of mode A and object data of mode B. Mode A is the first mode, and mode B is the second mode.

[0098] Since the first coordinate system of the object data in mode A is inconsistent with the second coordinate system of the object data in mode B, the object data in mode A is transformed by the first encoder to obtain the first transformed data in the second coordinate system.

[0099] In response to the inconsistency between the dimension of the first transformed data and the dimension of the object data of modality B, the first encoder performs a dimension transformation on the first transformed data to obtain the second transformed data, which serves as the target transformed data for modality A. The dimension of the second transformed data is consistent with the dimension of the object data of modality B.

[0100] In some embodiments, performing dimension transformation on the first transformed data by the first encoder to obtain the second transformed data includes: in response to the dimension of the first transformed data being lower than the dimension of the object data of the second modality, performing dimensionality increase on the first transformed data by the first encoder. It should be noted that the dimensionality increase method is not particularly limited; for example, it may include feature construction, polynomial features, encoding, etc.

[0101] In some embodiments, performing dimension transformation on the first transformed data by the first encoder to obtain the second transformed data includes: in response to the dimension of the first transformed data being higher than the dimension of the object data of the second modality, performing dimensionality reduction on the first transformed data by the first encoder. It should be noted that the dimensionality reduction method is not particularly limited; for example, it may include PCA (Principal Component Analysis), Laplacian eigenmaps, etc.
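As an illustration of the PCA option named above, the following sketch reduces data of shape [n, d] to [n, k] by projecting onto the top k principal components. The function name `pca_reduce` and the use of a singular value decomposition are choices made for this example only; the application does not state how PCA would be carried out inside the first encoder.

```python
import numpy as np

def pca_reduce(x, k):
    """Project the rows of x [n, d] onto the top-k principal components -> [n, k]."""
    if not 1 <= k <= min(x.shape):
        raise ValueError("k must be between 1 and min(n, d)")
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 6))
reduced = pca_reduce(data, k=3)
print(reduced.shape)  # (50, 3)
```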

[0102] S205, in response to the fact that the dimension of the first transformation data is consistent with the dimension of the object data of the second modality, the first transformation data is used as the target transformation data of the first modality by the first encoder.

[0103] For example, multimodal object data includes object data of mode A and object data of mode B. Mode A is the first mode, and mode B is the second mode.

[0104] Since the first coordinate system of the object data in mode A is inconsistent with the second coordinate system of the object data in mode B, the object data in mode A is transformed by the first encoder to obtain the first transformed data in the second coordinate system.

[0105] In response to the fact that the dimension of the first transformation data is consistent with the dimension of the object data of mode B, the first transformation data is used as the target transformation data of mode A by the first encoder.

[0106] S206, the target transformation data of the first mode and the object data of the second mode are fused by the first encoder to obtain fused data.

[0107] S207, target detection is performed based on the fused data by the remaining networks in the large model other than the first encoder.

[0108] For details regarding steps S206-S207, please refer to the above embodiments; they will not be repeated here.

[0109] In summary, according to the target detection method based on a large model according to the embodiments of this application, when the coordinate systems of the object data of the two modalities are inconsistent, the first encoder can be used to perform coordinate transformation on the object data of the first modality to obtain the first transformed data. When the dimensions of the first transformed data and the object data of the second modality are consistent, the first transformed data can be directly used as the target transformed data of the first modality. Alternatively, when the dimensions of the first transformed data and the object data of the second modality are inconsistent, the first encoder can be used to perform dimension transformation on the first transformed data to obtain the target transformed data of the first modality.
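The order of operations in steps S203 to S205 (coordinate transformation first, dimension transformation only when the dimensions still differ) can be sketched as follows. The sketch assumes that the coordinate transformation is a rigid transform given by a rotation matrix and a translation vector, and that "dimension" means the width of the last axis; `pad_or_truncate` is a hypothetical stand-in for whatever dimension transformation the first encoder applies.

```python
import numpy as np

def to_second_frame(points, rotation, translation):
    """Map [n, 3] points from the first coordinate system into the second."""
    return points @ rotation.T + translation

def pad_or_truncate(x, target_dim):
    """Hypothetical dimension transformation: zero-pad or cut the last axis."""
    n, d = x.shape
    if d >= target_dim:
        return x[:, :target_dim]
    return np.hstack([x, np.zeros((n, target_dim - d), dtype=x.dtype)])

def target_transform(points, rotation, translation, target_dim, change_dim):
    """Return the target transformed data of the first modality."""
    first_transformed = to_second_frame(points, rotation, translation)   # S203
    if first_transformed.shape[-1] == target_dim:
        return first_transformed                                          # S205
    return change_dim(first_transformed, target_dim)                      # S204

# Rotation by 90 degrees about the z axis, then a shift along x.
rotation = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
translation = np.array([1.0, 0.0, 0.0])
cloud = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 3.0]])
print(target_transform(cloud, rotation, translation, 3, pad_or_truncate).shape)
print(target_transform(cloud, rotation, translation, 5, pad_or_truncate).shape)
```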

[0110] This application provides another target detection method based on a large model.

[0111] As shown in Figure 3, the method may include the following steps:

[0112] S301, acquire multimodal object data, wherein the multimodal object data includes the object's first image data and the object's first point cloud data.

[0113] S302, input multimodal object data into a large model, wherein the large model includes a first encoder.

[0114] For details regarding steps S301-S302, please refer to the above embodiments, which will not be repeated here.

[0115] S303, in response to the inconsistency between the dimensions of the object data in the first modality and the object data in the second modality, the object data in the first modality is dimensionally transformed by the first encoder to obtain third transformed data, wherein the dimensions of the third transformed data are consistent with the dimensions of the object data in the second modality.

[0116] S304, in response to the inconsistency between the third coordinate system of the third transformation data and the second coordinate system of the object data of the second mode, the third transformation data is transformed by the first encoder to obtain the fourth transformation data in the second coordinate system, which is used as the target transformation data of the first mode.

[0117] For example, multimodal object data includes object data of mode A and object data of mode B. Mode A is the first mode, and mode B is the second mode.

[0118] In response to the inconsistency between the dimensions of the object data in modality A and the object data in modality B, the object data in modality A is dimensionally transformed by the first encoder to obtain the third transformed data, wherein the dimensions of the third transformed data are consistent with the dimensions of the object data in modality B.

[0119] Since the third coordinate system of the third transformation data is inconsistent with the second coordinate system of the object data of mode B, the third transformation data is transformed by the first encoder to obtain the fourth transformation data in the second coordinate system, which is used as the target transformation data of mode A.

[0120] It should be noted that the relevant content on coordinate transformation and dimension transformation can be found in the above embodiments, and will not be repeated here.

[0121] S305, in response to the third coordinate system being consistent with the second coordinate system, the third transformation data is used as the target transformation data of the first mode by the first encoder.

[0122] For example, multimodal object data includes object data of mode A and object data of mode B. Mode A is the first mode, and mode B is the second mode.

[0123] In response to the inconsistency between the dimensions of the object data in modality A and the object data in modality B, the object data in modality A is transformed by the first encoder to obtain the third transformed data, wherein the dimensions of the third transformed data are consistent with the dimensions of the object data in modality B.

[0124] The third coordinate system of the third transformation data is consistent with the second coordinate system of the object data of mode B, and the third transformation data is used as the target transformation data of mode A by the first encoder.

[0125] S306, the target transformation data of the first mode and the object data of the second mode are fused by the first encoder to obtain fused data.

[0126] S307, target detection is performed based on the fused data by the remaining networks in the large model other than the first encoder.

[0127] For details regarding steps S306-S307, please refer to the above embodiments; they will not be repeated here.

[0128] In summary, according to the target detection method based on a large model according to the embodiments of this application, when the dimensions of the object data of the two modalities are inconsistent, the first encoder can be used to perform dimension transformation on the object data of the first modality to obtain the third transformed data. When the coordinate systems of the third transformed data and the object data of the second modality are consistent, the third transformed data can be directly used as the target transformed data of the first modality. Alternatively, when the coordinate systems of the third transformed data and the object data of the second modality are inconsistent, the first encoder can be used to perform coordinate transformation on the third transformed data to obtain the target transformed data of the first modality.

[0129] Based on any of the above embodiments, as shown in Figure 4, the large model also includes a feature extraction network, a decoder, a second encoder, and an MLP (Multilayer Perceptron).

[0130] It should be noted that the feature extraction network is not subject to many limitations. For example, as shown in Figure 4, the feature extraction network includes a backbone network, a position encoder, and a third encoder. In some embodiments, the backbone network is a ResNet50, which is a deep convolutional neural network. The position encoder is a position embeddings network, and the third encoder is a 6-layer encoding network, which can be composed of 6 encoders connected in series.

[0131] In some embodiments, the decoder is a 9-layer decoding network, for example, it may be composed of 9 decoders connected in series.

[0132] In some embodiments, the second encoder is a pixel embedding network.

[0133] This application provides another target detection method based on a large model.

[0134] As shown in Figure 5, the method may include the following steps:

[0135] S501, acquire multimodal object data, wherein the multimodal object data includes the object's first image data and the object's first point cloud data.

[0136] S502, input multimodal object data into a large model, wherein the large model includes a first encoder.

[0137] S503, the object data of the first modality is processed by the first encoder through at least one of coordinate transformation and dimension transformation to obtain the target transformation data of the first modality.

[0138] S504, the first encoder fuses the target transformation data of the first mode with the object data of the second mode to obtain fused data.

[0139] For details regarding steps S501-S504, please refer to the above embodiments; they will not be repeated here.

[0140] S505, feature extraction is performed on the fused data by the feature extraction network to obtain a third feature.

[0141] In some embodiments, feature extraction is performed on the fused data through a feature extraction network to obtain a third feature, including multi-scale feature extraction of the fused data through a feature extraction network to obtain multi-scale features as the third feature.

[0142] In some embodiments, the feature extraction network includes a backbone network, a position encoder, and a third encoder. The feature extraction network extracts features from the fused data to obtain a third feature, including extracting features from the fused data through the backbone network to obtain a sixth feature, encoding the sixth feature at its position using the position encoder to obtain position-encoded data of the sixth feature, and then extracting features from the sixth feature and the position-encoded data using the third encoder to obtain the third feature. Thus, the position encoder can be used to obtain the position-encoded data of the sixth feature, allowing the third encoder to understand the positional information of the elements contained in the sixth feature, thereby assisting the third encoder in feature extraction and improving the accuracy of feature extraction.

[0143] S506, target detection is performed by the decoder based on the third feature to obtain a fourth feature.

[0144] S507, the second encoder extracts features from the third feature and the fused data to obtain the fifth feature.

[0145] S508, the segmentation mask corresponding to the first image data is obtained by a multilayer perceptron based on the fourth and fifth features.

[0146] In some embodiments, a segmentation mask corresponding to the first image data is obtained by using a multilayer perceptron based on the fourth and fifth features. This includes extracting features from the fourth feature using the multilayer perceptron to obtain the seventh feature, and calculating the dot product between the fifth and seventh features using the multilayer perceptron as the segmentation mask corresponding to the first image data.

[0147] In some embodiments, the segmentation mask corresponding to the first image data is obtained by a multilayer perceptron based on the fourth and fifth features, which can be achieved by the following formula:

[0148] mask_embed=self.mask_embed(decoder_output)

[0149] outputs_mask=torch.einsum("bqc,bchw->bqhw",mask_embed,mask_features)

[0150] Here, decoder_output is the output data of the decoder, i.e. the fourth feature, self.mask_embed(·) is the function corresponding to the MLP, mask_embed is the seventh feature, mask_features is the fifth feature, and outputs_mask is the segmentation mask.

[0151] torch.einsum(·) is a function that evaluates a tensor expression written in Einstein summation notation. bqc indicates that the mask_embed tensor has three dimensions: batch size (b), number of categories (q), and number of channels (c). bchw indicates that the mask_features tensor has four dimensions: batch size (b), number of channels (c), height (h), and width (w). bqhw indicates that the output tensor has four dimensions: batch size (b), number of categories (q), height (h), and width (w).

[0152] In this embodiment, the torch.einsum(·) function calculates the dot product between mask_embed and mask_features over the channel dimension c, which appears in both inputs but not in the output, to obtain a tensor outputs_mask with dimensions (b,q,h,w).
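The contraction can be checked with a small example. The sketch below uses NumPy's `einsum`, which accepts the same subscript string as `torch.einsum`, as a dependency-light stand-in; the tensor sizes and random values are arbitrary and chosen only for illustration.

```python
import numpy as np

b, q, c, h, w = 2, 5, 8, 4, 6
rng = np.random.default_rng(0)
mask_embed = rng.standard_normal((b, q, c))        # stands in for the seventh feature
mask_features = rng.standard_normal((b, c, h, w))  # stands in for the fifth feature

# Sum over the shared channel index c for every (batch, q, pixel) combination.
outputs_mask = np.einsum("bqc,bchw->bqhw", mask_embed, mask_features)

# The same contraction written as a batched matrix product over the channel axis.
reference = (mask_embed @ mask_features.reshape(b, c, h * w)).reshape(b, q, h, w)

print(outputs_mask.shape)                    # (2, 5, 4, 6)
print(np.allclose(outputs_mask, reference))  # True
```

Each value `outputs_mask[n, k, i, j]` is therefore the dot product of the c-dimensional vector `mask_embed[n, k]` with the c-dimensional vector `mask_features[n, :, i, j]`.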

[0153] S509, the classification result corresponding to the first image data is obtained by a multilayer perceptron based on the fourth feature.

[0154] S510, the bounding box corresponding to the first image data is obtained by a multilayer perceptron based on the fourth feature.

[0155] It should be noted that there are no strict restrictions on the execution order of steps S508-S510; for example, they can be executed sequentially or in parallel.

[0156] For example, classification results, segmentation masks, and bounding boxes can be obtained separately using different multilayer perceptrons. Therefore, image classification, bounding box generation, and mask generation tasks can be performed using different multilayer perceptrons, which is more efficient than using the same network for all three tasks in related technologies.

[0157] In some embodiments, as shown in Figure 4, the segmentation mask corresponding to the first image data can be obtained by multilayer perceptron 1 based on the fourth and fifth features, the classification result corresponding to the first image data can be obtained by multilayer perceptron 2 based on the fourth feature, and the bounding box corresponding to the first image data can be obtained by multilayer perceptron 3 based on the fourth feature.

[0158] In summary, the target detection method based on a large model according to the embodiments of this application can improve the accuracy of the segmentation mask by comprehensively considering the fourth and fifth features through a multilayer perceptron, and perform image classification and bounding box generation by considering the fourth feature through a multilayer perceptron.

[0159] This application provides a model training method.

[0160] As shown in Figure 6, the method may include the following steps:

[0161] S601, a first loss function of the large model is obtained based on the predicted classification result corresponding to the sample image data.

[0162] The sample image data includes multiple sample images.

[0163] It should be noted that the first loss function is not strictly limited. For example, it can include the focal loss function, CE (Cross Entropy), MSE (Mean-Square Error), KL (Kullback-Leibler) divergence, the contrastive loss function, etc.

[0164] In some embodiments, a first loss function for the large model is obtained based on the predicted classification result corresponding to the sample image data. This includes obtaining the sample classification result corresponding to the sample image data and obtaining the first loss function based on the predicted classification result and the sample classification result. The sample classification result can refer to the manually labeled, true classification result of the sample image data.

[0165] In some embodiments, taking the focal loss function as an example, the first loss function $L_1$ is as follows: $L_1 = -\alpha_t (1 - p_{t1})^{\gamma} \log(p_{t1})$

[0166] where $p_{t1}$ is the predicted probability that a pixel in the sample image belongs to the target category, $\alpha_t$ is a balancing factor used to adjust the weight ratio between positive and negative samples, and $\gamma$ is a modulating factor used to control the weight of difficult-to-classify samples.
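The behaviour of this loss for a single pixel can be sketched as follows. The sketch assumes the usual focal loss convention in which the probability and the balancing factor are taken as p and alpha for a target pixel and as 1 - p and 1 - alpha otherwise; the default values alpha = 0.25 and gamma = 2.0 are common choices and are not taken from the application.

```python
import math

def focal_loss(p, is_target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss -alpha_t * (1 - p_t)**gamma * log(p_t) for one prediction."""
    p_t = p if is_target else 1.0 - p
    alpha_t = alpha if is_target else 1.0 - alpha
    # Clamp p_t away from zero so that the logarithm stays finite.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

easy = focal_loss(0.9, is_target=True)   # confident and correct
hard = focal_loss(0.1, is_target=True)   # confident and wrong
print(easy < hard)  # True
```

With gamma set to 0 and alpha set to 1, the expression reduces to the cross-entropy term -log(p_t) for a target pixel, which shows that gamma is what down-weights easy samples.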

[0167] S602, a second loss function of the large model is obtained based on the predicted bounding box corresponding to the sample image data.

[0168] It should be noted that the second loss function is not strictly limited. For example, it may include the IoU (Intersection over Union) loss function, the focal loss function, CE, MSE, KL divergence, the contrastive loss function, etc.

[0169] In some embodiments, a second loss function for the large model is obtained based on the predicted bounding boxes corresponding to the sample image data. This includes obtaining the sample bounding boxes corresponding to the sample image data and obtaining the second loss function based on the predicted bounding boxes and the sample bounding boxes. The sample bounding boxes can refer to manually labeled ground truth bounding boxes corresponding to the sample image data.

[0170] In some embodiments, taking the IoU loss function as an example, the second loss function L2 is as follows:

[0171] where A″ is the predicted bounding box, B″ is the sample bounding box, Intersection(A″,B″) is the intersection area between A″ and B″, and Union(A″,B″) is the union area between A″ and B″.
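The expression for the second loss function is not reproduced in this text. As a hedged illustration, the sketch below assumes one common form of the IoU loss, namely 1 - Intersection(A″,B″) / Union(A″,B″), for axis-aligned boxes given as (x_min, y_min, x_max, y_max); other forms, such as the negative logarithm of the ratio, are also in use, and the application may intend a different one.

```python
def iou_loss(box_a, box_b):
    """Assumed form: 1 - IoU for axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap along each axis is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return 1.0 - intersection / union if union > 0 else 1.0

# Two 2x2 boxes that overlap in a 1x1 square: intersection 1, union 7.
print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```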

[0172] S603, a third loss function of the large model is obtained based on the predicted segmentation mask corresponding to the sample image data.

[0173] It should be noted that the third loss function is not strictly limited. For example, it may include the focal loss function, the Dice loss function, the IoU loss function, CE, MSE, KL divergence, the contrastive loss function, etc.

[0174] In some embodiments, a third loss function for the large model is obtained based on the predicted segmentation mask corresponding to the sample image data. This includes obtaining the sample segmentation mask corresponding to the sample image data and obtaining the third loss function based on the predicted segmentation mask and the sample segmentation mask. The sample segmentation mask can refer to a manually annotated, true segmentation mask corresponding to the sample image data.

[0175] In some embodiments, a third loss function of the large model is obtained based on the predicted segmentation mask corresponding to the sample image data, including obtaining a seventh loss function of the large model based on the predicted probability that the predicted segmentation mask belongs to the target category, obtaining an eighth loss function of the large model based on the predicted segmentation mask and the sample segmentation mask, and obtaining a third loss function based on at least one of the seventh and eighth loss functions.

[0176] In some embodiments, taking the focal loss function as an example, the seventh loss function $L_7$ is as follows: $L_7 = -\alpha_t (1 - p_{t2})^{\gamma} \log(p_{t2})$

[0177] where $p_{t2}$ is the predicted probability that the predicted segmentation mask belongs to the target category.

[0178] In some embodiments, taking the Dice loss function as the eighth loss function as an example, the eighth loss function L8 is as follows:

[0179] Where A′ is the predicted segmentation mask, B′ is the sample segmentation mask, |A′∩B′| is the intersection area between A′ and B′, |A′| is the area of A′, and |B′| is the area of B′.
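The expression for the eighth loss function is not reproduced in this text. The sketch below assumes the widely used form of the Dice loss, 1 - 2|A′∩B′| / (|A′| + |B′|), applied to masks flattened into sequences of 0/1 values; the small constant `eps` is added only to avoid division by zero for empty masks and is not part of the application.

```python
def dice_loss(pred_mask, sample_mask, eps=1e-7):
    """Assumed form: 1 - 2|A∩B| / (|A| + |B|) for flat sequences of 0/1 values."""
    if len(pred_mask) != len(sample_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a * b for a, b in zip(pred_mask, sample_mask))
    total = sum(pred_mask) + sum(sample_mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

# |A| = 2, |B| = 1, |A∩B| = 1, so the loss is close to 1 - 2/3.
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))
```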

[0180] In some embodiments, a third loss function is obtained based on at least one of the seventh loss function and the eighth loss function, including using the seventh loss function as the third loss function, or using the eighth loss function as the third loss function, or performing a weighted summation of the seventh loss function and the eighth loss function to obtain the third loss function.

[0181] S604. Based on at least one of the predicted classification result, predicted bounding box, and predicted segmentation mask corresponding to the sample image data, obtain the predicted target detection result corresponding to the sample image data.

[0182] S605, a fourth loss function of the large model is obtained based on the predicted target detection result.

[0183] It should be noted that the fourth loss function is not strictly limited. For example, it may include the focal loss function, the Dice loss function, the IoU loss function, CE, MSE, KL divergence, the contrastive loss function, etc.

[0184] In some embodiments, a fourth loss function for the large model is obtained based on the predicted target detection results. This includes obtaining the sample target detection results corresponding to the sample image data, and obtaining the fourth loss function based on the predicted target detection results and the sample target detection results. The sample target detection results can refer to manually labeled, real target detection results corresponding to the sample image data.

[0185] In some embodiments, obtaining the fourth loss function of the large model based on the predicted target detection result includes: obtaining a fifth loss function of the large model based on the number of break points contained in the predicted weld trajectory, obtaining a sixth loss function of the large model based on the target distance between two adjacent break points contained in the predicted weld trajectory, and obtaining the fourth loss function based on at least one of the fifth and sixth loss functions. Therefore, in weld detection scenarios, taking into account the number of break points contained in the predicted weld trajectory and/or the target distance between break points when obtaining the fourth loss function helps improve the continuity of the weld trajectory.

[0186] It should be noted that the target distance refers to the distance between two adjacent break points contained in the predicted weld trajectory.

[0187] In some embodiments, the fifth loss function $L_5$ is as follows: $L_5 = N_{dis} = |D| - 1$

[0188] where $N_{dis}$ is the number of break points contained in the predicted weld trajectory, and $|D|$ is the number of connected segments contained in the predicted weld trajectory.

[0189] In some embodiments, obtaining the sixth loss function of the large model based on the target distance between two adjacent break points contained in the predicted weld trajectory includes obtaining the sum of the multiple target distances corresponding to the predicted weld trajectory as the sixth loss function.

[0190] In some embodiments, the sixth loss function L6 is as follows:

[0191] where $D_k$ is the k-th connected segment contained in the predicted weld trajectory, $D_{k+1}$ is the (k+1)-th connected segment contained in the predicted weld trajectory, $x$ is a position point on $D_k$, $y$ is a position point on $D_{k+1}$, and $\min \|x - y\|$ is the Euclidean distance between two adjacent break points.
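The fifth and sixth loss functions can be illustrated together. The expression for the sixth loss function is not reproduced in this text; following [0189], the sketch assumes that it is the sum, over each pair of adjacent connected segments, of the smallest Euclidean distance between a point of the first segment and a point of the next. It also assumes that the predicted weld trajectory is already available as a list of connected segments ordered along the weld, each a list of 2D points; how the segments are extracted is not covered here.

```python
import math

def break_count_loss(segments):
    """L5 = |D| - 1, where segments is the list D of connected segments."""
    return max(len(segments) - 1, 0)

def gap_distance_loss(segments):
    """Assumed L6: sum over adjacent segments of the smallest point-to-point distance."""
    total = 0.0
    for current, following in zip(segments, segments[1:]):
        total += min(math.dist(x, y) for x in current for y in following)
    return total

trajectory = [
    [(0, 0), (1, 0), (2, 0)],  # D_1
    [(5, 0), (6, 0)],          # D_2, separated from D_1 by a gap of 3
    [(6, 4), (6, 5)],          # D_3, separated from D_2 by a gap of 4
]
print(break_count_loss(trajectory))   # 2
print(gap_distance_loss(trajectory))  # 7.0
```

A fully continuous trajectory consists of a single segment, for which both values are zero.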

[0192] It should be noted that, for details on obtaining the fourth loss function based on at least one of the fifth and sixth loss functions, reference may be made to the description in the above embodiment of obtaining the third loss function based on at least one of the seventh and eighth loss functions, which will not be repeated here.

[0193] S606, the large model is trained based on at least two of the first, second, third, and fourth loss functions.

[0194] In some embodiments, the large model is trained based on at least two of the first loss function, the second loss function, the third loss function, and the fourth loss function, including obtaining the total loss function of the large model based on at least two of the first loss function, the second loss function, the third loss function, and the fourth loss function, and training the large model based on the total loss function.
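One way to obtain the total loss function is a weighted summation of whichever loss terms are used. The sketch below assumes this form, by analogy with the weighted summation described for the third loss function; the term names and the default weight of 1.0 are illustrative and are not specified by the application.

```python
def total_loss(losses, weights=None):
    """Weighted sum over the supplied loss terms (at least two are required)."""
    if len(losses) < 2:
        raise ValueError("at least two loss terms are required")
    weights = weights or {}
    # Terms without an explicit weight contribute with weight 1.0.
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

terms = {"classification": 0.5, "bounding_box": 0.25, "mask": 1.0}
print(total_loss(terms, weights={"mask": 2.0}))  # 0.5 + 0.25 + 2.0 = 2.75
```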

[0195] It should be noted that, for details on obtaining the total loss function of the large model based on at least two of the first, second, third, and fourth loss functions, reference may be made to the description in the above embodiments of obtaining the third loss function based on at least one of the seventh and eighth loss functions, which will not be repeated here.

[0196] In summary, the model training method according to the embodiments of this application can comprehensively consider at least two of the first loss function, the second loss function, the third loss function, and the fourth loss function to train the large model. This means that at least two of the classification loss, bounding box loss, segmentation mask loss, and detection result loss can be comprehensively considered to jointly optimize the large model. The loss functions of different tasks can jointly guide the learning process of the large model, which helps to improve the accuracy of the large model in image classification tasks, bounding box generation tasks, and mask generation tasks.

[0197] To achieve the above embodiments, this application also proposes a target detection device based on a large model.

[0198] Figure 7 is a schematic diagram of the structure of a target detection device based on a large model provided in an embodiment of this application.

[0199] As shown in Figure 7, the target detection device 100 based on a large model (hereinafter referred to as the target detection device) includes: an acquisition module 110, a processing module 120, a conversion module 130, a fusion module 140, and a detection module 150.

[0200] The acquisition module 110 is used to acquire multimodal object data, wherein the multimodal object data includes first image data of the object and first point cloud data of the object;

[0201] Processing module 120 is used to input the multimodal object data into a large model, wherein the large model includes a first encoder;

[0202] The conversion module 130 is used to perform at least one of coordinate transformation and dimension transformation on the object data of the first modality through the first encoder to obtain the target transformation data of the first modality;

[0203] The fusion module 140 is used to fuse the target conversion data of the first modality and the object data of the second modality through the first encoder to obtain fused data;

[0204] The detection module 150 is used to perform target detection based on the fused data through the remaining networks other than the first encoder in the large model.

[0205] Furthermore, in one possible implementation of this application embodiment, the conversion module 130 is further configured to: in response to the inconsistency between the first coordinate system of the object data of the first modality and the second coordinate system of the object data of the second modality, perform coordinate transformation on the object data of the first modality through the first encoder to obtain first converted data in the second coordinate system; in response to the inconsistency between the dimension of the first converted data and the dimension of the object data of the second modality, perform dimension transformation on the first converted data through the first encoder to obtain second converted data, which serves as the target converted data of the first modality, wherein the dimension of the second converted data is consistent with the dimension of the object data of the second modality; and in response to the consistency between the dimension of the first converted data and the dimension of the object data of the second modality, use the first converted data as the target converted data of the first modality through the first encoder.

[0206] Furthermore, in one possible implementation of this application embodiment, the conversion module 130 is further configured to: in response to the inconsistency between the dimension of the object data of the first modality and the dimension of the object data of the second modality, perform dimension conversion on the object data of the first modality through the first encoder to obtain third converted data, wherein the dimension of the third converted data is consistent with the dimension of the object data of the second modality; in response to the inconsistency between the third coordinate system of the third converted data and the second coordinate system of the object data of the second modality, perform coordinate conversion on the third converted data through the first encoder to obtain fourth converted data in the second coordinate system, which serves as the target converted data of the first modality; and in response to the consistency between the third coordinate system and the second coordinate system, use the third converted data as the target converted data of the first modality through the first encoder.
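As a non-limiting illustration, the two conversion orders described above (coordinate-first and dimension-first) can be sketched as a small piece of control flow. The names `ModalData`, `to_target`, `convert_frame`, and `convert_dim` are hypothetical and do not appear in this application; the sketch also assumes that a conversion step is simply skipped when the corresponding property is already consistent.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ModalData:
    """Hypothetical container for the object data of one modality."""
    frame: str    # coordinate system label, e.g. "lidar" or "camera"
    ndim: int     # dimension of the data, e.g. 3 for points, 2 for pixels
    payload: Any  # the actual samples (opaque in this sketch)


def to_target(first: ModalData,
              second: ModalData,
              convert_frame: Callable[[ModalData, str], ModalData],
              convert_dim: Callable[[ModalData, int], ModalData],
              coordinate_first: bool = True) -> ModalData:
    """Return target conversion data of the first modality.

    Each step runs only when the first modality is inconsistent with the
    second modality in that respect (an assumption of this sketch).
    """
    steps = [
        (lambda d: d.frame != second.frame,
         lambda d: convert_frame(d, second.frame)),
        (lambda d: d.ndim != second.ndim,
         lambda d: convert_dim(d, second.ndim)),
    ]
    if not coordinate_first:
        steps.reverse()  # dimension conversion first, then coordinates

    data = first
    for is_needed, apply in steps:
        if is_needed(data):
            data = apply(data)
    return data
```

When both properties already match, the function returns the input unchanged, which corresponds to using the converted data directly as the target conversion data.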

[0207] Furthermore, in one possible implementation of this application embodiment, the conversion module 130 is further configured to: perform coordinate transformation on the first point cloud data through the first encoder to obtain second point cloud data in the camera coordinate system; and perform dimension transformation on the second point cloud data through the first encoder to obtain third point cloud data, which serves as the target conversion data of the point cloud modality, wherein the dimension of the third point cloud data is consistent with the dimension of the first image data.
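As a non-limiting illustration, one plausible reading of this point cloud conversion is a rigid transform into the camera coordinate system followed by a pinhole projection onto a depth map with the same height and width as the image. This application does not specify the projection model; the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-point rule, and the function names below are assumptions made for the sketch.

```python
def lidar_to_camera(points, rotation, translation):
    """Apply p_cam = R * p + t to every 3D point (R is 3x3, t has length 3)."""
    converted = []
    for p in points:
        converted.append(tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        ))
    return converted


def project_to_depth_map(points_cam, fx, fy, cx, cy, height, width):
    """Project camera-frame points onto a height x width depth map.

    A value of 0.0 marks pixels that no point falls on (sketch convention).
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0:
            continue  # point lies behind the camera plane
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            # keep the nearest point when several land on one pixel
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

The resulting depth map has the same two-dimensional layout as the image, so the two modalities can be segmented into the same image blocks.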

[0208] Furthermore, in one possible implementation of this application embodiment, the fusion module 140 is further configured to: extract features from the first image data using the first encoder to obtain a first feature; extract features from the third point cloud data using the first encoder to obtain a second feature; and stitch the first feature and the second feature together using the first encoder to obtain the fused data.

[0209] Furthermore, in one possible implementation of this application embodiment, the fusion module 140 is further configured to: segment the first image data using the first encoder to obtain second image data for each of N image blocks, where N is a positive integer; and extract features from the second image data of the i-th image block using the first encoder to obtain a first feature of the i-th image block, where i is a positive integer not greater than N.

[0210] The fusion module 140 is further configured to: segment the third point cloud data using the first encoder to obtain fourth point cloud data for each of N image blocks; and extract features from the fourth point cloud data of the i-th image block using the first encoder to obtain a second feature of the i-th image block.

[0211] Furthermore, in one possible implementation of this application embodiment, the fusion module 140 is further configured to: stitch together the first feature and the second feature of the i-th image block using the first encoder to obtain the fused data of the i-th image block.
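As a non-limiting illustration, the block-wise feature extraction and stitching described above can be sketched as follows. The sketch assumes square, non-overlapping blocks and uses plain flattening as a stand-in for the feature extraction of the first encoder; `split_into_blocks`, `block_feature`, and `fuse_blocks` are hypothetical names.

```python
def split_into_blocks(grid, block):
    """Split an H x W grid into non-overlapping block x block tiles (row-major)."""
    height, width = len(grid), len(grid[0])
    if height % block or width % block:
        raise ValueError("grid size must be a multiple of the block size")
    tiles = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            tiles.append([row[c:c + block] for row in grid[r:r + block]])
    return tiles


def block_feature(tile):
    """Stand-in feature extractor: flatten the tile into a vector.

    A real encoder would apply a learned projection here.
    """
    return [value for row in tile for value in row]


def fuse_blocks(image, depth_map, block):
    """Concatenate the image feature and point cloud feature of each block."""
    image_tiles = split_into_blocks(image, block)
    cloud_tiles = split_into_blocks(depth_map, block)
    return [block_feature(a) + block_feature(b)
            for a, b in zip(image_tiles, cloud_tiles)]
```

Because both inputs share the same grid, the i-th image block and the i-th point cloud block cover the same region, and the fused data contains one vector per block.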

[0212] Furthermore, in one possible implementation of this application embodiment, the large model further includes a feature extraction network, a decoder, a second encoder, and a multilayer perceptron;

[0213] The detection module 150 is further configured to: extract features from the fused data through the feature extraction network to obtain a third feature; perform target detection based on the third feature through the decoder to obtain a fourth feature; extract features from the third feature and the fused data through the second encoder to obtain a fifth feature; and obtain a segmentation mask corresponding to the first image data through the multilayer perceptron based on the fourth feature and the fifth feature.

[0214] Furthermore, in one possible implementation of this application embodiment, the feature extraction network includes a backbone network, a position encoder, and a third encoder;

[0215] The detection module 150 is further configured to: extract features from the fused data through the backbone network to obtain a sixth feature; encode the sixth feature through the position encoder to obtain position encoded data of the sixth feature; and extract features from the sixth feature and the position encoded data through the third encoder to obtain the third feature.

[0216] Furthermore, in one possible implementation of this application embodiment, the detection module 150 is further configured to: obtain a classification result corresponding to the first image data based on the fourth feature using the multilayer perceptron; and obtain a bounding box corresponding to the first image data based on the fourth feature using the multilayer perceptron.
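As a non-limiting illustration, the data flow through the remaining networks (from the fused data to the classification result, bounding box, and segmentation mask) can be sketched as a wiring diagram in code. Each sub-network is passed in as an opaque callable; the class name `RemainingNetwork` and the three separate head callables are assumptions of this sketch, since the application refers to a single multilayer perceptron.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Feature = Any  # placeholder type; a real system would use tensors


@dataclass
class RemainingNetwork:
    """Wires the sub-networks together; each field is a hypothetical callable."""
    backbone: Callable[[Feature], Feature]
    position_encoder: Callable[[Feature], Feature]
    third_encoder: Callable[[Feature, Feature], Feature]
    decoder: Callable[[Feature], Feature]
    second_encoder: Callable[[Feature, Feature], Feature]
    mask_head: Callable[[Feature, Feature], Feature]
    class_head: Callable[[Feature], Feature]
    box_head: Callable[[Feature], Feature]

    def __call__(self, fused: Feature) -> Dict[str, Feature]:
        # Feature extraction network: backbone -> position encoding -> third encoder
        sixth = self.backbone(fused)
        position = self.position_encoder(sixth)
        third = self.third_encoder(sixth, position)
        # Decoder and second encoder both consume the third feature
        fourth = self.decoder(third)
        fifth = self.second_encoder(third, fused)
        # Multilayer perceptron heads
        return {
            "classification": self.class_head(fourth),
            "bounding_box": self.box_head(fourth),
            "segmentation_mask": self.mask_head(fourth, fifth),
        }
```

Only the segmentation mask depends on the fifth feature; the classification result and the bounding box are derived from the fourth feature alone.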

[0217] Furthermore, in one possible implementation of this application embodiment, the target detection device 100 further includes: a training module, configured to: obtain a first loss function of the large model based on the predicted classification result corresponding to the sample image data; obtain a second loss function of the large model based on the predicted bounding box corresponding to the sample image data; obtain a third loss function of the large model based on the predicted segmentation mask corresponding to the sample image data; obtain a predicted target detection result corresponding to the sample image data based on at least one of the predicted classification result, predicted bounding box, and predicted segmentation mask corresponding to the sample image data; obtain a fourth loss function of the large model based on the predicted target detection result; and train the large model based on at least two of the first loss function, the second loss function, the third loss function, and the fourth loss function.
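As a non-limiting illustration, training on at least two of the four loss functions can be sketched as a weighted sum over whichever loss terms are selected. The weighted-sum form, the default weight of 1.0, and the name `total_loss` are assumptions of this sketch; the application does not state how the loss functions are combined.

```python
def total_loss(losses, weights=None):
    """Combine the selected loss terms into one training objective.

    losses:  dict mapping a loss name to its scalar value, e.g.
             {"classification": 0.7, "bounding_box": 1.2}
    weights: optional dict of per-loss weights (missing names default to 1.0)
    """
    if len(losses) < 2:
        raise ValueError("at least two loss terms are required")
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value
               for name, value in losses.items())
```

For example, `total_loss({"classification": 1.0, "bounding_box": 2.0}, {"bounding_box": 0.5})` halves the contribution of the bounding box term.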

[0218] Furthermore, in one possible implementation of this application embodiment, the training module is further configured to: obtain a fifth loss function of the large model based on the number of break points contained in the predicted weld seam trajectory; obtain a sixth loss function of the large model based on the target distance between two adjacent break points contained in the predicted weld seam trajectory; and obtain the fourth loss function based on at least one of the fifth loss function and the sixth loss function.

[0219] Furthermore, in one possible implementation of this application embodiment, the training module is further configured to: obtain the sum of the multiple target distances corresponding to the predicted weld seam trajectory as the sixth loss function.
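As a non-limiting illustration, the fifth and sixth loss functions can be sketched from an ordered list of points at which the predicted trajectory is interrupted. The sketch assumes that the target distance is the Euclidean distance between consecutive points and that the fourth loss function is a weighted sum of the two terms; neither choice is stated in this application, and the function names are hypothetical.

```python
import math


def count_loss(points):
    """Fifth loss: number of interruption points on the predicted trajectory."""
    return float(len(points))


def gap_loss(points):
    """Sixth loss: sum of distances between each pair of adjacent points.

    Euclidean distance is an assumption of this sketch.
    """
    return float(sum(math.dist(a, b) for a, b in zip(points, points[1:])))


def trajectory_loss(points, w_count=1.0, w_gap=1.0):
    """Fourth loss as a weighted sum; set a weight to 0.0 to use only one term."""
    return w_count * count_loss(points) + w_gap * gap_loss(points)
```

A trajectory with no interruptions yields zero for both terms, so minimizing this loss favours continuous predicted trajectories.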

[0220] It should be noted that the foregoing explanation of the target detection method embodiment based on large models also applies to the target detection device based on large models in this embodiment, and will not be repeated here.

[0221] In summary, the target detection device based on a large model in this application embodiment can use multimodal data for target detection, which improves the robustness, generalization and accuracy of target detection. It can use the first encoder to perform at least one of coordinate transformation and dimension transformation on the object data of a certain modality to obtain the target conversion data of that modality, so as to realize multimodal data fusion. Then, the remaining networks of the large model are used to perform target detection based on the fused data, which simplifies user operation, improves target detection efficiency, optimizes the user experience in target detection scenarios, and is suitable for weld seam inspection scenarios.

[0222] To implement the above embodiments, this application also proposes an electronic device, including: a processor and a memory communicatively connected to the processor; the memory stores computer execution instructions; the processor executes the computer execution instructions stored in the memory to implement the target detection method based on a large model provided in the foregoing embodiments.

[0223] To implement the above embodiments, this application also proposes a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the target detection method based on a large model provided in the foregoing embodiments.

[0224] To implement the above embodiments, this application also proposes a computer program product, including a computer program that, when executed by a processor, implements the target detection method based on a large model provided in the foregoing embodiments.

[0225] The electronic device, computer-readable storage medium, and computer program product provided in this application can perform target detection using multimodal data, improving the robustness, generalization, and accuracy of target detection. They can use a first encoder to perform at least one of coordinate transformation and dimension transformation on the object data of a certain modality to obtain the target conversion data of that modality, thereby realizing multimodal data fusion. Target detection is then performed based on the fused data through the remaining networks of the large model, which simplifies user operation, improves target detection efficiency, and optimizes the user experience in target detection scenarios. They are suitable for weld seam inspection scenarios.

[0226] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in this application all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0227] It should be noted that personal information collected from users should be used for legitimate and reasonable purposes and should not be shared or sold outside of these legitimate uses. Furthermore, such collection / sharing should only be conducted after receiving the user's informed consent, including but not limited to notifying the user to read the user agreement / user notice and sign an agreement / authorization that includes authorization of relevant user information before the user uses the function. In addition, any necessary steps must be taken to protect and safeguard access to such personal information data and ensure that others with access to personal information data comply with their privacy policies and procedures.

[0228] This application is intended to provide an implementation scheme for users to selectively prevent the use or access to their personal information data. Specifically, this disclosure is intended to provide hardware and / or software to prevent or block access to such personal information data. Once personal information data is no longer needed, risks can be minimized by restricting data collection and deleting data. Furthermore, where applicable, such personal information is de-identified to protect user privacy.

[0229] In the foregoing descriptions of the embodiments, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0230] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0231] Any process or method description in the flowchart or otherwise described herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this application pertain.

[0232] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, then editing, interpreting, or otherwise processing it as necessary, and then storing it in a computer memory.

[0233] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0234] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.

[0235] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

[0236] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.

[0237] All embodiments disclosed herein can be executed individually or in combination with other embodiments, and are all considered to be within the scope of protection claimed by this disclosure.

Claims

1. A large model-based target detection method, comprising: obtaining multi-modal object data, wherein the multi-modal object data comprises first image data of an object and first point cloud data of the object; inputting the multi-modal object data into a large model, wherein the large model comprises a first encoder; performing at least one of coordinate conversion and dimension conversion on the object data of the first modality by the first encoder to obtain target conversion data of the first modality; fusing the target conversion data of the first modality and the object data of the second modality by the first encoder to obtain fusion data; performing target detection based on the fusion data by the remaining network of the large model other than the first encoder.

2. The method of claim 1, wherein performing at least one of coordinate conversion and dimension conversion on the object data of the first modality by the first encoder to obtain the target conversion data of the first modality comprises: in response to the first coordinate system of the object data of the first modality being inconsistent with the second coordinate system of the object data of the second modality, performing coordinate conversion on the object data of the first modality by the first encoder to obtain first conversion data in the second coordinate system; in response to the dimension of the first conversion data being inconsistent with the dimension of the object data of the second modality, performing dimension conversion on the first conversion data by the first encoder to obtain second conversion data as the target conversion data of the first modality, wherein the dimension of the second conversion data is consistent with the dimension of the object data of the second modality; and in response to the dimension of the first conversion data being consistent with the dimension of the object data of the second modality, taking the first conversion data as the target conversion data of the first modality by the first encoder.

3. The method of claim 1, wherein performing at least one of coordinate conversion and dimension conversion on the object data of the first modality by the first encoder to obtain the target conversion data of the first modality comprises: in response to the dimension of the object data of the first modality being inconsistent with the dimension of the object data of the second modality, performing dimension conversion on the object data of the first modality by the first encoder to obtain third conversion data, wherein the dimension of the third conversion data is consistent with the dimension of the object data of the second modality; in response to the third coordinate system of the third conversion data being inconsistent with the second coordinate system of the object data of the second modality, performing coordinate conversion on the third conversion data by the first encoder to obtain fourth conversion data in the second coordinate system as the target conversion data of the first modality; and in response to the third coordinate system being consistent with the second coordinate system, taking the third conversion data as the target conversion data of the first modality by the first encoder.

4. The method of claim 1, wherein performing at least one of coordinate conversion and dimension conversion on the object data of the first modality by the first encoder to obtain the target conversion data of the first modality comprises: performing coordinate conversion on the first point cloud data by the first encoder to obtain second point cloud data in a camera coordinate system; and performing dimension conversion on the second point cloud data by the first encoder to obtain third point cloud data as the target conversion data of the point cloud modality, wherein the dimension of the third point cloud data is consistent with the dimension of the first image data.

5. The method of claim 4, wherein fusing the target conversion data of the first modality and the object data of the second modality by the first encoder to obtain the fusion data comprises: performing feature extraction on the first image data by the first encoder to obtain a first feature; performing feature extraction on the third point cloud data by the first encoder to obtain a second feature; and splicing the first feature and the second feature by the first encoder to obtain the fusion data.

6. The method of claim 5, wherein performing feature extraction on the first image data by the first encoder to obtain the first feature comprises: segmenting the first image data by the first encoder to obtain second image data of each of N image blocks, wherein N is a positive integer; and performing feature extraction on the second image data of an i-th image block by the first encoder to obtain a first feature of the i-th image block, wherein i is a positive integer not greater than N; and wherein performing feature extraction on the third point cloud data by the first encoder to obtain the second feature comprises: segmenting the third point cloud data by the first encoder to obtain fourth point cloud data of each of the N image blocks; and performing feature extraction on the fourth point cloud data of the i-th image block by the first encoder to obtain a second feature of the i-th image block.

7. The method of claim 6, wherein splicing the first feature and the second feature by the first encoder to obtain the fusion data comprises: splicing the first feature of the i-th image block and the second feature of the i-th image block by the first encoder to obtain fusion data of the i-th image block.

8. The method of any one of claims 1-7, wherein the large model further comprises a feature extraction network, a decoder, a second encoder, and a multilayer perceptron; and performing target detection based on the fusion data by the remaining network of the large model other than the first encoder comprises: performing feature extraction on the fusion data by the feature extraction network to obtain a third feature; performing target detection based on the third feature by the decoder to obtain a fourth feature; performing feature extraction on the third feature and the fusion data by the second encoder to obtain a fifth feature; and obtaining, by the multilayer perceptron, a segmentation mask corresponding to the first image data based on the fourth feature and the fifth feature.

9. The method of claim 8, wherein the feature extraction network comprises a backbone network, a position encoder, and a third encoder; and performing feature extraction on the fusion data by the feature extraction network to obtain the third feature comprises: performing feature extraction on the fusion data by the backbone network to obtain a sixth feature; performing position encoding on the sixth feature by the position encoder to obtain position encoding data of the sixth feature; and performing feature extraction on the sixth feature and the position encoding data by the third encoder to obtain the third feature.

10. The method of claim 8 or 9, wherein the method further comprises: obtaining, by the multilayer perceptron, a classification result corresponding to the first image data based on the fourth feature; and obtaining, by the multilayer perceptron, a bounding box corresponding to the first image data based on the fourth feature.

11. The method of claim 10, wherein the method further comprises: obtaining a first loss function of the large model based on a predicted classification result corresponding to sample image data; obtaining a second loss function of the large model based on a predicted bounding box corresponding to the sample image data; obtaining a third loss function of the large model based on a predicted segmentation mask corresponding to the sample image data; obtaining a predicted target detection result corresponding to the sample image data based on at least one of the predicted classification result, the predicted bounding box and the predicted segmentation mask; obtaining a fourth loss function of the large model based on the predicted target detection result; and training the large model based on at least two of the first loss function, the second loss function, the third loss function and the fourth loss function.

12. The method of claim 11, wherein obtaining the fourth loss function of the large model based on the predicted target detection result comprises: obtaining a fifth loss function of the large model based on a number of break points contained in a predicted weld seam trajectory; obtaining a sixth loss function of the large model based on a target distance between two adjacent break points contained in the predicted weld seam trajectory; and obtaining the fourth loss function based on at least one of the fifth loss function and the sixth loss function.

13. The method of claim 12, wherein obtaining the sixth loss function of the large model based on the target distance between two adjacent break points contained in the predicted weld seam trajectory comprises: obtaining a sum of a plurality of target distances corresponding to the predicted weld seam trajectory as the sixth loss function.

14. A large model-based target detection device, comprising: an acquisition module configured to acquire multi-modal object data, wherein the multi-modal object data comprises first image data of an object and first point cloud data of the object; a processing module configured to input the multi-modal object data into a large model, wherein the large model comprises a first encoder; a conversion module configured to perform at least one of coordinate conversion and dimension conversion on the object data of the first modality by the first encoder to obtain target conversion data of the first modality; a fusion module configured to fuse the target conversion data of the first modality and the object data of the second modality by the first encoder to obtain fusion data; a detection module configured to perform target detection based on the fusion data by a remaining network of the large model other than the first encoder.

15. An electronic device comprising: a processor, and a memory connected to the processor in communication; the memory stores computer execution instructions; the processor executes the computer execution instructions stored in the memory to implement the method of any one of claims 1-13.

16. A computer-readable storage medium, wherein the computer-readable storage medium stores computer execution instructions, and the computer execution instructions, when executed by a processor, implement the method of any one of claims 1-13.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-13.