Open scene oriented multi-modal visual target retrieval method and system

By generating multimodal representation data of visual objects and constructing a hypergraph structure, learning higher-order association representations, and generating embedding representations of unknown categories, the difficulties of multimodal fusion and association of unknown categories in visual object retrieval in open scenarios are solved, thereby improving retrieval performance.

CN116304147B · Active · Publication Date: 2026-06-16 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2022-09-07
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing techniques for visual object retrieval in open scenarios suffer from semantic gaps between multimodal representations, difficulty in modeling the association between unknown-category and known-category visual objects, poor generalization of the extracted features, and a model cognitive barrier that arises because the categories of the retrieved objects are unknown during training. Together these result in poor retrieval performance.

Method used

The method generates multimodal representation data of visual objects, projects it into a compact latent space, constructs a hypergraph structure to learn high-order association representations of open and known categories, and uses a typical representation memory module to generate embedding representations of unknown categories. The model is trained on known categories and then used for retrieval.

Benefits of technology

It overcomes the semantic gap between different modal representations, improves the retrieval performance for objects of unknown categories, and enhances the retrieval accuracy and robustness in open scenarios.



Abstract

The application relates to a multimodal visual target retrieval method and system for open scenes. The method comprises: generating multimodal representation data of a visual object, and projecting the multimodal representation data onto a compact latent space of the visual object to obtain compact representation data of the visual object; constructing a hypergraph structure based on the compact representation data, and learning high-order association representations of open categories and known categories through the hypergraph structure; and, based on the high-order association representations, generating embedding representation data of unknown categories through a preset typical representation memory module, and applying the embedding representation data of the unknown categories to visual target retrieval in an open scene, using a model trained on preset known categories, to obtain visual objects of the unknown categories. This addresses the natural semantic gap caused by different original data formats and learning network designs, as well as the inability of related-art retrieval methods to predict objects of unknown categories: the semantic gap between modal representations is overcome, and embedding representations of unknown categories can be predicted.

Description

Technical Field

[0001] This application relates to the field of visual target retrieval technology, and in particular to a multimodal visual target retrieval method and system for open scenarios.

Background Technology

[0002] Visual targets typically have multiple modal representations, such as point clouds, voxels, and multiple views. However, due to the completely different original data formats and learning network designs, these representations inherently create semantic gaps. Furthermore, traditional retrieval methods perform poorly in open-scene visual target retrieval because they cannot predict objects of unknown categories. Real life is filled with a large number of objects of unknown categories, which negatively impacts the performance of existing closed-set trained visual target retrieval algorithms. Therefore, this application aims to research a visual target retrieval system for open scenes to expand the application scenarios of existing retrieval algorithms and improve the robustness of retrieval performance when dealing with objects of unknown categories.

[0003] However, the main challenges for visual target retrieval methods in open scenarios are: (1) the difficulty of multimodal fusion caused by the semantic gap of multimodal representation; (2) the difficulty in modeling the association between visual objects of unknown categories and visual objects of known categories; (3) the lack of generalization of features extracted by visual object feature extractors; and (4) the model cognitive barrier caused by the unknown category of the retrieved visual target during training.

Summary of the Invention

[0004] This application provides a multimodal visual target retrieval method and system for open scenarios, which solves the problems of the natural semantic gap caused by different original data formats and learning network designs, and the inability of related retrieval methods to predict objects of unknown categories. It overcomes the semantic gap of different modal representations, and can infer the embedding representation of unknown categories through the memory network using the typical representation of known categories, thereby improving the retrieval performance of unknown category objects in open scenarios.

[0005] The first aspect of this application provides a multimodal visual target retrieval method for open scenes, comprising the following steps: generating multimodal representation data of a visual object, and projecting the multimodal representation data onto the compact latent space of the visual object to obtain compact representation data of the visual object; constructing a hypergraph structure based on the compact representation data, and learning higher-order association representations of open categories and known categories through the hypergraph structure; generating embedding representation data of unknown categories based on the higher-order association representations of open categories and known categories through a preset typical representation memory module, and applying the embedding representation data of unknown categories to visual target retrieval in open scenes using a preset known category training model to obtain visual objects of unknown categories.

[0006] Optionally, generating multimodal representation data of a visual object includes: configuring the acquisition environment for the multimodal representation data; based on the acquisition environment, outputting multiple modal representations of the visual object through a preset capture device, and extracting the basic features of the multiple modal representations to obtain the multimodal representation data.

[0007] Optionally, projecting the multimodal representation data onto the compact latent space of the visual object to obtain the compact representation data of the visual object includes: constructing an autoencoder for the multimodal representation data; training the autoencoder for the multimodal representation data based on a preset loss function to obtain a multimodal autoencoder; and projecting the multimodal representation data onto the compact latent space of the visual object through the multimodal autoencoder to obtain the compact representation data of the visual object.

[0008] Optionally, the step of constructing a hypergraph structure based on the compact representation data and learning higher-order association representations of open and known categories through the hypergraph structure includes: constructing the hypergraph structure using a preset K-nearest neighbor algorithm based on the compact representation data; and learning the node features of the hypergraph structure based on a preset hypergraph convolution iteration formula to obtain higher-order association representations of the open and known categories.

[0009] Optionally, the step of generating embedded representation data for unknown categories based on the higher-order association representations of the open categories and the known categories through a preset typical representation memory module includes: calculating the activation score of each memory anchor in the compact representation data and the preset typical representation memory module; and reconstructing the compact representation data according to each memory anchor and the activation score of each memory anchor to obtain the embedded representation data for the unknown categories.

[0010] Optionally, before generating the embedded representation data of the unknown category through the preset typical representation memory module, the method further includes: constructing a typical representation memory network for the visual object; training and updating the typical representation memory network based on a preset memory reconstruction loss function to obtain the preset typical representation memory module.

[0011] A second aspect of this application provides a multimodal visual target retrieval system for open scenes, comprising: a generation module for generating multimodal representation data of visual objects and projecting the multimodal representation data onto the compact latent space of the visual objects to obtain compact representation data of the visual objects; an association module for constructing a hypergraph structure based on the compact representation data and learning higher-order association representations of open categories and known categories through the hypergraph structure; and a retrieval module for generating embedding representation data of unknown categories based on the higher-order association representations of open categories and known categories through a preset typical representation memory module, and applying the embedding representation data of unknown categories to visual target retrieval in open scenes using a preset known category training model to obtain visual objects of unknown categories.

[0012] Optionally, the generation module is specifically used to: configure the acquisition environment of the multimodal representation data; based on the acquisition environment, output multiple modal representations of the visual object through a preset capture device, and extract the basic features of the multiple modal representations to obtain the multimodal representation data.

[0013] Optionally, the generation module is further configured to: construct an autoencoder for the multimodal representation data; train the autoencoder for the multimodal representation data based on a preset loss function to obtain a multimodal autoencoder; and project the multimodal representation data onto the compact latent space of the visual object through the multimodal autoencoder to obtain the compact representation data of the visual object.

[0014] Optionally, the association module is specifically used to: construct the hypergraph structure based on the compact representation data using a preset K-nearest neighbor algorithm; and learn the node features of the hypergraph structure based on a preset hypergraph convolution iteration formula to obtain the high-order association representation of the open category and the known category.

[0015] Optionally, the retrieval module is specifically used to: calculate the activation score of each memory anchor in the compact representation data and the preset typical representation memory module; and reconstruct the compact representation data based on each memory anchor and the activation score of each memory anchor to obtain the embedded representation data of the unknown category.

[0016] Optionally, before generating the embedded representation data of the unknown category through the preset typical representation memory module, the retrieval module is further configured to: construct the typical representation memory network of the visual object; train and update the typical representation memory network based on a preset memory reconstruction loss function to obtain the preset typical representation memory module.

[0017] A third aspect of this application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the multimodal visual target retrieval method for open scenes as described in the above embodiments.

[0018] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which is executed by a processor to implement the multimodal visual target retrieval method for open scenes as described in the above embodiments.

[0019] Therefore, multimodal representation data of visual objects is generated and projected onto the compact latent space of the visual objects to obtain compact representation data, a hypergraph structure is constructed, and higher-order association representations of open and known categories are learned through the hypergraph structure. Embedding representation data of unknown categories is then generated through a preset typical representation memory module. A model trained on preset known categories applies the embedding representation data of unknown categories to visual target retrieval in open scenes, thereby obtaining visual objects of unknown categories. This approach addresses the inherent semantic gap caused by different raw data formats and learning network designs, and the inability of related retrieval methods to predict objects of unknown categories. It overcomes the semantic gap between different modal representations and enables the memory network to infer the embedding representations of unknown categories from the typical representations of known categories, thereby improving the retrieval performance for objects of unknown categories in open scenes.

[0020] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Description of the Attached Figures

[0021] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0022] Figure 1 is a flowchart of a multimodal visual target retrieval method for open scenarios provided according to an embodiment of this application;

[0023] Figure 2 is a schematic diagram of a retrieval task in an open scenario according to an embodiment of this application;

[0024] Figure 3 is a schematic diagram of a visual target retrieval system architecture for open scenarios according to an embodiment of this application;

[0025] Figure 4 is a block diagram of a multimodal visual target retrieval system for open scenarios according to an embodiment of this application;

[0026] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0027] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.

[0028] The following describes a multimodal visual target retrieval method and system for open scenes according to embodiments of this application, with reference to the accompanying drawings. Addressing the issues mentioned in the background section, such as the inherent semantic gap caused by differences in original data formats and learning network designs, and the inability of related retrieval methods to predict objects of unknown categories, this application provides a multimodal visual target retrieval method for open scenes. In this method, multimodal representation data of visual objects is generated, projected onto the compact latent space of the visual objects to obtain compact representation data of the visual objects, and a hypergraph structure is constructed. High-order association representations of open and known categories are learned through the hypergraph structure, and embedding representation data of unknown categories is generated through a preset typical representation memory module. A model is trained using preset known categories, and the embedding representation data of unknown categories is applied to visual target retrieval in open scenes to obtain visual objects of unknown categories. This solves the problems caused by the natural semantic gap due to the different original data formats and learning network designs, and the inability of related retrieval methods to predict objects of unknown categories. It overcomes the semantic gap of different modal representations and can infer the embedding representation of unknown categories through the memory network using the typical representation of known categories, thus improving the retrieval performance of objects of unknown categories in open scenarios.

[0029] Specifically, Figure 1 is a flowchart of a multimodal visual target retrieval method for open scenarios provided according to an embodiment of this application.

[0030] As shown in Figure 1, this multimodal visual target retrieval method for open scenes includes the following steps:

[0031] In step S101, multimodal representation data of the visual object is generated, and the multimodal representation data is projected onto the compact latent space of the visual object to obtain compact representation data of the visual object.

[0032] Optionally, in some embodiments, generating multimodal representation data of a visual object includes: configuring a data acquisition environment for the multimodal representation data; based on the acquisition environment, outputting multiple modal representations of the visual object through a preset capture device, and extracting the basic features of the multiple modal representations to obtain multimodal representation data.

[0033] It should be understood that, to generate multimodal representation data of visual objects, the embodiments of this application configure the acquisition environment for the multimodal representation data, output multiple modal representations of the visual object through capture devices for the different modalities, and extract the basic features of each modality to obtain the multimodal representation data.

[0034] Optionally, in some embodiments, projecting the multimodal representation data onto the compact latent space of the visual object to obtain the compact representation data of the visual object includes: constructing an autoencoder for the multimodal representation data; training the autoencoder for the multimodal representation data based on a preset loss function to obtain a multimodal autoencoder; and projecting the multimodal representation data onto the compact latent space of the visual object through the multimodal autoencoder to obtain the compact representation data of the visual object.

[0035] It is understood that, to project the multimodal representation data onto the compact latent space of visual objects and obtain compact representation data, the embodiments of this application construct autoencoders for the different modalities, construct the multimodal autoencoder loss functions and train with them, and fuse the compact representations of the modalities to generate the compact representation data of visual objects.

[0036] In step S102, a hypergraph structure is constructed based on the compact representation data, and higher-order association representations of open categories and known categories are learned through the hypergraph structure.

[0037] Optionally, in some embodiments, a hypergraph structure is constructed based on the compact representation data, and higher-order association representations of open categories and known categories are learned through the hypergraph structure, including: constructing a hypergraph structure based on the compact representation data using a preset K-nearest neighbor algorithm; and learning the node features of the hypergraph structure based on a preset hypergraph convolution iteration formula to obtain higher-order association representations of open categories and known categories.

[0038] Specifically, for compact representation data of visual objects, embodiments of this application can use the K-nearest neighbor algorithm to construct a hypergraph structure. Furthermore, hypergraph convolution is used to learn node features and hypergraph structure to obtain high-order association representations of open and known categories.

[0039] In step S103, based on the high-order association representations of open and known categories, embedding representation data of unknown categories is generated through a preset typical representation memory module. The embedding representation data of unknown categories is then applied to visual target retrieval in open scenes using a preset known category training model to obtain visual objects of unknown categories.

[0040] Optionally, in some embodiments, based on the higher-order association representations of open and known categories, embedding representation data of unknown categories is generated through a preset typical representation memory module, including: calculating the activation score of each memory anchor in the compact representation data and the preset typical representation memory module; and reconstructing the compact representation data according to each memory anchor and the activation score of each memory anchor to obtain the embedding representation data of unknown categories.

[0041] Specifically, in this embodiment, the compact representation data of the current data object can be input, and its activation score with each memory anchor in the memory module can be calculated. The compact representation of the current visual object can be reconstructed through the memory anchor and its activation score, and embedded representation data of unknown category can be generated.

[0042] Optionally, in some embodiments, before generating embedded representation data of unknown categories through a preset typical representation memory module, the method further includes: constructing a typical representation memory network for visual objects; training and updating the typical representation memory network based on a preset memory reconstruction loss function to obtain a preset typical representation memory module.

[0043] In this embodiment, a typical representation memory network for visual objects is constructed. A training set is built, and the memory network is trained and updated using the memory reconstruction loss function to obtain the preset typical representation memory module.

[0044] Furthermore, in this embodiment, multimodal representations are constructed from visual objects of known categories, and the system is run iteratively to update the trainable parameters of each layer until the network converges. For visual objects of unknown categories, the system extracts their unknown-category embedding representation data, compares it with the visual objects in the retrieval set, and returns similar visual objects.

[0045] To help those skilled in the art further understand the multimodal visual target retrieval method for open scenes in the embodiments of this application, it is described in detail below with reference to Figures 2 and 3, where Figure 2 is a schematic diagram of a retrieval task in an open scenario according to an embodiment of this application, and Figure 3 is a schematic diagram of the architecture of a visual target retrieval system for open scenarios according to an embodiment of this application.

[0046] First, each modal representation of the three-dimensional visual object is encoded separately, and its embedded representation is extracted.

[0047] The embodiment of this application configures a multimodal representation data acquisition environment. Three modalities are used to represent a visual object: multi-view, point cloud, and voxel. For the multi-view modality, the embodiment first scales the object's 3D coordinates to the range [-0.5, 0.5], and places a 15×15 plane at position (0, 0, -0.7). Twelve virtual cameras are evenly arranged around the object to capture images. For the point cloud modality, the embodiment accumulates and sorts the surface area blocks of the visual object, and randomly generates several 3D points on top of them to construct the point cloud modality's point set representation. For the voxel modality, the embodiment uses unit cubes to evenly divide the 3D coordinates of the visual object. For each unit cube, if it intersects with the surface of the visual object, its value at that position is set to 1; otherwise, it is set to 0.
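The generation steps in paragraph [0047] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the camera radius and height are arbitrary values not given in the text, the point cloud is drawn by area-weighted sampling over mesh triangles (one reading of "accumulates and sorts the surface area blocks"), and voxel occupancy is approximated by marking cells that contain sampled surface points instead of testing exact surface intersection. The 15×15 plane at (0, 0, -0.7) and the rendering itself are omitted.

```python
import numpy as np

def normalize_vertices(vertices):
    """Scale a mesh so every coordinate lies in [-0.5, 0.5]."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()  # longest bounding-box edge
    return (vertices - center) / scale

def camera_positions(n_views=12, radius=2.0, height=0.5):
    """Place n_views cameras evenly on a circle around the z-axis.
    radius and height are assumed values, not taken from the source."""
    angles = 2.0 * np.pi * np.arange(n_views) / n_views
    return np.stack(
        [radius * np.cos(angles), radius * np.sin(angles),
         np.full(n_views, height)], axis=1)

def sample_point_cloud(vertices, faces, n_points, rng):
    """Area-weighted random sampling of points on triangle surfaces."""
    tri = vertices[faces]                                  # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    cum = np.cumsum(areas)                                 # accumulated areas
    idx = np.searchsorted(cum, rng.random(n_points) * cum[-1])
    idx = np.minimum(idx, len(faces) - 1)
    # Uniform barycentric coordinates inside each chosen triangle.
    r1 = np.sqrt(rng.random(n_points))[:, None]
    r2 = rng.random(n_points)[:, None]
    t = tri[idx]
    return (1 - r1) * t[:, 0] + r1 * (1 - r2) * t[:, 1] + r1 * r2 * t[:, 2]

def voxelize(points, resolution=32):
    """Binary occupancy grid: 1 where a cell holds a surface sample, else 0."""
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    ijk = np.floor((points + 0.5) * resolution).astype(int)
    ijk = np.clip(ijk, 0, resolution - 1)
    grid[ijk[:, 0], ijk[:, 1], ijk[:, 2]] = 1
    return grid
```

With a dense enough point sample the occupancy grid approaches the surface-intersection rule described above; a production voxelizer would test triangle-cube intersection directly.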

[0048] Specifically, the capture devices for the different modalities output multiple modal representations of the visual object, from which the basic features of each modality are extracted. Based on the multimodal generation configuration above, multi-view, point cloud, and voxel representations are generated. This embodiment then constructs three basic feature extractors: MVCNN extracts the multi-view basic features, PointNet (a point cloud network) extracts the point cloud basic features, and a ShapeNets model extracts the voxel basic features. The extracted basic features are represented as follows:

[0049] Then, the multimodal representation is projected onto the compact space of the visual object through a multimodal autoencoder.

[0050] In this embodiment, autoencoders for different modalities are constructed. For any modality, the autoencoder of Equation (1) is used to project it onto a compact latent space and reconstruct it.

[0051]

[0052] Where m_k denotes the basic features of the k-th modality representation, and o denotes the compact representation of the visual object based on modality k.

[0053] In this embodiment, a multimodal autoencoder loss function is constructed and trained. For each modality, the autoencoder described above is constructed. Furthermore, two sets of loss functions, equations (2) and (3), are constructed to guide the training of the multimodal autoencoder, as follows:

[0054]

[0055]

[0056] The first loss function is used to ensure that the distance between the representations of the same visual object based on different modalities is as close as possible, and the second loss function is used to ensure that the autoencoder can effectively perform single-modal reconstruction and cross-modal reconstruction.
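Since equations (2) and (3) are not reproduced in this text, the following is only a sketch of what the two losses in paragraph [0056] might look like, assuming squared Euclidean distances, single-layer encoders with a tanh nonlinearity, and linear decoders. All function names and the choice of distance are assumptions made for illustration.

```python
import numpy as np

def encode(x, w_enc):
    """Project basic features of one modality into the compact latent space."""
    return np.tanh(x @ w_enc)

def decode(z, w_dec):
    """Map a compact representation back to one modality's feature space."""
    return z @ w_dec

def alignment_loss(latents):
    """Mean squared distance between the latents of every modality pair,
    pulling representations of the same object together."""
    total, pairs = 0.0, 0
    for a in range(len(latents)):
        for b in range(a + 1, len(latents)):
            total += np.mean(np.sum((latents[a] - latents[b]) ** 2, axis=1))
            pairs += 1
    return total / pairs

def reconstruction_loss(features, latents, decoders):
    """Single-modal (a == b) and cross-modal (a != b) reconstruction error:
    the latent of modality a is decoded into the feature space of modality b."""
    total, count = 0.0, 0
    for a, z in enumerate(latents):
        for b, w_dec in enumerate(decoders):
            total += np.mean(np.sum((decode(z, w_dec) - features[b]) ** 2, axis=1))
            count += 1
    return total / count
```

In training, a weighted sum of the two terms would be minimized over the encoder and decoder weights; the weighting is not specified in the source.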

[0057] In this embodiment, a compact representation of a visual object is generated by fusing the compact representations of multimodal visual objects. The embodiment uses an average aggregation function (4) to fuse the compact representations of visual objects based on different modalities, generating a unique compact representation for each visual object, as shown in the following formula:

[0058]

[0059] Where the aggregation operator in equation (4) is the average aggregation function, and the result is the matrix formed by the compact representations of the N visual objects.
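A small sketch of the average aggregation step, assuming each modality yields an (N, d) array of compact representations. Marking a missing modality with NaN rows and skipping it in the mean is an assumption added here to illustrate the robustness to missing modalities that the text claims; the function name is hypothetical.

```python
import numpy as np

def fuse_mean(latents):
    """Average the per-modality compact representations, each of shape (N, d).
    Entries set to NaN (an assumed convention for a missing modality)
    are ignored, so the mean is taken over the modalities that are present."""
    return np.nanmean(np.stack(latents, axis=0), axis=0)
```

Because the output dimension is d regardless of how many modalities are supplied, the fused representation stays the same size when a modality is added or dropped, unlike feature concatenation.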

[0060] Then, a hypergraph network is constructed to learn higher-order association representations of open and known categories.

[0061] Specifically, the K-nearest neighbor algorithm is used to construct a hypergraph structure for the compact representation of visual objects.
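One common way to build a hypergraph with K-nearest neighbors is to create one hyperedge per object that connects it to its k nearest neighbors in the compact space. The sketch below assumes that construction and Euclidean distance; the patent does not state which variant or distance it uses, and the function name is hypothetical.

```python
import numpy as np

def knn_hypergraph(x, k):
    """Build an incidence matrix H of shape (N nodes, N hyperedges).
    Hyperedge j contains node j and its k nearest neighbours."""
    sq = np.sum(x ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * x @ x.T   # squared distances
    np.fill_diagonal(dist, -np.inf)                    # each node ranks itself first
    nearest = np.argsort(dist, axis=1)[:, :k + 1]      # (N, k + 1) node indices
    n = x.shape[0]
    h = np.zeros((n, n))
    # Row index = member node, column index = hyperedge centred on node j.
    h[nearest.ravel(), np.repeat(np.arange(n), k + 1)] = 1.0
    return h
```

Each column of H then has exactly k + 1 nonzero entries, so every hyperedge groups more than two objects, which is what allows higher-order (beyond pairwise) associations to be modeled.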

[0062] Hypergraph convolution is used to learn node features and hypergraph structure. The hypergraph convolution iteration formula is shown in equation (5):

[0063]

[0064] Where D_v and D_e are the node degree matrix and the hyperedge degree matrix of the hypergraph, respectively, and the remaining symbol denotes the trainable parameters of the hypergraph convolution layer.
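Equation (5) is not reproduced in this text. The sketch below assumes the widely used hypergraph convolution form X' = ReLU(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ), which is consistent with the D_v and D_e definitions above but may differ from the patent's exact formula. The names `hypergraph_conv`, `theta`, and `edge_weights` are illustrative.

```python
import numpy as np

def hypergraph_conv(x, h, theta, edge_weights=None):
    """One hypergraph convolution layer (assumed standard form).

    x: (N, d_in) node features, h: (N, E) incidence matrix,
    theta: (d_in, d_out) trainable weights, edge_weights: (E,) or None.
    Every node and hyperedge is assumed to have nonzero degree.
    """
    w = np.ones(h.shape[1]) if edge_weights is None else edge_weights
    d_v = h @ w                   # node degrees (weighted)
    d_e = h.sum(axis=0)           # hyperedge degrees
    dv_inv_sqrt = 1.0 / np.sqrt(d_v)
    de_inv = 1.0 / d_e
    # D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    left = dv_inv_sqrt[:, None] * h * (w * de_inv)[None, :]
    right = h.T * dv_inv_sqrt[None, :]
    return np.maximum(left @ right @ x @ theta, 0.0)   # ReLU
```

Stacking a few such layers lets each node's feature mix with those of all nodes sharing a hyperedge with it, which is how known-category and unknown-category objects exchange information.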

[0065] Then, the memory module is trained and typical representations of known categories are stored.

[0066] Here, a typical representation memory module for visual objects is constructed. For each object representation, this embodiment calculates its activation score relative to each typical representation in the memory module using equation (6).

[0067]

[0068] Where the function in equation (6) is the activation score metric function.

[0069] In this process, a training set is constructed, and the memory modules are trained and updated using the memory reconstruction loss functions (7) and (8). The designed learning loss for the memory modules is shown below:

[0070]

[0071]

[0072] The first function ensures that the embedding reconstructed from memory is as close as possible, in the embedding space, to the original visual object embedding.

[0073] Furthermore, embedding representations of unknown categories are generated using typical representations of known categories.

[0074] The input is a compact representation of the current data object, and its activation score with each memory anchor in the memory module is calculated using the following formula (9):

[0075]

[0076] The compact representation of the current visual object is reconstructed using memory anchors and their activation scores, and an embedding representation of the unknown category is generated using the formula of Equation (10) as follows:

[0077]

[0078] Where z_i is the reconstructed embedding representation of the visual object.
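The memory read described above can be sketched as follows. Equations (6) to (10) are not reproduced in this text, so the choices here are assumptions: the activation score is taken to be a softmax over cosine similarities with a temperature, the reconstruction z_i is the score-weighted sum of memory anchors, and the reconstruction loss is a mean squared error. Function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def activation_scores(x, anchors, temperature=1.0):
    """Score each compact representation (rows of x) against every memory
    anchor; rows of the result are nonnegative and sum to 1."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = xn @ an.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def reconstruct(x, anchors, temperature=1.0):
    """Rebuild each representation as a score-weighted sum of anchors."""
    return activation_scores(x, anchors, temperature) @ anchors

def memory_reconstruction_loss(x, anchors, temperature=1.0):
    """Mean squared distance between reconstructed and original embeddings."""
    z = reconstruct(x, anchors, temperature)
    return np.mean(np.sum((z - x) ** 2, axis=1))
```

Because an unknown-category object is expressed only through anchors learned from known categories, its embedding lands in the same space as the known-category embeddings, which is what makes the later comparison against the retrieval set meaningful.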

[0079] Finally, the model is trained using known categories and applied to visual object retrieval in open scenes.

[0080] In this process, multimodal representations are constructed from visual objects of known categories, and the system is run iteratively to update the trainable parameters of each layer until the network converges.

[0081] Specifically, for visual objects of unknown categories, the system extracts their unknown category embedding representation and compares it with visual objects in the retrieval set, and returns similar visual objects.
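The final comparison step can be illustrated with a simple nearest-neighbor ranking. The source does not say which similarity measure is used, so cosine similarity is an assumption here, and `retrieve` and `top_k` are hypothetical names.

```python
import numpy as np

def retrieve(query, gallery, top_k=5):
    """Rank gallery embeddings by cosine similarity to one query embedding.
    Returns the indices of the top_k matches and their similarity scores."""
    qn = query / np.linalg.norm(query)
    gn = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = gn @ qn
    order = np.argsort(-sims)[:top_k]   # most similar first
    return order, sims[order]
```

Here `query` would be the unknown-category embedding produced by the memory module, and `gallery` the embeddings of the visual objects in the retrieval set.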

[0082] In summary, the beneficial effects of the embodiments of this application are as follows:

[0083] (1) The embodiments of this application use multimodal autoencoders to fuse different modal representations. Compared with traditional direct feature concatenation, this method can reduce the feature dimension after fusion, increase the density of information, adapt to changes in the number of modalities, and improve the robustness of the fusion algorithm when modalities are missing.

[0084] (2) This application uses a hypergraph neural network to establish a high-order association between unknown categories and known categories. When faced with a retrieval problem of unknown categories, it can be transformed into a retrieval problem of several known categories, thereby greatly improving the retrieval performance in open scenarios.

[0085] (3) This application embodiment designs a memory module for typical representations of visual objects. During training, the memory module decomposes visual objects of known categories into several memory anchor points for typical representations. When encountering visual objects of unknown categories, it compares their visual embeddings with the memory anchor points, thereby improving the generalization ability of the embedding representation of unknown categories and further enhancing the retrieval accuracy of visual targets of unknown categories in open scenes.

[0086] The multimodal visual object retrieval method for open scenes proposed in this application generates multimodal representation data of visual objects and projects it onto the compact latent space of the visual objects to obtain compact representation data. A hypergraph structure is constructed from the compact representation data, and high-order association representations between open and known categories are learned through it. Based on these representations, embedding representation data of unknown categories is generated through a preset typical representation memory module, and a model trained on preset known categories applies this embedding representation data to visual object retrieval in open scenes to obtain visual objects of unknown categories. This addresses the inherent semantic gap caused by different original data formats and learning network designs, as well as the inability of related retrieval methods to predict objects of unknown categories. The method overcomes the semantic gap between different modal representations and can infer the embedding representation of unknown categories through the memory network from the typical representations of known categories, thus improving retrieval performance for unknown-category objects in open scenes.

[0087] Next, referring to the accompanying drawings, a multimodal visual target retrieval system for open scenarios proposed according to embodiments of this application is described.

[0088] Figure 4 is a block diagram of a multimodal visual target retrieval system for open scenarios according to an embodiment of this application.

[0089] As shown in Figure 4, the multimodal visual target retrieval system 10 for open scenarios includes: a generation module 100, an association module 200, and a retrieval module 300.

[0090] The generation module 100 generates multimodal representation data of visual objects and projects the multimodal representation data onto the compact latent space of the visual objects to obtain compact representation data of the visual objects. The association module 200 constructs a hypergraph structure based on the compact representation data and learns higher-order association representations of open categories and known categories through the hypergraph structure. The retrieval module 300 generates embedding representation data of unknown categories, based on the higher-order association representations of open categories and known categories, through a preset typical representation memory module, and uses a model trained on preset known categories to apply the embedding representation data of unknown categories to visual target retrieval in open scenes, obtaining visual objects of unknown categories.

[0091] Optionally, in some embodiments, the generation module 100 is specifically used to: configure the acquisition environment for multimodal representation data; based on the acquisition environment, output multiple modal representations of visual objects through a preset capture device, and extract the basic features of the multiple modal representations to obtain multimodal representation data.

[0092] Optionally, in some embodiments, the generation module 100 is further configured to: construct an autoencoder for the multimodal representation data; train the autoencoder for the multimodal representation data based on a preset loss function to obtain a multimodal autoencoder; and project the multimodal representation data onto the compact latent space of the visual object through the multimodal autoencoder to obtain the compact representation data of the visual object.
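The structure of such an autoencoder can be illustrated with the following sketch, which covers only the forward pass and a reconstruction loss; training is omitted. Everything specific in it is an assumption made for illustration: one linear encoder and decoder per modality, fusion by averaging the codes of whichever modalities are present, a mean squared error loss, and the class and modality names. The preset loss function and network design of the application may differ.

```python
import random


def linear(x, w):
    """Multiply a vector x (length d) by a weight matrix w (d rows, h columns)."""
    return [sum(xi * row[c] for xi, row in zip(x, w)) for c in range(len(w[0]))]


def init(rows, cols, rng):
    """Small random weight matrix with the given shape."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MultimodalAutoencoder:
    """Untrained sketch: per-modality linear encoders/decoders and a shared latent space."""

    def __init__(self, modality_dims, latent_dim, seed=0):
        rng = random.Random(seed)
        self.enc = {m: init(d, latent_dim, rng) for m, d in modality_dims.items()}
        self.dec = {m: init(latent_dim, d, rng) for m, d in modality_dims.items()}

    def encode(self, sample):
        """Average the latent codes of the modalities present in the sample."""
        codes = [linear(x, self.enc[m]) for m, x in sample.items()]
        return [sum(vals) / len(codes) for vals in zip(*codes)]

    def reconstruction_loss(self, sample):
        """Mean squared error between each present modality and its decoding."""
        z = self.encode(sample)
        errors = [
            (a - b) ** 2
            for m, x in sample.items()
            for a, b in zip(linear(z, self.dec[m]), x)
        ]
        return sum(errors) / len(errors)
```

Because the latent code is an average over the modalities that are present, a sample with a missing modality still maps to a vector of the same latent dimension, which is one simple way to obtain the tolerance to missing modalities described in this application.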

[0093] Optionally, in some embodiments, the association module 200 is specifically used to: construct a hypergraph structure based on compact representation data using a preset K-nearest neighbor algorithm; and learn the node features of the hypergraph structure based on a preset hypergraph convolution iteration formula to obtain high-order association representations of open and known categories.
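As an illustration of these two steps, the following sketch builds one hyperedge per node from its K nearest neighbours and applies a single propagation step. The propagation rule X' = Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X, with unit hyperedge weights and without a learnable projection or activation, is a commonly used hypergraph convolution form and is assumed here; the preset iteration formula of the application may differ. The function names and the use of Euclidean distance are likewise hypothetical.

```python
import math


def knn_hyperedges(points, k):
    """One hyperedge per node: the node itself plus its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.append({i, *others[:k]})
    return edges


def hypergraph_conv(features, edges):
    """One step of X' = Dv^-1/2 H De^-1 H^T Dv^-1/2 X, computed as message passing."""
    n, dim = len(features), len(features[0])
    # Node degree d(v): number of hyperedges containing v (at least 1 by construction).
    degree = [sum(v in e for e in edges) for v in range(n)]
    out = [[0.0] * dim for _ in range(n)]
    for e in edges:
        # Node -> hyperedge: average the degree-normalised features of the members.
        edge_feat = [
            sum(features[v][c] / math.sqrt(degree[v]) for v in e) / len(e)
            for c in range(dim)
        ]
        # Hyperedge -> node: add the hyperedge feature back to every member.
        for v in e:
            for c in range(dim):
                out[v][c] += edge_feat[c] / math.sqrt(degree[v])
    return out
```

Stacking several such steps, each followed by a learnable linear map and a nonlinearity, would give a hypergraph neural network in which nodes that share hyperedges exchange information, so an unknown-category node can draw on nearby known-category nodes.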

[0094] Optionally, in some embodiments, the retrieval module 300 is specifically used to: calculate an activation score between the compact representation data and each memory anchor in the preset typical representation memory module; and reconstruct the compact representation data based on each memory anchor and its activation score to obtain the embedding representation data of the unknown category.
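The two operations above can be sketched as follows. The scoring rule (a softmax over dot products between the compact representation and each anchor) and the weighted-sum reconstruction are assumptions chosen for illustration, and the function names are hypothetical; the formulas given in this application take precedence.

```python
import math


def activation_scores(x, anchors):
    """Softmax over dot-product similarity between x and each memory anchor."""
    sims = [sum(a * b for a, b in zip(x, m)) for m in anchors]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(x, anchors):
    """Rebuild x as the activation-weighted sum of the memory anchors."""
    scores = activation_scores(x, anchors)
    dim = len(anchors[0])
    return [sum(w * m[c] for w, m in zip(scores, anchors)) for c in range(dim)]
```

Under these assumptions the reconstructed vector always lies in the convex hull of the anchors, so an unknown-category input is expressed entirely in terms of typical representations learned from known categories.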

[0095] Optionally, in some embodiments, before generating embedded representation data of unknown categories through a preset typical representation memory module, the retrieval module 300 is further configured to: construct a typical representation memory network for visual objects; train and update the typical representation memory network based on a preset memory reconstruction loss function to obtain a preset typical representation memory module.

[0096] It should be noted that the foregoing explanation of the embodiment of the multimodal visual target retrieval method for open scenarios also applies to the multimodal visual target retrieval system for open scenarios in this embodiment, and will not be repeated here.

[0097] The multimodal visual object retrieval system for open scenes proposed in this application generates multimodal representation data of visual objects and projects it onto the compact latent space of the visual objects to obtain compact representation data. A hypergraph structure is constructed from the compact representation data, and high-order association representations between open and known categories are learned through it. Based on these representations, embedding representation data of unknown categories is generated through a preset typical representation memory module, and a model trained on preset known categories applies this embedding representation data to visual object retrieval in open scenes to obtain visual objects of unknown categories. This addresses the inherent semantic gap caused by different original data formats and learning network designs, as well as the inability of related retrieval methods to predict objects of unknown categories. The system overcomes the semantic gap between different modal representations and can infer the embedding representation of unknown categories through the memory network from the typical representations of known categories, thus improving retrieval performance for unknown-category objects in open scenes.

[0098] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device may include:

[0099] A memory 501, a processor 502, and a computer program stored on the memory 501 and executable on the processor 502.

[0100] When the processor 502 executes the program, it implements the multimodal visual target retrieval method for open scenes provided in the above embodiments.

[0101] Furthermore, the electronic device also includes:

[0102] A communication interface 503, used for communication between the memory 501 and the processor 502.

[0103] The memory 501 is used to store computer programs that can run on the processor 502.

[0104] The memory 501 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.

[0105] If the memory 501, processor 502, and communication interface 503 are implemented independently, then the communication interface 503, memory 501, and processor 502 can be interconnected via a bus to communicate with one another. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, the bus is shown in Figure 5 as a single thick line, but this does not mean that there is only one bus or one type of bus.

[0106] Optionally, in a specific implementation, if the memory 501, processor 502, and communication interface 503 are integrated on a single chip, then the memory 501, processor 502, and communication interface 503 can communicate with each other through an internal interface.

[0107] Processor 502 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.

[0108] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-described multimodal visual target retrieval method for open scenes.

[0109] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0110] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0111] Any process or method described in a flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or steps of a process. The scope of the preferred embodiments of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this application pertain.

[0112] The logic and/or steps represented in a flowchart or otherwise described herein can, for example, be considered an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.

[0113] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0114] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

[0115] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

[0116] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.

Claims

1. A multimodal visual target retrieval method for open scenes, characterized in that it includes the following steps:
generating multimodal representation data of a visual object, and projecting the multimodal representation data onto the compact latent space of the visual object to obtain compact representation data of the visual object;
constructing a hypergraph structure based on the compact representation data, and learning higher-order association representations of open categories and known categories through the hypergraph structure;
based on the higher-order association representations of the open categories and the known categories, generating embedding representation data of unknown categories through a preset typical representation memory module, and applying the embedding representation data of the unknown categories to visual target retrieval in an open scene using a model trained on preset known categories, to obtain visual objects of the unknown categories;
wherein the step of generating embedding representation data of unknown categories through the preset typical representation memory module, based on the higher-order association representations of the open categories and the known categories, includes: calculating an activation score between the compact representation data and each memory anchor in the preset typical representation memory module; and reconstructing the compact representation data according to each memory anchor and its activation score to obtain the embedding representation data of the unknown categories;
and wherein, before generating the embedding representation data of the unknown categories through the preset typical representation memory module, the method further includes: constructing a typical representation memory network for the visual object; and training and updating the typical representation memory network based on a preset memory reconstruction loss function to obtain the preset typical representation memory module.

2. The method according to claim 1, characterized in that generating the multimodal representation data of the visual object includes: configuring the acquisition environment for the multimodal representation data; and, based on the acquisition environment, outputting multiple modal representations of the visual object through a preset capture device and extracting the basic features of the multiple modal representations to obtain the multimodal representation data.

3. The method according to claim 1, characterized in that the step of projecting the multimodal representation data onto the compact latent space of the visual object to obtain the compact representation data of the visual object includes: constructing an autoencoder for the multimodal representation data; training the autoencoder for the multimodal representation data based on a preset loss function to obtain a multimodal autoencoder; and projecting the multimodal representation data onto the compact latent space of the visual object through the multimodal autoencoder to obtain the compact representation data of the visual object.

4. The method according to claim 1, characterized in that the step of constructing a hypergraph structure based on the compact representation data and learning higher-order association representations of open categories and known categories through the hypergraph structure includes: constructing the hypergraph structure from the compact representation data using a preset K-nearest neighbor algorithm; and learning the node features of the hypergraph structure based on a preset hypergraph convolution iteration formula to obtain the higher-order association representations of the open categories and the known categories.

5. A multimodal visual target retrieval system for open scenarios, characterized in that it includes:
a generation module, used to generate multimodal representation data of a visual object and project the multimodal representation data onto the compact latent space of the visual object to obtain compact representation data of the visual object;
an association module, used to construct a hypergraph structure based on the compact representation data and learn higher-order association representations of open categories and known categories through the hypergraph structure;
a retrieval module, used to generate embedding representation data of unknown categories, based on the higher-order association representations of the open categories and the known categories, through a preset typical representation memory module, and to apply the embedding representation data of the unknown categories to visual target retrieval in open scenes using a model trained on preset known categories, so as to obtain visual objects of the unknown categories;
wherein the retrieval module is specifically used for: calculating an activation score between the compact representation data and each memory anchor in the preset typical representation memory module; and reconstructing the compact representation data based on each memory anchor and its activation score to obtain the embedding representation data of the unknown categories;
and wherein, before generating the embedding representation data of the unknown categories through the preset typical representation memory module, the retrieval module is further configured to: construct a typical representation memory network for the visual object; and train and update the typical representation memory network based on a preset memory reconstruction loss function to obtain the preset typical representation memory module.

6. The system according to claim 5, characterized in that the generation module is specifically used for: configuring the acquisition environment for the multimodal representation data; and, based on the acquisition environment, outputting multiple modal representations of the visual object through a preset capture device and extracting the basic features of the multiple modal representations to obtain the multimodal representation data.

7. An electronic device, characterized in that it includes a memory and a processor, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the multimodal visual target retrieval method for open scenes as described in any one of claims 1-4.

8. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the multimodal visual target retrieval method for open scenes as described in any one of claims 1-4.