Cross-modal data association retrieval method, device, equipment, storage medium and product

By acquiring multimodal data, extracting feature vectors, and performing entity recognition and ontology alignment, a cross-modal data association retrieval knowledge graph is constructed. This solves the problems of low cross-modal matching accuracy and lack of contextual relevance in retrieval results, and achieves higher-precision cross-modal data association retrieval.

CN122240903A · Pending · Publication Date: 2026-06-19 · INDUSTRIAL AND COMMERCIAL BANK OF CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: INDUSTRIAL AND COMMERCIAL BANK OF CHINA
Filing Date: 2026-02-04
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing cross-modal retrieval technologies suffer from low cross-modal matching accuracy and a lack of contextual relevance in retrieval results. This is mainly because multimodal data is heterogeneous and each modality is processed with a different feature extraction method, and because existing approaches match only surface features and ignore the semantic relationships behind the data.

Method used

By acquiring multimodal data, extracting multimodal feature vectors, performing entity recognition, ontology alignment, and similarity calculation, a cross-modal data association retrieval knowledge graph is constructed to determine the relationships between entities.

Benefits of technology

It improves the semantic association accuracy of cross-modal data, avoids matching failures caused by modal differences, and enhances the completeness and relevance of search results.

✦ Generated by Eureka AI based on patent content.

Figure CN122240903A_ABST

Abstract

This application provides a method, apparatus, device, storage medium, and product for cross-modal data association retrieval, relating to the fintech field or other related fields. The method includes: acquiring multimodal data; extracting multimodal feature vectors from the multimodal data and performing entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data; performing ontology alignment processing and similarity calculation on the multiple entities respectively to determine the association relationships between the entities; constructing a cross-modal data association retrieval knowledge graph based on the multiple entities and the association relationships between entities; and performing cross-modal data association retrieval based on the cross-modal data association retrieval knowledge graph. The method of this application improves the completeness and relevance of the retrieval results.

Description

Technical Field

[0001] This application relates to the field of financial technology or other related fields, and in particular to a method, apparatus, device, storage medium and product for cross-modal data association retrieval.

Background Technology

[0002] With the development of internet technology, multimodal data, such as text, image, video, and audio data, has become a core carrier for information dissemination and knowledge acquisition. Cross-modal data is obtained by learning the deep relationships between multimodal data and is widely used in scenarios such as multimedia search engines and intelligent recommendation systems.

[0003] Currently, cross-modal retrieval technologies mainly rely on feature matching and semantic analysis. For example, deep learning-based methods extract visual features from images through convolutional neural networks, or extract semantic vectors from text through language understanding models, and then calculate the similarity between features of different modalities to achieve matching.

[0004] However, while such methods can handle some cross-modal associations, the heterogeneity of multimodal data and the differences in feature extraction methods lead to difficulties in association and low cross-modal matching accuracy. Furthermore, existing technologies only focus on surface feature matching and ignore the semantic relationships behind the data, resulting in a lack of contextual relevance in the retrieval results.

Summary of the Invention

[0005] This application provides a cross-modal data association retrieval method, apparatus, device, storage medium, and product to solve the technical problems of low cross-modal matching accuracy and lack of contextual relevance in retrieval results in the prior art.

[0006] In a first aspect, this application provides a cross-modal data association retrieval method, including:

[0007] acquiring multimodal data;

[0008] extracting multimodal feature vectors from the multimodal data, and performing entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data;

[0009] performing ontology alignment processing and similarity calculation on the multiple entities respectively to determine the association relationships between the entities;

[0010] constructing a cross-modal data association retrieval knowledge graph based on the multiple entities and the association relationships between the entities, and performing cross-modal data association retrieval based on the cross-modal data association retrieval knowledge graph.

[0011] In a second aspect, this application provides a cross-modal data association retrieval device, comprising:

[0012] an acquisition module, used to acquire multimodal data;

[0013] a determination module, used to extract multimodal feature vectors from the multimodal data and perform entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data;

[0014] the determination module being further used to perform ontology alignment processing and similarity calculation on the multiple entities respectively, and to determine the association relationships between the entities;

[0015] a processing module, used to construct a cross-modal data association retrieval knowledge graph based on the multiple entities and the association relationships between the entities.

[0016] In a third aspect, embodiments of this application provide an electronic device, including: a memory and a processor;

[0017] The memory stores computer-executable instructions;

[0018] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of the first aspect and/or any of the possible implementations of the first aspect described above.

[0019] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of the first aspect and/or any of the possible implementations of the first aspect.

[0020] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the method of the first aspect and/or any of the possible implementations of the first aspect.

[0021] This application provides a cross-modal data association retrieval method, apparatus, device, storage medium, and product. Multimodal data is acquired, multimodal feature vectors are extracted from it, and entity recognition is performed on the multimodal feature vectors to identify multiple entities in the multimodal data. Ontology alignment processing and similarity calculation are then performed on each entity to determine the relationships between entities. Based on the multiple entities and their relationships, a cross-modal data association retrieval knowledge graph is constructed, and cross-modal data association retrieval is performed based on this knowledge graph. This addresses the problem that existing technologies focus only on surface feature matching and ignore the semantic relationships behind the data, so that retrieval results lack contextual relevance. By determining the relationships between entities through ontology alignment processing and similarity calculation, the semantic association accuracy of cross-modal data is improved, avoiding matching failures caused by modal differences in traditional methods. This provides a data foundation for knowledge graph-based cross-modal retrieval and reasoning, enhancing the completeness and relevance of retrieval results.

Brief Description of the Drawings

[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0023] Figure 1 is a schematic diagram of an application scenario provided by an embodiment of this application;

[0024] Figure 2 is a first flowchart of a cross-modal data association retrieval method provided by an embodiment of this application;

[0025] Figure 3 is a second flowchart of a cross-modal data association retrieval method provided by an embodiment of this application;

[0026] Figure 4 is a schematic diagram of the structure of a cross-modal data association retrieval device provided by an embodiment of this application;

[0027] Figure 5 is a schematic diagram of the structure of an electronic device provided by an embodiment of this application.

[0028] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Description

[0029] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0030] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, storage, use, processing, transmission, provision, disclosure, and application of the relevant data all comply with the relevant laws, regulations, and standards of the relevant countries and regions, have taken necessary confidentiality measures, do not violate public order and good morals, and provide corresponding operation access points for users to choose to authorize or refuse.

[0031] Furthermore, the technical solution involved in this application, which involves big data analysis of user information (including but not limited to personal biometrics, identity data, consumption data, asset data, electronic terminal operation data, etc.) and the use of artificial intelligence technology for automated decision-making, and makes decisions that have a significant impact on personal rights based on the results of automated decision-making, provides users with corresponding operation entry points for users to choose to agree to or reject the results of automated decision-making; if the user chooses to reject, the process will proceed to the expert decision-making process.

[0032] It should be noted that the cross-modal data association retrieval method, apparatus, device, storage medium and product provided in this application can be used in the fintech field, or in any field other than fintech. The application fields of the cross-modal data association retrieval method, apparatus, device, storage medium and product in this application are not limited.

[0033] With the development of internet technology, multimodal data, such as text, image, video, and audio data, has become a core carrier for information dissemination and knowledge acquisition. Cross-modal data is obtained by learning the deep relationships between multimodal data and is widely used in scenarios such as multimedia search engines and intelligent recommendation systems.

[0034] Taking a multimedia search engine as an example, Figure 1 is a schematic diagram of an application scenario provided by an embodiment of this application. As shown in Figure 1, when a user enters the search term "apple" into the multimedia search engine, the engine performs feature matching and semantic analysis on the user's input and returns search results, such as image data of apples and category recommendations.

[0035] However, while such methods can handle some cross-modal associations, the heterogeneity of multimodal data and the differences in feature extraction methods lead to difficulties in association and low cross-modal matching accuracy. Furthermore, existing technologies only focus on surface feature matching and ignore the semantic relationships behind the data, resulting in a lack of contextual relevance in the retrieval results. For example, they may not associate the results with deeper content such as "apple falling" or "universal gravity".

[0036] This application provides a cross-modal data association retrieval method, which constructs a knowledge graph for cross-modal data association retrieval by performing ontology alignment and similarity calculation on multiple entities in multimodal data, thereby realizing cross-modal data association retrieval and solving the above-mentioned technical problems of the prior art.

[0037] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0038] Figure 2 is a first flowchart of a cross-modal data association retrieval method provided by an embodiment of this application. As shown in Figure 2, the method includes:

[0039] S201. Obtain multimodal data.

[0040] Multimodal data refers to heterogeneous data sets in different forms, including but not limited to text data, image data, and video data.

[0041] In this step, multimodal data, such as text data, image data, and video data, are obtained from public data sources. Public data sources can be, for example, the Internet or databases.

[0042] Preferably, before extracting multimodal feature vectors from multimodal data, the multimodal data can be preprocessed, including but not limited to data cleaning and standardization.

[0043] For example, blurry images and duplicate text are removed as noise data, image data is converted to a uniform format such as JPG, and text data is converted to a uniform encoding such as UTF-8.
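The text side of this preprocessing can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: the function names `normalize_text` and `deduplicate`, the assumed source encoding `latin-1`, and the sample records are all hypothetical. Image cleaning and format conversion are omitted because they would need an imaging library.

```python
import unicodedata


def normalize_text(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Decode bytes from an assumed source encoding and normalize the text."""
    text = raw.decode(source_encoding, errors="replace")
    # NFC normalization so visually identical strings compare equal.
    return unicodedata.normalize("NFC", text).strip()


def deduplicate(texts):
    """Drop empty strings and exact duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


# Hypothetical raw records: one duplicate and one whitespace-only entry.
raw_records = [
    "apple falling near the café".encode("latin-1"),
    "apple falling near the café".encode("latin-1"),
    "universal gravitation".encode("latin-1"),
    b"   ",
]

cleaned = deduplicate(normalize_text(r) for r in raw_records)
# Re-encode every surviving record in the uniform target encoding (UTF-8).
utf8_records = [t.encode("utf-8") for t in cleaned]
```

Only exact duplicates are removed here; near-duplicate detection would need an additional similarity step that the text does not specify.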

[0044] S202. Extract multimodal feature vectors from multimodal data and perform entity recognition on the multimodal feature vectors to identify multiple entities in the multimodal data.

[0045] Multimodal feature vectors are numerical representations that characterize the semantic or visual features of multimodal data; they can be extracted from the multimodal data using feature extraction models.

[0046] Entity recognition refers to identifying entities from multimodal data. An entity is a thing or concept that can be defined independently, such as a light, a sensor, or the net asset value of a fund.

[0047] In this step, for each single-modal data in the multimodal data, features are extracted from the single-modal data using the feature extraction model corresponding to that single-modal data to obtain a feature vector, and multiple entities in the multimodal data are identified based on the multimodal feature vector.

[0048] For example, the entity "apple falling" can be identified from image data of an apple falling, and the entity "gravity" can be identified from text data.

[0049] In one possible implementation, the multimodal data includes text data, image data, and video data, and the step of extracting multimodal feature vectors from the multimodal data and performing entity recognition on these feature vectors to identify multiple entities within the multimodal data specifically includes:

[0050] Multimodal feature vectors are extracted from the text data, the image data, and the keyframes and audio of the video data, respectively. Entity recognition is performed on the multimodal feature vectors, and the entities identified from the keyframes and audio are associated with the timeline information of the video data to obtain multiple entities in the multimodal data.

[0051] Keyframes and audio data are first extracted from the video data. Semantic features of the text data are extracted using a language understanding model, visual features of the image data and of the video keyframes are extracted using a convolutional neural network, and Mel-spectrum features of the audio data are extracted through spectral analysis. Text entities are identified using feature encoding and global decoding, and objects are detected in the visual features using an object detection model; optical character recognition can also be used to extract text entities from images. For video data, entities can be determined by combining the image features of the keyframes with the audio features, or by using automatic speech recognition to recognize the text in the video and audio data. The entities identified from the keyframes and the audio are then associated through the timeline information of the video data, yielding the multiple entities in the multimodal data.

[0052] For example, for an image of an apple falling to the ground, visual features are extracted using a convolutional neural network, and entities such as apples and trees are identified in the image using an object detection model.

[0053] For example, the entity "tiger roar" can be identified by using audio features and tiger images in keyframes, and the timeline information of the video data can be associated with it, such as a scene where the tiger roars at the 10th second.
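The timeline association in the "tiger roar" example can be sketched as below. The text does not say how keyframe and audio detections are matched in time, so the pairing rule (timestamps within a tolerance) is an assumption, as are the `Detection` class, the `associate_by_timeline` function, the `tolerance_s` parameter, and the sample timestamps.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str      # entity label from a detector (hypothetical output)
    modality: str   # "keyframe" or "audio"
    time_s: float   # position on the video timeline, in seconds


def associate_by_timeline(keyframe_dets, audio_dets, tolerance_s=1.0):
    """Pair keyframe and audio detections whose timestamps lie within
    tolerance_s of each other (assumed matching rule)."""
    fused = []
    for kf in keyframe_dets:
        for au in audio_dets:
            if abs(kf.time_s - au.time_s) <= tolerance_s:
                fused.append({
                    "entity": f"{kf.label} {au.label}",
                    "time_s": min(kf.time_s, au.time_s),
                    "sources": (kf.modality, au.modality),
                })
    return fused


keyframes = [Detection("tiger", "keyframe", 10.0),
             Detection("tree", "keyframe", 25.0)]
audio = [Detection("roar", "audio", 10.4)]

# Only the tiger keyframe is close enough in time to the roar.
entities = associate_by_timeline(keyframes, audio)
```

The fused entity keeps its timestamp, so a later retrieval step could point back to the corresponding moment in the video.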

[0054] Through the collaborative processing of multimodal feature extraction models, a unified semantic representation of cross-modal data is achieved, providing an accurate semantic foundation for entity recognition and association establishment, enhancing the matching capability of cross-modal data, and avoiding the semantic gap caused by the modal differences of data in existing technologies.

[0055] S203. Perform ontology alignment and similarity calculation on multiple entities to determine the relationships between entities.

[0056] Ontology alignment refers to mapping entities of different modalities into the same semantic space through a unified ontology framework and establishing associations between them. The framework defines preset relationship patterns between ontologies, for example: scientific phenomenon → association → scientific theory; or event → trigger → actuator → change → state. For instance, a user's command to turn on a light triggers the light, which changes to the on state.

[0057] Similarity calculation refers to calculating the similarity score between the feature vectors of entities. For example, similarity calculation can be achieved by calculating cosine similarity, Euclidean distance, etc.
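The two measures named above can be written out as a small sketch. The function names are illustrative. Euclidean distance is a distance rather than a similarity (smaller means closer), so comparing it against a "reach the threshold" rule requires a conversion; the `1 / (1 + d)` mapping used here is an assumption, not something the text specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Assumed conversion: maps distance 0 to similarity 1, decreasing with d."""
    return 1.0 / (1.0 + d)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
dist = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```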

[0058] In this step, for each entity, each entity is mapped to the corresponding ontology through a unified ontology framework, and the similarity between entities is calculated to determine the association between entities.

[0059] For example, the image entity "apple falling" is mapped to the ontology "scientific phenomenon", the text entity "universal gravitation" is mapped to the ontology "scientific theory", and the image entity "on state" is mapped to the ontology "state".

[0060] For example, the voice entity "user turn on the light command" is mapped to the ontology "event", and the image entity "light" is mapped to the ontology "actuator".

[0061] In one possible implementation, the above step of performing ontology alignment and similarity calculation on multiple entities to determine the relationships between entities specifically includes:

[0062] Using a unified ontology framework, multiple entities are mapped to corresponding ontologies; for each entity, a first similarity score is calculated between the entity and other entities; among the multiple first similarity scores, the other entities corresponding to the first similarity scores that reach a first threshold are taken as the related entities of that entity; the preset relationship between the entity's ontology and the ontology of related entities is taken as the association relationship between the entity and related entities; based on the association relationship between each entity and related entities, the association relationship between entities is determined.

[0063] The multiple entities are classified according to the ontologies in the unified ontology framework and mapped to their corresponding ontologies. For each entity, a first similarity score, such as cosine similarity or Euclidean distance, is calculated between that entity and each of the other entities. The other entities whose first similarity scores reach a first threshold are taken as the related entities of that entity. The preset relationship between the entity's ontology and the ontology of each related entity is then used as the association relationship between the entity and that related entity, thus determining the relationships between entities.

[0064] For example: By calculating similarity, it is determined that the falling apple is a related entity of gravity. Therefore, according to the unified ontology framework of "scientific phenomenon → association → scientific theory", the corresponding relationship between the entities falling apple and gravity is determined as: falling apple → association → gravity.

[0065] For example, under the unified ontology framework pattern "event → trigger → actuator → change → state", the corresponding relationship is: user's command to turn on the light → trigger → light → change → on state.

[0066] By using a unified ontology framework, the differences between different modalities are eliminated, and the relationships between entities can be established quickly and accurately.
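The relationship-determination step can be sketched as follows, assuming cosine similarity as the first similarity score. The toy feature vectors, the value of `first_threshold`, and the names `ENTITY_ONTOLOGY`, `PRESET_RELATIONS`, and `determine_relations` are hypothetical. The text does not say what happens when no preset relationship exists for a pair of ontologies; this sketch simply skips such pairs.

```python
import math

# Assumed result of the ontology mapping step.
ENTITY_ONTOLOGY = {
    "apple falling": "scientific phenomenon",
    "universal gravitation": "scientific theory",
    "light": "actuator",
}

# Preset relationships between ontologies, taken from the pattern
# "scientific phenomenon -> association -> scientific theory".
PRESET_RELATIONS = {
    ("scientific phenomenon", "scientific theory"): "association",
}

# Toy feature vectors (illustrative values only).
FEATURES = {
    "apple falling": [0.9, 0.1, 0.0],
    "universal gravitation": [0.8, 0.2, 0.0],
    "light": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def determine_relations(features, ontology, preset, first_threshold=0.8):
    """Return (entity, relation, related_entity, first_similarity_score) tuples."""
    relations = []
    for entity, vec in features.items():
        for other, other_vec in features.items():
            if other == entity:
                continue
            score = cosine(vec, other_vec)
            if score < first_threshold:
                continue  # not a related entity
            relation = preset.get((ontology[entity], ontology[other]))
            if relation is not None:  # skip pairs with no preset relationship
                relations.append((entity, relation, other, score))
    return relations


relations = determine_relations(FEATURES, ENTITY_ONTOLOGY, PRESET_RELATIONS)
```

With these toy values the only relationship found is apple falling → association → universal gravitation, matching the example in the text.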

[0067] S204. Based on multiple entities and the relationships between entities, construct a cross-modal data association retrieval knowledge graph, and perform cross-modal data association retrieval based on the cross-modal data association retrieval knowledge graph.

[0068] The association relationship indicates the connection between entities.

[0069] Cross-modal data association retrieval refers to retrieval whose results include both the query content and related information, and whose results are not limited to a single modality of data. For example, when entering the text query "universal gravitation," one can retrieve not only related text about universal gravitation, but also related images of apples falling to the ground, or keyframes from videos of apples falling to the ground.

[0070] In this step, a cross-modal data association retrieval knowledge graph is constructed based on the extracted multiple entities and the relationships between them. Based on the cross-modal data association retrieval knowledge graph, cross-modal data association retrieval is performed to obtain cross-modal query content and information related to the query content.

[0071] In one possible implementation, the step of constructing a cross-modal data association retrieval knowledge graph based on multiple entities and the relationships between them specifically includes:

[0072] For each entity, the association weight between the entity and related entities is determined based on the first similarity score between the entity and related entities; triples are determined based on the entity, related entities of the entity, and associations, and the associations are labeled by association weights; and a cross-modal data association retrieval knowledge graph is constructed based on the triples of multiple entities.

[0073] For each entity, a first similarity score is calculated between the entity and its related entities. This first similarity score is used to label the association between the entity and its related entities, thus obtaining the association weight. Triples (entity, association, related entity) are constructed using the entity, its related entities, and the association, where the association is quantified by the association weight. A cross-modal data association retrieval knowledge graph is then constructed using triples from multiple entities.

[0074] By identifying the relationships between entities and constructing triples, a cross-modal data association retrieval knowledge graph is built, providing a structured knowledge foundation for subsequent cross-modal retrieval.
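A minimal sketch of the triple-based graph construction, assuming an adjacency-list representation. The `build_knowledge_graph` function, the dictionary layout, and the similarity values in `relations` are illustrative assumptions; the association weight is set equal to the first similarity score, as described above.

```python
from collections import defaultdict


def build_knowledge_graph(relations):
    """Build an adjacency list from (entity, relation, related_entity, score)
    tuples. Each edge stores the relation label and an association weight
    equal to the first similarity score."""
    graph = defaultdict(list)
    for head, relation, tail, score in relations:
        graph[head].append({"relation": relation, "tail": tail, "weight": score})
        graph.setdefault(tail, [])  # make sure every entity has a node
    return dict(graph)


# Illustrative triples with made-up similarity scores.
relations = [
    ("apple falling", "association", "universal gravitation", 0.99),
    ("user turn-on-light command", "trigger", "light", 0.92),
    ("light", "change", "on state", 0.88),
]

kg = build_knowledge_graph(relations)
```

Storing the weight on the edge keeps it available for the matching-weight lookup used later during retrieval.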

[0075] This embodiment provides a cross-modal data association retrieval method. It acquires multimodal data, extracts multimodal feature vectors from the multimodal data, performs entity recognition on the multimodal feature vectors to identify multiple entities in the multimodal data, performs ontology alignment and similarity calculation on each entity to determine the relationships between entities, constructs a cross-modal data association retrieval knowledge graph based on the multiple entities and their relationships, and performs cross-modal data association retrieval based on this knowledge graph. This method solves the problem in existing technologies where retrieval results lack contextual relevance, focusing only on surface feature matching and ignoring the semantic relationships behind the data. By determining the relationships between entities through ontology alignment and similarity calculation, it improves the semantic association accuracy of cross-modal data, avoids matching failures caused by modality differences in traditional methods, provides a data foundation for knowledge graph-based cross-modal retrieval and reasoning, and enhances the completeness and relevance of retrieval results.

[0076] Figure 3 is a second flowchart of a cross-modal data association retrieval method provided by an embodiment of this application. Building on the embodiment of Figure 2, this embodiment describes in detail how cross-modal data association retrieval is performed based on the cross-modal data association retrieval knowledge graph. As shown in Figure 3, the method includes:

[0077] S301. In response to the data query command sent by the user, extract the query feature vector of the data query command.

[0078] A data query command is a request command used to query data. It can be used to query unimodal data or multimodal data.

[0079] A query feature vector is a numerical representation of a data query command.

[0080] After receiving a data query command from a user, the system extracts the query feature vector of the command based on the query content it contains.

[0081] S302. Determine the query node and its adjacent nodes from the cross-modal data association retrieval knowledge graph.

[0082] In this step, N query nodes are arbitrarily selected from the cross-modal data association retrieval knowledge graph, along with their adjacent nodes. Here, N is a positive integer.

[0083] By identifying the query node and its adjacent nodes, the semantic boundaries of the user query are expanded.

[0084] S303. Calculate the second similarity score between the entities stored in the query node and adjacent nodes and the query feature vector.

[0085] The second similarity score is used to quantify the similarity between the entity and the query feature vector.

[0086] In this step, for each query node and its neighboring nodes, a second similarity score is calculated between the feature vector of the entity stored in the node and the query feature vector. This second similarity score can be, for example, cosine similarity or Euclidean distance.

[0087] S304. Determine whether there is a second similarity score that reaches the second threshold; if so, proceed to step S305; if not, proceed to step S306.

[0088] Determine if there exists a second similarity score that reaches the second threshold; if there is a second similarity score that reaches the second threshold, it means that an entity with high similarity to the query feature vector has been found.

[0089] The node corresponding to the highest second similarity score among the second similarity scores that reach the second threshold is identified as the target node. Based on the knowledge graph, nodes that are associated with the target node are selected as candidate nodes. The matching weight between the entity stored in the target node and the query feature vector is set to 1. The association weight between the candidate node and the target node in the knowledge graph is used as the matching weight between the entity stored in the candidate node and the query feature vector.

[0090] The matching score between the entity and the query feature vector is calculated using the following formula:

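[0091] (The formula appears as an image in the original publication and is not reproduced in this text.)

[0092] In the formula, the terms denote, respectively, the matching score, the predefined matching coefficient, the second similarity score, and the matching weight.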

[0093] Based on the matching score, the entities in the query node and its adjacent nodes are sorted, and the multimodal retrieval results of the query feature vector are obtained according to the sorting results. The matching score is calculated by association weight and sorted based on the matching score, which improves the retrieval quality of the multimodal retrieval results.

[0094] If no second similarity score reaches the second threshold, it means that there is no entity with high similarity to the query feature vector. The node with the highest second similarity score is taken as the new query node, and the adjacent nodes of the new query node are re-determined until the multimodal retrieval result of the query feature vector is obtained.

[0095] For example, when a user queries "user turn on light command", the knowledge graph is used for association retrieval to obtain "user turn on light command → trigger → light → change → on status".

[0096] For example, when a user searches for "apple", they can not only find images of apples, but also deeper information related to "gravity".

[0097] S305. Based on the second similarity score and the association weight stored in the knowledge graph, calculate the matching score between the entities in the query node and adjacent nodes and the query feature vector, and sort the entities in the query node and adjacent nodes based on the matching score to obtain the multimodal retrieval result of the query feature vector.

[0098] S306. Take the node with the highest second similarity score as the new query node, and redetermine the adjacent nodes of the new query node until the multimodal retrieval result of the query feature vector is obtained.
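Steps S301 to S306 can be sketched as the loop below. Several parts are assumptions rather than the applicant's method: the matching-score formula is not available in this text, so a simple product `coeff * similarity * weight` stands in for it; a single start node is used instead of N query nodes; nodes not directly associated with the target node get matching weight 0; and a `visited` set plus `max_hops` are added so the loop terminates. All names and toy values are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, start, features, edges,
             second_threshold=0.8, coeff=1.0, max_hops=10):
    """edges: {node: {neighbor: association_weight}} (treated as symmetric)."""
    query_node = start
    visited = set()
    for _ in range(max_hops):
        visited.add(query_node)
        # S302: the query node and its adjacent nodes.
        scope = [query_node] + list(edges.get(query_node, {}))
        # S303: second similarity score against the query feature vector.
        sims = {n: cosine(features[n], query_vec) for n in scope}
        best = max(sims, key=sims.get)
        # S304: does any score reach the second threshold?
        if sims[best] >= second_threshold:
            # S305: target node gets weight 1; others use their association
            # weight to the target (0.0 if not directly associated: assumption).
            weights = {n: 1.0 if n == best else edges.get(best, {}).get(n, 0.0)
                       for n in scope}
            # Assumed stand-in for the matching-score formula.
            scores = {n: coeff * sims[n] * weights[n] for n in scope}
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        # S306: move to the most similar node not yet used as a query node.
        unvisited = [n for n in scope if n not in visited]
        if not unvisited:
            break
        query_node = max(unvisited, key=sims.get)
    return []


features = {
    "apple image": [1.0, 0.0],
    "apple falling": [0.7, 0.7],
    "universal gravitation": [0.0, 1.0],
}
edges = {
    "apple image": {"apple falling": 0.9},
    "apple falling": {"apple image": 0.9, "universal gravitation": 0.95},
    "universal gravitation": {"apple falling": 0.95},
}

# A query close to "universal gravitation", starting from an unrelated node.
results = retrieve([0.1, 1.0], "apple image", features, edges)
ranked_entities = [name for name, _ in results]
```

In this toy run the first hop finds no score above the threshold, so the search moves to "apple falling"; on the second hop "universal gravitation" becomes the target node and the associated "apple falling" entity is ranked right after it.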

[0099] This application provides a cross-modal data association retrieval method that, in response to a user's data query command, extracts the query feature vector of the data query command, determines the query node and its adjacent nodes from the cross-modal data association retrieval knowledge graph, calculates a second similarity score between the entities stored in the query node and its adjacent nodes and the query feature vector, and determines whether any of the second similarity scores reach a second threshold. If so, based on the second similarity score and the association weights stored in the knowledge graph, it calculates a matching score between the entities in the query node and its adjacent nodes and the query feature vector, and sorts the entities in the query node and its adjacent nodes based on the matching score to obtain a multimodal retrieval result for the query feature vector. If not, it takes the node with the highest second similarity score as the new query node and re-determines the adjacent nodes of the new query node, until a multimodal retrieval result for the query feature vector is obtained. After the user inputs a query command, the reasoning capability of the knowledge graph allows not only the query content to be retrieved but also related deep content, improving the completeness and relevance of the retrieval results.

[0100] Figure 4 is a schematic diagram of the structure of a cross-modal data association retrieval device provided in an embodiment of this application. As shown in Figure 4, the cross-modal data association retrieval device 40 provided in this embodiment includes: an acquisition module 401, a determination module 402, and a processing module 403;

[0101] The acquisition module 401 is used to acquire multimodal data;

[0102] The determination module 402 is used to extract multimodal feature vectors from multimodal data and perform entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data.

[0103] The determination module 402 is also used to perform ontology alignment and similarity calculation on multiple entities respectively, and to determine the association relationship between entities;

[0104] Processing module 403 is used to construct a knowledge graph for cross-modal data association retrieval based on multiple entities and the relationships between entities.

[0105] In one possible implementation, the determining module 402 is further configured to map multiple entities to corresponding ontologies through a unified ontology framework; calculate a first similarity score between the entity and other entities for each entity; identify other entities corresponding to the first similarity scores that reach a first threshold among the multiple first similarity scores as related entities of the entity; use the preset relationship between the entity's ontology and the ontology of related entities as the association relationship between the entity and related entities; and determine the association relationship between entities based on the association relationship between each entity and related entities.
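As a rough illustration of the alignment and thresholding described above, the following sketch maps each entity to an ontology class, scores every pair, and keeps the pairs that reach the first threshold. All names and data here (`ENTITIES`, `PRESET_RELATIONS`, `find_associations`) are hypothetical. Cosine similarity is an assumed choice for the first similarity score, and the `"related_to"` fallback for ontology pairs without a preset relationship is also an assumption, since the text does not say what happens in that case.

```python
import math

# Hypothetical entities: name -> (ontology class, feature vector).
ENTITIES = {
    "apple_photo": ("Fruit", [1.0, 0.0]),
    "apple_text": ("Fruit", [0.9, 0.1]),
    "newton_clip": ("Person", [0.6, 0.8]),
    "stock_chart": ("Chart", [0.0, 1.0]),
}

# Hypothetical preset relationships between ontology classes.
PRESET_RELATIONS = {
    ("Fruit", "Fruit"): "same_concept",
    ("Fruit", "Person"): "associated_with",
    ("Person", "Fruit"): "associated_with",
}

def cosine(a, b):
    """Cosine similarity, used here as the assumed first similarity score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_associations(entities, relations, first_threshold):
    """Return {entity: [(related_entity, relation, first_similarity_score), ...]}."""
    result = {}
    for name, (onto, vec) in entities.items():
        related = []
        for other, (other_onto, other_vec) in entities.items():
            if other == name:
                continue
            score = cosine(vec, other_vec)
            if score >= first_threshold:
                # Assumption: fall back to a generic label when no preset relation exists.
                relation = relations.get((onto, other_onto), "related_to")
                related.append((other, relation, score))
        result[name] = related
    return result

associations = find_associations(ENTITIES, PRESET_RELATIONS, first_threshold=0.65)
for entity, related in associations.items():
    print(entity, [(other, rel, round(score, 3)) for other, rel, score in related])
```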

[0106] In one possible implementation, the determining module 402 is further configured to, for each entity, determine the association weight of the association relationship between the entity and related entities based on the first similarity score between the entity and related entities; determine triples based on the entity, related entities of the entity, and association relationships, and label the association relationships through association weights; and construct a cross-modal data association retrieval knowledge graph based on the triples of multiple entities.
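The triple construction above might look like the following sketch. It assumes, purely for illustration, that the association weight is the first similarity score itself (the text only says the weight is determined "based on" that score), and it stores the graph as a nested dictionary. The function names `build_triples` and `build_graph` and the sample pairs are hypothetical.

```python
def build_triples(scored_pairs):
    """scored_pairs: iterable of (entity, relation, related_entity, first_similarity_score).

    Returns (head, relation, tail, weight) tuples, i.e. triples labeled with a weight.
    """
    triples = []
    for head, relation, tail, score in scored_pairs:
        weight = round(score, 4)  # assumption: weight equals the first similarity score
        triples.append((head, relation, tail, weight))
    return triples

def build_graph(triples):
    """Build {head: {tail: {"relation": ..., "weight": ...}}} from weighted triples."""
    graph = {}
    for head, relation, tail, weight in triples:
        graph.setdefault(head, {})[tail] = {"relation": relation, "weight": weight}
        graph.setdefault(tail, {})  # make sure every entity appears as a node
    return graph

# Hypothetical scored pairs, e.g. the output of a thresholding step.
pairs = [
    ("apple_photo", "same_concept", "apple_text", 0.99),
    ("apple_text", "associated_with", "newton_clip", 0.68),
]
knowledge_graph = build_graph(build_triples(pairs))
print(knowledge_graph["apple_photo"])
```

Keeping the weight on the edge rather than on the node lets a later retrieval step read it directly when scoring the neighbors of a query node.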

[0107] In one possible implementation, the acquisition module 401 is further configured to extract multimodal feature vectors from the text data, the image data, and the keyframes and audio of the video data, respectively; perform entity recognition on the multimodal feature vectors; and associate the entities identified from the keyframes and audio with the timeline information of the video data to obtain the multiple entities of the multimodal data.
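One way to picture the timeline association for video data is the sketch below. It assumes that keyframe recognition yields (timestamp, entity) pairs and audio recognition yields (start, end, entity) segments. The `TimedEntity` record, the one-second keyframe span, and both function names are hypothetical choices for illustration, not details taken from this application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedEntity:
    name: str
    source: str     # "keyframe" or "audio"
    start_s: float  # start of the interval on the video timeline, in seconds
    end_s: float    # end of the interval (exclusive), in seconds

def attach_timeline(keyframe_hits, audio_hits, frame_span_s=1.0):
    """keyframe_hits: [(timestamp_s, name)]; audio_hits: [(start_s, end_s, name)].

    Assumption: each keyframe entity is treated as valid for frame_span_s seconds.
    """
    entities = [TimedEntity(name, "keyframe", t, t + frame_span_s)
                for t, name in keyframe_hits]
    entities += [TimedEntity(name, "audio", start, end)
                 for start, end, name in audio_hits]
    return sorted(entities, key=lambda e: (e.start_s, e.name))

def entities_at(entities, t):
    """Names of all entities whose interval covers time t (seconds)."""
    return sorted({e.name for e in entities if e.start_s <= t < e.end_s})

timeline = attach_timeline(
    keyframe_hits=[(12.0, "apple"), (30.0, "newton")],
    audio_hits=[(11.5, 14.0, "gravity")],
)
print(entities_at(timeline, 12.5))
```

Entities that share a time interval, such as a keyframe object and a spoken term, can then be treated as candidates for association within the same video.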

[0108] In one possible implementation, the processing module 403 is further configured to, in response to a data query instruction sent by a user, extract the query feature vector of the data query instruction, and, based on the cross-modal data association retrieval knowledge graph, perform cross-modal feature association retrieval on the query feature vector to obtain the multimodal retrieval result of the query feature vector.

[0109] In one possible implementation, the processing module 403 is further configured to determine the query node and its adjacent nodes from the cross-modal data association retrieval knowledge graph; based on the query node and its adjacent nodes, perform semantic association queries on the query feature vector, and continue to determine new query nodes and adjacent nodes until the multimodal retrieval result of the query feature vector is obtained.

[0110] In one possible implementation, the processing module 403 is further configured to calculate a second similarity score between the entities stored in the query node and its neighboring nodes and the query feature vector; determine whether there exists a second similarity score that reaches a second threshold; if so, calculate a matching score between the entities in the query node and its neighboring nodes and the query feature vector based on the second similarity score and the association weights stored in the knowledge graph, and sort the entities in the query node and its neighboring nodes based on the matching score to obtain a multimodal retrieval result for the query feature vector; if not, take the node with the highest second similarity score as the new query node, and redetermine the neighboring nodes of the new query node until a multimodal retrieval result for the query feature vector is obtained.

[0111] This embodiment provides a cross-modal data association retrieval device that can execute the method provided in the above-described method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.

[0112] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 5, the electronic device 50 provided in this embodiment includes at least one processor 501 and a memory 502. Optionally, the electronic device 50 further includes a communication component 503. The processor 501, memory 502, and communication component 503 are connected via a bus 504.

[0113] In a specific implementation, the at least one processor 501 executes the computer-executable instructions stored in the memory 502, causing the at least one processor 501 to perform the above-described method.

[0114] The specific implementation process of processor 501 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.

[0115] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.

[0116] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.

[0117] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.

[0118] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.

[0119] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the above-described method.

[0120] When integrated units / modules are implemented in hardware, the hardware can be digital circuits, analog circuits, etc. The physical implementation of the hardware structure includes, but is not limited to, transistors, memristors, etc. Unless otherwise specified, the processor can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit can be any suitable magnetic or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), etc.

[0121] If the integrated unit / module is implemented as a software program module and sold or used as an independent product, it can be stored in a computer-readable storage device. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application.

[0122] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily essential to this application.

[0123] It should be further noted that although the steps in the flowchart are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0124] It should be understood that the above-described device embodiments are merely illustrative, and the device of this application can also be implemented in other ways. For example, the division of units / modules in the above embodiments is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units, modules, or components may be combined, or integrated into another system, or some features may be ignored or not executed.

[0125] Furthermore, unless otherwise specified, the functional units / modules in the various embodiments of this application can be integrated into one unit / module, or each unit / module can exist physically separately, or two or more units / modules can be integrated together. The integrated units / modules described above can be implemented in hardware or as software program modules.

[0126] In the above embodiments, the descriptions of each embodiment have their own emphasis. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification.

[0127] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0128] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A cross-modal data association retrieval method, characterized in that it comprises: acquiring multimodal data; extracting multimodal feature vectors from the multimodal data, and performing entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data; performing ontology alignment processing and similarity calculation on the multiple entities respectively to determine the association relationships between the entities; constructing a cross-modal data association retrieval knowledge graph based on the multiple entities and the association relationships between the entities; and performing cross-modal data association retrieval based on the cross-modal data association retrieval knowledge graph.

2. The method according to claim 1, characterized in that performing ontology alignment processing and similarity calculation on the multiple entities respectively to determine the association relationships between the entities comprises: mapping the multiple entities to their corresponding ontologies through a unified ontology framework; for each entity, calculating first similarity scores between the entity and the other entities; among the multiple first similarity scores, taking the other entities whose first similarity scores reach a first threshold as related entities of the entity; taking the preset relationship between the ontology of the entity and the ontologies of the related entities as the association relationship between the entity and the related entities; and determining the association relationships between the entities based on the association relationship between each entity and its related entities.

3. The method according to claim 2, characterized in that, The construction of a cross-modal data association retrieval knowledge graph based on the multiple entities and the relationships between them includes: For each entity, the association weight of the association relationship between the entity and the related entity is determined based on the first similarity score between the entity and the related entity; Based on the entity, the entity's related entities, and the association relationships, triples are determined, and the association relationships are labeled using the association weights. Based on the triples of the multiple entities, a cross-modal data association retrieval knowledge graph is constructed.

4. The method according to claim 1, characterized in that the multimodal data includes text data, image data, and video data, and extracting multimodal feature vectors from the multimodal data and performing entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data comprises: extracting multimodal feature vectors from the text data, the image data, and the keyframes and audio of the video data, respectively; and performing entity recognition on the multimodal feature vectors, and associating the entities identified from the keyframes and audio with the timeline information of the video data, to obtain the multiple entities of the multimodal data.

5. The method according to any one of claims 1-4, characterized in that, The cross-modal data association retrieval based on the cross-modal data association retrieval knowledge graph includes: In response to a data query command sent by a user, the query feature vector of the data query command is extracted; Based on the cross-modal data association retrieval knowledge graph, cross-modal feature association retrieval is performed on the query feature vector to obtain the multimodal retrieval results of the query feature vector.

6. The method according to claim 5, characterized in that, The method of performing cross-modal feature association retrieval on the query feature vector based on the cross-modal data association retrieval knowledge graph to obtain the multimodal retrieval results of the query feature vector includes: From the cross-modal data association retrieval knowledge graph, determine the query node and the adjacent nodes of the query node; Based on the query node and adjacent nodes, a semantic association query is performed on the query feature vector, and new query nodes and adjacent nodes are determined until the multimodal retrieval result of the query feature vector is obtained.

7. The method according to claim 6, characterized in that performing semantic association queries on the query feature vector based on the query node and adjacent nodes, and continuing to determine new query nodes and adjacent nodes until the multimodal retrieval result of the query feature vector is obtained, comprises: calculating second similarity scores between the entities stored in the query node and the adjacent nodes and the query feature vector; determining whether any second similarity score reaches a second threshold; if so, calculating, based on the second similarity scores and the association weights stored in the knowledge graph, matching scores between the entities in the query node and the adjacent nodes and the query feature vector, and sorting the entities in the query node and the adjacent nodes based on the matching scores to obtain the multimodal retrieval result of the query feature vector; and if not, taking the node with the highest second similarity score as the new query node, and re-determining the adjacent nodes of the new query node, until the multimodal retrieval result of the query feature vector is obtained.

8. A cross-modal data association retrieval device, characterized in that it comprises: an acquisition module, configured to acquire multimodal data; a determination module, configured to extract multimodal feature vectors from the multimodal data and perform entity recognition on the multimodal feature vectors to determine multiple entities in the multimodal data; the determination module being further configured to perform ontology alignment processing and similarity calculation on the multiple entities respectively to determine the association relationships between the entities; and a processing module, configured to construct a cross-modal data association retrieval knowledge graph based on the multiple entities and the association relationships between the entities.

9. An electronic device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1 to 7.

11. A computer program product, characterized in that it comprises a computer program that, when executed by a processor, implements the method of any one of claims 1 to 7.