A multimodal retrieval method and system based on knowledge graphs

By identifying entity information in multimodal archive data, calculating cross-modal consistency scores, constructing semantic anchors to generate knowledge events, optimizing the knowledge graph structure, and combining disambiguation processing with user query habits, the heterogeneity and dynamic adaptation issues of multimodal data are resolved, resulting in more accurate and comprehensive retrieval results.

CN121387936B · Active · Publication Date: 2026-06-30 · GUANGDONG LIXUN INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GUANGDONG LIXUN INFORMATION TECH CO LTD
Filing Date: 2025-11-27
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies suffer from heterogeneity and semantic dispersion in the construction of knowledge graphs for multimodal data, resulting in incomplete entity recognition, insufficient relationship mining, difficulty in handling ambiguity or vagueness in user query intent, low relevance and limited coverage of search results, and difficulty in adapting to dynamic data growth.

Method used

By identifying entity information in multimodal archive data, calculating cross-modal consistency scores, constructing semantic anchors to generate knowledge events, optimizing the knowledge graph structure, combining disambiguation processing with user query habits, analyzing semantic paths to calculate relevance scores, and realizing the structured integration and dynamic adaptation of multimodal information.

Benefits of technology

It significantly improves the relevance and coverage of search results, enhances the ability to handle fuzzy or ambiguous user queries, and provides more in-depth and comprehensive search results.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a multimodal retrieval method and system based on knowledge graphs. The method includes: identifying entity information in multimodal archive data to calculate cross-modal consistency scores; constructing semantic anchors based on the cross-modal consistency scores to generate knowledge events; acquiring several cross-modal verification information of the knowledge events to structurally optimize a preset knowledge graph, obtaining a target knowledge graph; determining the query intent based on query information and user historical query habits, and extracting semantic extension information from the target knowledge graph to disambiguate the query intent; determining several retrieval results based on the disambiguated query intent using the target knowledge graph, and analyzing the semantic path between the disambiguated query intent and each retrieval result; and calculating the relevance score of each retrieval result based on the semantic path for ranking. This invention can provide more in-depth retrieval results, improving the relevance and coverage of the retrieval results.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a multimodal retrieval method and system based on knowledge graphs.

Background Technology

[0002] In scenarios such as electronic archives and digital libraries, the format of archival data has expanded from single text to multiple modalities such as images, audio, and video. How to effectively organize this fragmented and heterogeneous multimodal information and quickly and accurately retrieve the content needed by users has become a major challenge facing the field of information processing.

[0003] In existing technologies, knowledge graph technology describes entities and their relationships in a structured manner, providing an effective means for semantic association of multi-source heterogeneous data. It is increasingly being applied in fields such as archives management and digital libraries to support information retrieval services. However, traditional knowledge graph-based retrieval methods still face many challenges when processing multimodal data. On the one hand, the inherent heterogeneity and semantic dispersion of multimodal data make it difficult for traditional methods to achieve deep semantic fusion and association across modalities, resulting in incomplete entity identification and insufficient relationship mining during knowledge graph construction, thus affecting a comprehensive understanding of the archive content. On the other hand, users' query intentions are often ambiguous or vague when conducting information retrieval, and existing retrieval models lack effective disambiguation and semantic expansion capabilities for query conditions, resulting in low relevance and limited coverage of search results, failing to meet users' needs for accurate and comprehensive information retrieval. Furthermore, the static construction model of existing knowledge graphs is also difficult to adapt to dynamically growing multimodal data, further restricting the accuracy and practicality of the retrieval system.

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of the prior art. This invention provides a multimodal retrieval method and system based on knowledge graphs, which can provide more in-depth and extensive retrieval results, and significantly improve the relevance and coverage of the retrieval results.

[0005] To address the aforementioned technical problems, this invention provides a multimodal retrieval method based on knowledge graphs, the method comprising:

[0006] Acquire multimodal archive data, identify entity information in the multimodal archive data, and calculate cross-modal consistency scores based on the entity information;

[0007] Semantic anchors are constructed based on the cross-modal consistency scores, and knowledge events are generated based on the semantic anchors.

[0008] Acquire several cross-modal verification information of the knowledge event, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph;

[0009] Obtain the query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent.

[0010] Based on the disambiguation-processed query intent, several retrieval results are determined using the target knowledge graph, and the semantic paths between the disambiguation-processed query intent and each retrieval result are analyzed.

[0011] The relevance score of each search result is calculated based on the semantic path, and the search results are sorted based on the relevance score.

[0012] Optionally, identifying entity information in the multimodal archive data and calculating a cross-modal consistency score based on the entity information includes:

[0013] The terminology of the text data in the multimodal archive data is determined, and the co-occurrence frequency and semantic association strength of the terminology with the remaining words in the preset vocabulary list are calculated.

[0014] The contextual uniqueness score of the term vocabulary is calculated based on the co-occurrence frequency and semantic association strength.

[0015] The visual data in the multimodal archive data is subjected to contrast enhancement processing to obtain contrast-enhanced visual data;

[0016] Determine the edge detection information of the visual data after contrast enhancement processing, and calculate the visual sharpness score based on the edge detection information;

[0017] Speaker discrimination analysis is performed on the audio data in the multimodal archive data to obtain speaker discrimination scores, and transformed text confidence analysis is performed on the audio data in the multimodal archive data to obtain transformed text confidence scores.

[0018] Confidence cues for multimodal archive data are generated based on the contextual uniqueness score, visual clarity score, speaker discrimination score, and transformed text confidence score.

[0019] Based on the confidence cues, co-occurrence information analysis is performed to obtain target co-occurrence information, and entity information is determined based on the target co-occurrence information;

[0020] The semantic alignment of the entity information is analyzed, and a cross-modal consistency score is calculated based on the semantic alignment and the historical context adaptation factor.

[0021] Optionally, the step of performing co-occurrence information analysis based on the confidence cues to obtain target co-occurrence information includes:

[0022] Based on the confidence cues, target words for a preset historical period are determined, and based on the confidence cues, detailed light and shadow changes of visual cues are analyzed to obtain detailed light and shadow change information.

[0023] Based on the confidence cues, the spoken keywords of the audio cues are analyzed to obtain spoken keywords, and a multi-dimensional association graph is constructed based on the target words, detailed light and shadow change information and spoken keywords.

[0024] Based on the multidimensional association graph, a dense connection subgraph is determined, and co-occurrence information is analyzed based on the dense connection subgraph to obtain target co-occurrence information.
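The patent does not say how the dense connection subgraph is found. As one possible reading, the sketch below uses a k-core (the maximal subgraph in which every node keeps at least k neighbours), which is a swapped-in standard technique and not necessarily the claimed method. The node labels, the adjacency dictionary, and the `k_core` helper are all hypothetical.

```python
def k_core(adjacency, k):
    """Return the nodes of the k-core of an undirected graph.

    adjacency maps each node to the set of its neighbours (symmetric).
    Nodes with fewer than k neighbours inside the remaining set are
    peeled away repeatedly until the set is stable.
    """
    nodes = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(nodes):
            if len(adjacency[node] & nodes) < k:
                nodes.remove(node)
                changed = True
    return nodes


# Hypothetical association graph mixing the three cue types.
graph = {
    "word:land_reform": {"visual:shadow_shift", "audio:cooperative", "word:misc"},
    "visual:shadow_shift": {"word:land_reform", "audio:cooperative"},
    "audio:cooperative": {"word:land_reform", "visual:shadow_shift"},
    "word:misc": {"word:land_reform"},
}
dense_nodes = k_core(graph, 2)
```

Here the loosely attached `word:misc` node is peeled away, and the three mutually linked cues remain as the candidate source of target co-occurrence information.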

[0025] Optionally, the step of constructing semantic anchors based on the cross-modal consistency scores and generating knowledge events based on the semantic anchors includes:

[0026] The cross-modal consistency score is compared with a preset threshold, and semantic anchors are constructed based on the comparison results;

[0027] Information fragments in multimodal archive data are identified, and semantic alignment between information fragments and abstract concept information is performed using a projection network based on the semantic anchors to obtain semantic alignment results;

[0028] Determine the auxiliary calibration information of the information fragments, and determine the influence weight of the auxiliary calibration information based on the heterogeneous modal feature calibration layer;

[0029] Knowledge events are generated based on the semantic alignment results, auxiliary calibration information, and the influence weights of the auxiliary calibration information.

[0030] Optionally, the step of using a projection network based on the semantic anchors to perform semantic alignment between information fragments and abstract concept information, and obtaining semantic alignment results, includes:

[0031] Based on the concept evolution trajectory tracker, the semantic anchors are used to analyze the semantic expression changes of abstract concept information to obtain semantic expression change information.

[0032] The loss function is determined based on the semantic evolution adaptation term, and a projection network is constructed based on the loss function;

[0033] Based on the semantic expression change information, a projection network is used to perform semantic alignment between information fragments and abstract concept information to obtain semantic alignment results.

[0034] Optionally, analyzing the semantic path between the disambiguated query intent and each search result includes:

[0035] Identify the historical evolution concept information of the disambiguated query intent and of each search result;

[0036] Based on the context, the historical evolution concept information is used to generate a first contextualized semantic vector of the query intent after disambiguation processing and a second contextualized semantic vector of each search result.

[0037] Based on the time context matching factor, the time series consistency of the first contextualized semantic vector and each second contextualized semantic vector is evaluated to obtain the time series consistency evaluation results.

[0038] Identify historical period terms and colloquial expressions in the search results, perform standard semantic entity mapping using the historical period terms and colloquial expressions based on a multi-dimensional mapping model, obtain standard semantic entity mapping information, and determine the confidence level of the standard semantic entity mapping information;

[0039] Based on the historical evolution concept information, time series consistency assessment results, standard semantic entity mapping information, and confidence analysis of the standard semantic entity mapping information, the semantic path of the query intent and each retrieval result after disambiguation processing is determined.

[0040] Optionally, identifying historical period terms and colloquial expressions in the retrieval results, and performing standard semantic entity mapping with those terms and expressions based on a multi-dimensional mapping model to obtain standard semantic entity mapping information, includes:

[0041] Construct a multi-source heterogeneous vocabulary library, and based on the multi-source heterogeneous vocabulary library, combine it with contextual semantic recognition to identify historical period terms and colloquial expressions in the retrieval results;

[0042] A multi-dimensional mapping model is constructed based on a polysemous mapping disambiguation decision tree. Based on the multi-dimensional mapping model, the historical terms and spoken expressions are mapped to the standard semantic entities corresponding to the target knowledge graph, thereby obtaining standard semantic entity mapping information.

[0043] Optionally, the construction of a multi-dimensional mapping model based on a polysemous mapping disambiguation decision tree includes:

[0044] Semantic recognition impact analysis is performed on the historical period terms and colloquial expressions to obtain semantic recognition impact information;

[0045] Based on the semantic recognition impact information, combined with the temporal context matching factor and expert annotation information, an initial polysemous mapping disambiguation decision tree is constructed.

[0046] Obtain the user's mapping option correction history and mapping option feedback history, and adjust the initial polysemous mapping disambiguation decision tree based on the mapping option correction history and mapping option feedback history to obtain the target polysemous mapping disambiguation decision tree, and construct a multi-dimensional mapping model based on the target polysemous mapping disambiguation decision tree.

[0047] Optionally, calculating the relevance score of each search result based on the semantic path includes:

[0048] Obtain the length and relational importance of the semantic path, and calculate the strength of the semantic path based on the length and relational importance of the semantic path;

[0049] Obtain semantic cluster information of semantic paths, and calculate the relevance score of each search result based on the semantic cluster information and the strength of semantic paths.

[0050] In addition, the present invention also provides a multimodal retrieval system based on knowledge graphs, the system comprising:

[0051] Score calculation module: used to acquire multimodal archive data, identify entity information of the multimodal archive data, and calculate cross-modal consistency score based on the entity information;

[0052] Event generation module: used to construct semantic anchors based on the cross-modal consistency score, and generate knowledge events based on the semantic anchors;

[0053] Knowledge graph optimization module: used to acquire several cross-modal verification information of the knowledge event, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph;

[0054] Intent disambiguation module: used to obtain query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent;

[0055] Semantic path analysis module: used to determine several search results based on the disambiguated query intent using the target knowledge graph, and to analyze the semantic path between the disambiguated query intent and each search result;

[0056] Search and ranking module: used to calculate the relevance score of each search result based on the semantic path, and to rank each search result based on the relevance score.

[0057] In this embodiment of the invention, entity information of multimodal archive data is identified, and cross-modal consistency scores are calculated based on the entity information, effectively solving the semantic dispersion problem caused by the heterogeneity of multimodal data. Semantic anchors are constructed based on the cross-modal consistency scores, and knowledge events are generated based on these anchors, achieving structured integration of multimodal information. Several cross-modal verification information of knowledge events is obtained, and the structure of a pre-defined knowledge graph is optimized based on this information, enabling the knowledge graph to dynamically adapt to the growth and changes in multimodal data. The query intent is determined based on query information combined with the user's historical query habits. Semantic extension information is extracted based on the target knowledge graph, and disambiguation processing is performed on the query intent based on this semantic extension information, significantly improving the ability to handle fuzzy or ambiguous user queries, thereby improving the relevance of search results. Based on the disambiguated query intent, several search results are determined using the target knowledge graph, and the semantic path between the disambiguated query intent and each search result is analyzed. Based on the semantic path, the relevance score of each search result is calculated, and the search results are ranked based on the relevance score. This provides more in-depth and comprehensive search results, significantly improving the relevance and coverage of the search results.

Brief Description of the Drawings

[0058] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0059] Figure 1 is a flowchart illustrating the multimodal retrieval method based on knowledge graphs in an embodiment of the present invention;

[0060] Figure 2 is a flowchart illustrating a knowledge graph-based multimodal retrieval method according to another embodiment of the present invention;

[0061] Figure 3 is a schematic diagram of the structural composition of a knowledge graph-based multimodal retrieval system according to an embodiment of the present invention.

Detailed Implementation

[0062] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0063] Example 1

[0064] Referring to Figure 1, which is a flowchart illustrating a knowledge graph-based multimodal retrieval method according to an embodiment of the present invention, the method includes:

[0065] S11: Acquire multimodal archive data, identify entity information of the multimodal archive data, and calculate cross-modal consistency score based on the entity information;

[0066] In a specific implementation of this invention, the terminology of the text data in the multimodal archive data is determined, and the co-occurrence frequency and semantic association strength of the terminology with the other words in a preset vocabulary list are calculated; the contextual uniqueness score of the terminology is calculated based on the co-occurrence frequency and semantic association strength; contrast enhancement processing is performed on the visual data in the multimodal archive data to obtain contrast-enhanced visual data; edge detection information of the contrast-enhanced visual data is determined, and a visual sharpness score is calculated based on the edge detection information; speaker discrimination analysis is performed on the audio data in the multimodal archive data to obtain a speaker discrimination score, and converted text confidence analysis is performed on the audio data to obtain a converted text confidence score; confidence cues for the multimodal archive data are generated based on the contextual uniqueness score, visual sharpness score, speaker discrimination score, and converted text confidence score; co-occurrence information analysis is performed based on the confidence cues to obtain target co-occurrence information, and entity information is determined based on the target co-occurrence information; the semantic alignment degree of the entity information is analyzed, and the cross-modal consistency score is calculated based on the semantic alignment degree combined with a historical context adaptation factor. Through this multi-dimensional, multimodal feature analysis, entity information in the multimodal data is identified more accurately, and calculating the cross-modal consistency score in conjunction with historical context provides a more reliable foundation for subsequent knowledge graph construction and retrieval.

[0067] S12: Construct semantic anchors based on the cross-modal consistency scores, and generate knowledge events based on the semantic anchors;

[0068] In the specific implementation of this invention, the cross-modal consistency score is compared with a preset threshold, and semantic anchors are constructed based on the comparison results; information fragments in the multimodal archive data are identified, and semantic alignment between the information fragments and abstract concept information is performed using a projection network based on the semantic anchors to obtain semantic alignment results; auxiliary calibration information for the information fragments is determined, and the influence weight of the auxiliary calibration information is determined based on the heterogeneous modal feature calibration layer; knowledge events are generated based on the semantic alignment results, auxiliary calibration information, and the influence weight of auxiliary calibration information, effectively integrating multimodal information fragments into structured knowledge events, and introducing auxiliary calibration information to improve the accuracy and robustness of knowledge event generation.
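The threshold comparison that opens this step can be pictured with a minimal sketch. It assumes that an entity becomes a semantic anchor when its cross-modal consistency score meets the preset threshold; the patent only says that anchors are built "based on the comparison results", so this rule, the `SemanticAnchor` type, the threshold value, and the sample scores are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticAnchor:
    entity: str
    score: float


def build_anchors(consistency_scores, threshold=0.7):
    """Keep entities whose score reaches the threshold, best first.

    The 0.7 default is an arbitrary placeholder, not a value from the patent.
    """
    anchors = [
        SemanticAnchor(entity, score)
        for entity, score in consistency_scores.items()
        if score >= threshold
    ]
    return sorted(anchors, key=lambda a: a.score, reverse=True)


# Made-up scores for three hypothetical entities.
scores = {"river survey": 0.91, "unnamed clerk": 0.42, "flood report": 0.75}
anchors = build_anchors(scores)
```

Entities below the threshold are simply left out here; an implementation could instead queue them for further verification.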

[0069] S13: Obtain several cross-modal verification information of the knowledge event, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph;

[0070] In the specific implementation of this invention, several cross-modal verification information of the knowledge event is obtained, and the structure of the preset knowledge graph is optimized based on the cross-modal verification information to obtain the target knowledge graph. This can result in a more accurate and complete target knowledge graph. This optimization mechanism enables the knowledge graph to adapt to the ever-increasing multimodal data, thereby improving the accuracy and practicality of the knowledge graph.

[0071] S14: Obtain the query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent.

[0072] In the specific implementation of this invention, the query information input by the user is obtained, the query intent is determined based on the query information and the user's historical query habits, semantic extension information is extracted based on the target knowledge graph, and the query intent is disambiguated based on the semantic extension information, which greatly improves the clarity of the query intent and effectively solves the problem of low relevance when traditional retrieval models deal with fuzzy or ambiguous queries.
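The patent gives no formula for the disambiguation step. The following sketch is one hypothetical way to combine the two signals it names: semantic extension information (modelled here as each candidate entity's neighbours in the target knowledge graph) and the user's historical query habits (modelled as past selection counts). The function name, the linear weighting, and the sample data are assumptions.

```python
def disambiguate(candidates, query_terms, graph_neighbors, history_counts,
                 history_weight=0.3):
    """Return the candidate entity that best fits the query.

    overlap: share of query terms found among the candidate's graph neighbours.
    habit:   share of the user's past selections that went to the candidate.
    """
    terms = set(query_terms)
    total_history = sum(history_counts.values()) or 1

    def score(candidate):
        neighbors = graph_neighbors.get(candidate, set())
        overlap = len(neighbors & terms) / max(len(terms), 1)
        habit = history_counts.get(candidate, 0) / total_history
        return (1 - history_weight) * overlap + history_weight * habit

    return max(candidates, key=score)


# Hypothetical ambiguous query: building or company?
candidates = ["Riverside Mill (building)", "Riverside Mill (company)"]
neighbors = {
    "Riverside Mill (building)": {"architecture", "blueprint"},
    "Riverside Mill (company)": {"payroll", "employees"},
}
history = {"Riverside Mill (building)": 3, "Riverside Mill (company)": 1}
choice = disambiguate(candidates, ["riverside", "mill", "payroll"], neighbors, history)
```

With these weights the graph evidence ("payroll") outweighs the user's habit of picking the building, so the company sense is chosen; a larger `history_weight` would tip the balance the other way.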

[0073] S15: Based on the disambiguation-processed query intent, several retrieval results are determined using the target knowledge graph, and the semantic paths of the disambiguation-processed query intent and each retrieval result are analyzed.

[0074] In the specific implementation of this invention, based on the disambiguated query intent, several search results are determined using a target knowledge graph. The historical evolution concept information of the disambiguated query intent and each search result is identified. Based on the context, the historical evolution concept information is used to generate a first contextualized semantic vector of the disambiguated query intent and a second contextualized semantic vector of each search result. A time-series consistency assessment is performed on the first contextualized semantic vector and each second contextualized semantic vector based on a time-series matching factor to obtain a time-series consistency assessment result. Historical period terms and colloquial expressions in the search results are identified, and standard semantic entity mapping is performed using the historical period terms and colloquial expressions based on a multi-dimensional mapping model to obtain standard semantic entity mapping information, and the confidence level of the standard semantic entity mapping information is determined. Based on the historical evolution concept information, the time-series consistency assessment result, the standard semantic entity mapping information, and the confidence level of the standard semantic entity mapping information, the semantic path between the disambiguated query intent and each search result is analyzed. This allows for a more comprehensive and accurate assessment of the semantic association between the query intent and the search results, improving the relevance of the search results.

[0075] S16: Calculate the relevance score of each search result based on the semantic path, and sort the search results based on the relevance score.

[0076] In the specific implementation of this invention, the length and relational importance of the semantic path are obtained, and the strength of the semantic path is calculated based on the length and relational importance of the semantic path; the semantic cluster information of the semantic path is obtained, and the relevance score of each search result is calculated based on the semantic cluster information and the strength of the semantic path; the search results are sorted based on the relevance score to achieve more accurate search result sorting.
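As a rough illustration of this scoring step, the sketch below assumes that path strength is the product of per-edge relation importance weights, damped by a decay factor for each extra hop, and that the semantic cluster information is reduced to a single overlap value in [0, 1]. Neither formula appears in the patent; the parameter names and defaults are hypothetical.

```python
def path_strength(relation_weights, decay=0.8):
    """Strength of a semantic path from its edge importances (each in (0, 1]).

    Longer paths are penalised by decay ** (hops - 1).
    """
    if not relation_weights:
        return 0.0
    strength = 1.0
    for weight in relation_weights:
        strength *= weight
    return strength * decay ** (len(relation_weights) - 1)


def relevance_score(relation_weights, cluster_overlap, cluster_weight=0.25):
    """Blend path strength with a cluster-overlap value in [0, 1]."""
    return ((1 - cluster_weight) * path_strength(relation_weights)
            + cluster_weight * cluster_overlap)


# Hypothetical results: (edge importances on the path, cluster overlap).
results = {
    "doc_a": ([0.9], 1.0),
    "doc_b": ([0.9, 0.9], 0.5),
    "doc_c": ([0.6, 0.5, 0.4], 0.0),
}
ranked = sorted(results, key=lambda r: relevance_score(*results[r]), reverse=True)
```

A short path over important relations in the same cluster as the query ranks first, while a long path over weak relations ranks last.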

[0077] In this embodiment of the invention, entity information of multimodal archive data is identified, and cross-modal consistency scores are calculated based on the entity information, effectively solving the semantic dispersion problem caused by the heterogeneity of multimodal data. Semantic anchors are constructed based on the cross-modal consistency scores, and knowledge events are generated based on these anchors, achieving structured integration of multimodal information. Several cross-modal verification information of knowledge events is obtained, and the structure of a pre-defined knowledge graph is optimized based on this information, enabling the knowledge graph to dynamically adapt to the growth and changes in multimodal data. The query intent is determined based on query information combined with the user's historical query habits. Semantic extension information is extracted based on the target knowledge graph, and disambiguation processing is performed on the query intent based on this semantic extension information, significantly improving the ability to handle fuzzy or ambiguous user queries, thereby improving the relevance of search results. Based on the disambiguated query intent, several search results are determined using the target knowledge graph, and the semantic path between the disambiguated query intent and each search result is analyzed. Based on the semantic path, the relevance score of each search result is calculated, and the search results are ranked based on the relevance score. This provides more in-depth and comprehensive search results, significantly improving the relevance and coverage of the search results.

[0078] Example 2

[0079] Referring to Figure 2, which is a flowchart illustrating a knowledge graph-based multimodal retrieval method according to another embodiment of the present invention, the method comprises:

[0080] S201: Acquire multimodal archive data, identify entity information of the multimodal archive data, and calculate cross-modal consistency score based on the entity information;

[0081] In a specific implementation of this invention, the step of identifying entity information in the multimodal archive data and calculating cross-modal consistency scores based on the entity information includes: determining the terminology of the text data in the multimodal archive data, and calculating the co-occurrence frequency and semantic association strength of the terminology with the other words in a preset vocabulary list; calculating the contextual uniqueness score of the terminology based on the co-occurrence frequency and semantic association strength; performing contrast enhancement processing on the visual data in the multimodal archive data to obtain contrast-enhanced visual data; determining the edge detection information of the contrast-enhanced visual data, and calculating the visual sharpness score based on the edge detection information; performing speaker discrimination analysis on the audio data in the multimodal archive data to obtain a speaker discrimination score; performing converted text confidence analysis on the audio data in the multimodal archive data to obtain a converted text confidence score; generating confidence cues for the multimodal archive data based on the contextual uniqueness score, visual sharpness score, speaker discrimination score, and converted text confidence score; performing co-occurrence information analysis based on the confidence cues to obtain target co-occurrence information, and determining entity information based on the target co-occurrence information; and analyzing the semantic alignment degree of the entity information, and calculating a cross-modal consistency score based on the semantic alignment degree combined with a historical context adaptation factor.

[0082] Specifically, this involves acquiring multimodal archival data, which refers to a collection of archives containing various data formats such as text, images, audio, and video. For example, a historical archive might contain a written record, an old photograph, an oral recording, and a related historical video clip.

[0083] The terminology of the text data in the multimodal archive data is determined. Terminology refers to words or phrases that carry a specific meaning in a particular domain or context. This determination can be performed using natural language processing techniques, such as lexical analysis and named entity recognition. The co-occurrence frequency and semantic association strength of the terminology with other words in a preset vocabulary list are then calculated. Co-occurrence frequency refers to the number of times two words appear together in the text, while semantic association strength measures the degree of semantic relevance between words and can be obtained, for example, by calculating the cosine similarity between the word embedding vectors produced by a word vector model.
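As an illustrative, non-limiting sketch of the two quantities described in paragraph [0083], the following Python fragment counts sentence-level co-occurrences and computes cosine similarity between two vectors. The function names, the sentence-level counting window, and the sample sentences are assumptions introduced here for illustration; the source does not specify them.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, term, vocabulary):
    """Count the sentences in which `term` appears together with each vocabulary word."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if term in tokens:
            for word in vocabulary:
                if word != term and word in tokens:
                    counts[word] += 1
    return counts


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample data, not taken from the source.
sentences = [
    "the archive holds the land deed and the survey map",
    "the land deed was signed by the county clerk",
    "the survey map shows the river",
]
counts = cooccurrence_counts(sentences, "deed", ["land", "map", "river"])
```

In practice the vectors passed to `cosine_similarity` would come from a word vector model; any such model is outside the scope of this sketch.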

[0084] The contextual uniqueness score of a term is calculated based on the co-occurrence frequency and semantic association strength. The contextual uniqueness score evaluates the uniqueness and importance of a term in a specific context, and its calculation takes into account both the co-occurrence frequency and the semantic association strength. For example, if a word appears frequently in a specific context and has a strong semantic association with other related words, its contextual uniqueness score is high.
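The source does not give a formula for this score. One possible form, shown purely as an assumption, multiplies a frequency-weighted mean of the association strengths by a saturating frequency factor, so that the score rises with both frequent co-occurrence and strong association. The function name and the `saturation` parameter are hypothetical.

```python
def contextual_uniqueness_score(cooccurrence, association, saturation=3.0):
    """Combine co-occurrence counts and association strengths into one score.

    cooccurrence: word -> co-occurrence count with the term
    association:  word -> semantic association strength in [0, 1]
    saturation:   hypothetical constant controlling how fast frequency saturates
    """
    total = sum(cooccurrence.values())
    if total == 0:
        return 0.0
    weighted_mean = sum(
        count * association.get(word, 0.0) for word, count in cooccurrence.items()
    ) / total
    frequency_factor = total / (total + saturation)  # approaches 1 as counts grow
    return weighted_mean * frequency_factor


# Hypothetical inputs for illustration.
score = contextual_uniqueness_score(
    {"land": 2, "map": 1}, {"land": 0.9, "map": 0.3}
)
```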

[0085] The visual data in the multimodal archive data is subjected to contrast enhancement processing to obtain contrast-enhanced visual data. The contrast enhancement processing aims to improve the image quality of the visual data and make its details more prominent. For example, histogram equalization, adaptive contrast enhancement and other techniques can be used.
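Histogram equalization, one of the techniques named above, can be sketched as follows for an 8-bit grayscale image stored as a list of pixel rows. This is a minimal pure-Python illustration under that storage assumption; a production system would more likely rely on an image processing library.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as rows of ints in [0, levels)."""
    pixels = [p for row in image for p in row]
    if not pixels:
        return []
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Every pixel has the same value, so there is no contrast to stretch.
        return [list(row) for row in image]
    scale = (levels - 1) / (total - cdf_min)
    lookup = [round(max(c - cdf_min, 0) * scale) for c in cdf]
    return [[lookup[p] for p in row] for row in image]


# A low-contrast sample whose values are stretched to the full range.
enhanced = equalize_histogram([[50, 50], [52, 54]])
```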

[0086] Edge detection information is determined from the contrast-enhanced visual data. Edge detection information refers to areas in the image where brightness changes drastically, typically corresponding to the contours and boundaries of objects. This information can be obtained using edge detection algorithms such as Canny and Sobel. A visual sharpness score is then calculated based on this edge detection information. This score quantifies the sharpness of the visual data and can be calculated based on metrics such as the richness and sharpness of edge information. Richer and sharper edge information results in a higher visual sharpness score.
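As a hedged illustration of paragraph [0086], the sketch below applies the Sobel operator to a grayscale image and uses the mean gradient magnitude over interior pixels as a sharpness measure. Using the mean magnitude as the score is an assumption made here; the source only states that richer and sharper edges should yield a higher score.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_sharpness(image):
    """Mean Sobel gradient magnitude over the interior pixels of a grayscale image."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height < 3 or width < 3:
        return 0.0
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            total += math.hypot(gx, gy)
    return total / ((height - 2) * (width - 2))


flat = [[10] * 4 for _ in range(4)]                 # no edges
step_edge = [[0, 0, 255, 255] for _ in range(4)]    # one sharp vertical edge
```

Here `sobel_sharpness(flat)` is zero, while the step-edge image scores much higher, matching the qualitative behavior described above.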

[0087] Speaker discrimination analysis is performed on the audio data in the multimodal archive data to obtain a speaker discrimination score. Speaker discrimination analysis aims to identify speech segments from different speakers in the audio data and evaluate their distinguishability. For example, speaker discrimination can be achieved using voiceprint recognition technology or clustering algorithms, and the speaker discrimination score is calculated based on how well the speakers are separated. Converted text confidence analysis is also performed on the audio data in the multimodal archive data to obtain a converted text confidence score. The converted text confidence score evaluates the accuracy of the result obtained when converting audio data into text. For example, it can be obtained from the output confidence of a speech recognition model or through a multi-model voting mechanism.

[0088] Confidence cues for the multimodal archive data are generated based on the aforementioned contextual uniqueness score, visual sharpness score, speaker discrimination score, and converted text confidence score. These confidence cues are a comprehensive assessment of the reliability and importance of each modality of information within the multimodal archive data. They can be generated by fusing the aforementioned scores through weighted summation, machine learning models, or other methods, and they reflect the contribution and credibility of the different modalities in identifying entity information. For example, a high contextual uniqueness score, a high visual sharpness score, a high speaker discrimination score, and a high converted text confidence score all raise the overall confidence cue.
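The weighted summation option mentioned in paragraph [0088] can be sketched as a normalized weighted average. The score names, the example values, and the equal weights below are illustrative assumptions; the source does not fix any weighting.

```python
def fuse_confidence(scores, weights):
    """Normalized weighted average of per-modality scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical scores for one archive item.
scores = {
    "contextual_uniqueness": 0.8,
    "visual_sharpness": 0.6,
    "speaker_discrimination": 0.9,
    "converted_text_confidence": 0.7,
}
weights = {name: 1.0 for name in scores}
confidence_cue = fuse_confidence(scores, weights)
```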

[0089] Based on the aforementioned confidence cues, co-occurrence information analysis is performed to obtain target co-occurrence information. Co-occurrence information analysis refers to identifying information fragments that appear simultaneously or are closely related in different modalities within multimodal data, based on confidence cues. Entity information is then determined based on the target co-occurrence information. Entity information, such as names of people, places, organizations, and events, is identified from multimodal data through entity extraction, entity linking, and other techniques, building upon the target co-occurrence information.

[0090] The semantic alignment of the entity information is analyzed. Semantic alignment refers to the semantic consistency or relevance of entity information identified in different modalities, for example, whether the events described in the text are semantically consistent with the scenes presented in the visual data. A cross-modal consistency score is calculated based on the semantic alignment and a historical context adaptation factor. The historical context adaptation factor is used to adjust the weight of the semantic alignment to account for semantic changes across historical periods or contexts; for example, certain words or images may have carried a different meaning in a specific historical period. The cross-modal consistency score is a quantitative assessment of the semantic consistency of entity information across different modalities in the multimodal archive data. Because its calculation considers both the semantic alignment and the historical context adaptation factor, it supports the accuracy and reliability of the entity information and provides the basis for subsequent semantic anchor construction and knowledge event generation.
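The source does not state how the semantic alignment and the adaptation factor are combined. As one hedged possibility, the sketch below takes the mean pairwise cosine similarity between per-modality embeddings of an entity as the alignment degree and scales it by a multiplicative adaptation factor. Both choices, and all names used, are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_modal_consistency(embeddings, adaptation_factor=1.0):
    """Mean pairwise alignment across modalities, scaled and clipped to [0, 1].

    embeddings:        modality name -> embedding vector of the same entity
    adaptation_factor: hypothetical historical context adaptation factor
    """
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        return 0.0
    alignment = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return max(0.0, min(1.0, alignment * adaptation_factor))


# Text and image agree, audio does not (toy two-dimensional embeddings).
score = cross_modal_consistency(
    {"text": [1, 0], "image": [1, 0], "audio": [0, 1]}, adaptation_factor=0.9
)
```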

[0091] Furthermore, the step of performing co-occurrence information analysis based on the confidence cues to obtain target co-occurrence information includes: determining target words for a preset historical period based on the confidence cues, and performing detailed light and shadow change analysis of visual cues based on the confidence cues to obtain detailed light and shadow change information; performing spoken keyword analysis of audio cues based on the confidence cues to obtain spoken keywords, and constructing a multidimensional association graph based on the target words, detailed light and shadow change information, and spoken keywords; determining dense connection subgraphs based on the multidimensional association graph, and performing co-occurrence information analysis based on the dense connection subgraphs to obtain target co-occurrence information.

[0092] Specifically, determining target words for a preset historical period based on the confidence cues involves analyzing the text content in multimodal archival data, combining the confidence cues to filter and weight words appearing in the text, thereby identifying words that are significant or highly relevant in a certain historical period. For example, historical dictionaries, timestamp information, or pre-trained language models can be used to assist in identifying these target words. Analyzing the details of light and shadow changes in visual cues based on the confidence cues yields detailed light and shadow change information. This analysis involves in-depth analysis of the visual content in multimodal archival data, paying particular attention to subtle changes in details such as lighting, shadows, and textures in images or videos. These changes may, in certain situations, indicate the occurrence of an event, a change in environment, or the appearance of an object. The confidence cues are then used to guide the degree of attention paid to these visual details and the assessment of their reliability.

[0093] Based on the confidence cues, spoken keyword analysis of audio cues is performed to obtain spoken keywords. This involves performing speech recognition and keyword extraction on the audio content in multimodal archive data. The confidence cues are used in this process to evaluate the accuracy of speech recognition and the reliability of keyword extraction, thereby identifying keywords with high information content or high relevance in the spoken content. A multidimensional association graph is constructed based on the target vocabulary, detailed light and shadow change information, and spoken keywords. This multidimensional association graph is a graph structure that integrates and associates relevant information extracted from different modalities (target vocabulary, detailed light and shadow change information, and spoken keywords). In this graph, nodes can represent these information elements, and edges represent their co-occurrence relationships or semantic association strength.

[0094] Identifying densely connected subgraphs based on the multidimensional association graph refers to identifying subgraph structures with tight connections and high internal correlations among nodes within the constructed multidimensional association graph. These densely connected subgraphs typically represent concepts or events that are highly co-occurring and semantically closely related across different modalities. For example, graph clustering algorithms or community detection algorithms can be used to identify these densely connected subgraphs. Analyzing co-occurrence information based on these densely connected subgraphs to obtain target co-occurrence information means that by analyzing the internal structure and node attributes of these densely connected subgraphs, it is possible to more accurately identify truly co-occurring, high-confidence information fragments in multimodal archival data. These information fragments will serve as the basis for subsequent entity information identification. By identifying densely connected subgraphs within the multidimensional association graph, noise and weak correlation information can be effectively filtered out, focusing on highly consistent and closely related co-occurrence patterns across multiple modalities. This allows for more accurate capture of deep-seated co-occurrence information in multimodal archival data, laying a solid foundation for the accurate identification of subsequent entity information.
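A simple stand-in for the graph clustering or community detection step can be sketched as follows: weak edges are discarded, and the remaining connected components of sufficient size are treated as densely connected subgraphs. This is a deliberately simplified substitute for a real community detection algorithm, and the thresholds, node labels, and edge weights are hypothetical.

```python
from collections import defaultdict


def dense_subgraphs(edges, min_weight=0.5, min_size=3):
    """Return node sets that stay connected after dropping edges below `min_weight`."""
    adjacency = defaultdict(set)
    for a, b, weight in edges:
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            groups.append(component)
    return groups


# Hypothetical association graph: (node, node, association strength).
edges = [
    ("text:treaty", "image:signing", 0.9),
    ("image:signing", "audio:treaty", 0.8),
    ("text:treaty", "audio:treaty", 0.7),
    ("text:harvest", "image:field", 0.2),   # weak link, filtered out as noise
    ("text:flood", "audio:river", 0.9),     # strong but only two nodes
]
groups = dense_subgraphs(edges)
```

With these inputs only the three "treaty" nodes form a qualifying subgraph, which illustrates how weakly correlated information is filtered before entity identification.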

[0095] S202: Compare the cross-modal consistency score with a preset threshold, and construct semantic anchors based on the comparison results;

[0096] In the specific implementation of this invention, the cross-modal consistency score is compared with a preset threshold. The purpose is to filter out information with sufficient cross-modal correlation strength to ensure the high reliability of the constructed semantic anchors. For example, a dynamic threshold can be set, which can be adaptively adjusted according to the characteristics of multimodal archive data or historical data to adapt to data features in different scenarios. When the cross-modal consistency score is higher than the preset threshold, it is considered that there is a strong semantic correlation between the corresponding multimodal information fragments, which can be used as the basis for constructing semantic anchors.
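One way to realize the dynamic threshold mentioned above, given only as an assumption, is to derive it from the mean and standard deviation of historical consistency scores and to keep the fragments whose score exceeds it. The parameters `k` and `floor` and the fragment names are hypothetical.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=0.5, floor=0.5):
    """Threshold adapted to historical scores: mean + k * standard deviation, never below `floor`."""
    if not history:
        return floor
    return max(floor, mean(history) + k * pstdev(history))


def select_anchor_candidates(scored_fragments, threshold):
    """Keep fragments whose cross-modal consistency score is above the threshold."""
    return [name for name, score in scored_fragments if score > threshold]


threshold = dynamic_threshold([0.4, 0.6, 0.8])
candidates = select_anchor_candidates(
    [("fragment_a", 0.9), ("fragment_b", 0.65), ("fragment_c", 0.7)], threshold
)
```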

[0097] S203: Identify information fragments in multimodal archive data, and use a projection network based on the semantic anchors to perform semantic alignment between information fragments and abstract concept information to obtain semantic alignment results;

[0098] In a specific implementation of this invention, the step of semantically aligning information fragments with abstract concept information using a projection network based on the semantic anchors to obtain semantic alignment results includes: analyzing the semantic expression changes of abstract concept information using the semantic anchors based on a concept evolution trajectory tracker to obtain semantic expression change information; determining a loss function based on a semantic evolution adaptation term and constructing a projection network based on the loss function; and using the projection network to semantically align information fragments with abstract concept information based on the semantic expression change information to obtain semantic alignment results.

[0099] Specifically, information fragments are identified in multimodal archive data. Information fragments can be understood as the smallest units of information extracted from multimodal archive data that have independent semantics but are not yet fully structured, such as keywords and phrases in text, specific objects and scenes in visual data, or specific sound events and speech segments in audio data.

[0100] Based on the concept evolution trajectory tracker, the semantic anchors are used to analyze the semantic expression changes of abstract concept information, thereby obtaining semantic expression change information. The concept evolution trajectory tracker can be understood as a component specifically designed to monitor and record the semantic change paths of abstract concepts at different points in time or in different contexts. Its purpose is to capture the dynamic evolutionary characteristics of abstract concepts; for example, the meaning of a word or concept may differ in different historical periods. By utilizing the semantic anchors, the tracker can analyze the changes in the semantic expression of abstract concept information, thereby obtaining semantic expression change information. This semantic expression change information is a quantitative description of the semantic drift or evolution trend of abstract concepts in a specific context or time dimension.

[0101] The loss function is determined based on the semantic evolution adaptation term. The semantic evolution adaptation term refers to a parameter or mechanism used to measure and adjust the model's ability to adapt to the semantic evolution of abstract concepts. Its purpose is to ensure that the projection network can fully consider and adapt to the dynamic changes of abstract concepts when performing semantic alignment. The loss function is a mathematical expression used to evaluate the performance of the projection network in the semantic alignment task and guide the optimization of network parameters. Constructing a projection network based on this loss function, and incorporating the semantic evolution adaptation term into the loss function, can encourage the projection network to pay more attention to and adapt to the semantic evolution of abstract concepts during training. The projection network is a deep learning model designed to map information fragments and abstract concept information to a shared semantic space for semantic alignment.
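The source does not disclose the form of the loss function. Purely as an illustration of how a semantic evolution adaptation term might enter it, the sketch below adds to a cosine-distance alignment term a weighted penalty for the distance between the projected fragment and historical representations of the concept. The function names, the `evolution_weight` parameter, and the choice of cosine distance are all assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def alignment_loss(fragment_vec, concept_vec, concept_history, evolution_weight=0.1):
    """Alignment term plus a hypothetical semantic evolution adaptation term.

    fragment_vec:    projected information fragment
    concept_vec:     projected abstract concept at the current time
    concept_history: projected representations of the concept at earlier periods
    """
    align = cosine_distance(fragment_vec, concept_vec)
    if concept_history:
        evolution = sum(
            cosine_distance(fragment_vec, past) for past in concept_history
        ) / len(concept_history)
    else:
        evolution = 0.0
    return align + evolution_weight * evolution


loss = alignment_loss([1, 0], [1, 0], [[1, 0], [0, 1]])
```

In a trained projection network this scalar would be minimized over the network parameters by a deep learning framework; that training loop is outside the scope of the sketch.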

[0102] Based on the semantic expression change information, a projection network is used to semantically align information fragments with abstract concept information, obtaining semantic alignment results. The projection network utilizes this change information for semantic alignment, meaning that when mapping, the projection network considers not only the semantics at the current moment but also the historical evolution trend of the abstract concept, thus obtaining more accurate semantic alignment results. Using a projection network that incorporates semantic expression change information for semantic alignment enables information fragments to match abstract concepts at a deeper and more accurate level, effectively solving the alignment problem caused by the semantic dynamics of abstract concepts.

[0103] S204: Determine the auxiliary calibration information of the information fragments, and determine the influence weight of the auxiliary calibration information based on the heterogeneous modal feature calibration layer;

[0104] In the specific implementation of this invention, auxiliary calibration information for information fragments is determined. Auxiliary calibration information refers to information, besides the core information fragments, that can supplement or correct the semantic understanding and alignment of the information fragments. For example, for visual information fragments, auxiliary calibration information may be metadata such as the shooting time, location, and shooting device; for textual information fragments, it may be their source, author, and publishing platform. The influence weight of the auxiliary calibration information is determined based on a heterogeneous modal feature calibration layer. This layer is a module specifically designed to process and integrate features from different modalities, capable of evaluating and determining the importance or influence weight of various auxiliary calibration information in the semantic alignment process. Through this calibration layer, the contribution of different auxiliary calibration information to the final semantic alignment result can be dynamically adjusted, thereby improving the accuracy and robustness of the alignment.
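How the calibration layer derives the influence weights is not specified in the source. As a minimal sketch under the assumption that the layer outputs one relevance value per piece of auxiliary calibration information, a softmax can turn those values into weights that sum to one. The relevance values below are hypothetical.

```python
import math


def influence_weights(relevance):
    """Softmax over relevance values, giving influence weights that sum to 1."""
    if not relevance:
        return {}
    peak = max(relevance.values())  # subtracted for numerical stability
    exps = {name: math.exp(value - peak) for name, value in relevance.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical relevance of metadata attached to a photograph.
weights = influence_weights({"shooting_time": 2.0, "location": 1.0, "device": 0.0})
```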

[0105] S205: Generate knowledge events based on the semantic alignment results, auxiliary calibration information, and the influence weights of the auxiliary calibration information;

[0106] In the specific implementation of this invention, knowledge events are generated based on the semantic alignment results, auxiliary calibration information, and the influence weights of the auxiliary calibration information. Knowledge events are structured semantic units containing elements such as time, location, participants, and behavior, enabling a more comprehensive and accurate description of real-world events contained in multimodal archival data. By integrating the semantic alignment results, auxiliary calibration information, and their influence weights, it is ensured that the generated knowledge events are not only semantically accurate but also possess rich contextual information and high credibility. This effectively improves the accuracy, completeness, and credibility of knowledge events. This multi-dimensional and refined processing method significantly enhances the robustness and effectiveness of extracting knowledge events from multimodal archival data.
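A knowledge event of the kind described in paragraph [0106] can be represented as a small record type. In the sketch below, the field names follow the elements listed in the text (time, location, participants, behavior), while the credibility formula, which scales the alignment score by the weighted agreement of the auxiliary calibration information, is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeEvent:
    time: str
    location: str
    participants: list
    behavior: str
    credibility: float = 0.0


def build_event(aligned, alignment_score, agreement, weights):
    """Assemble a knowledge event and a hypothetical credibility value.

    aligned:         event elements obtained from the semantic alignment result
    alignment_score: quality of the semantic alignment in [0, 1]
    agreement:       calibration item -> how well it agrees with the event, in [0, 1]
    weights:         calibration item -> influence weight (assumed to sum to 1)
    """
    support = sum(weights[name] * agreement.get(name, 0.0) for name in weights)
    return KnowledgeEvent(
        time=aligned["time"],
        location=aligned["location"],
        participants=list(aligned["participants"]),
        behavior=aligned["behavior"],
        credibility=alignment_score * support,
    )


# Hypothetical inputs.
event = build_event(
    {"time": "1950-03", "location": "city hall", "participants": ["speaker"],
     "behavior": "delivered a speech"},
    alignment_score=0.9,
    agreement={"shooting_time": 1.0, "location": 0.5},
    weights={"shooting_time": 0.6, "location": 0.4},
)
```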

[0107] S206: Obtain several pieces of cross-modal verification information for the knowledge event, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph;

[0108] In the specific implementation of this invention, several pieces of cross-modal verification information for the knowledge event are obtained. Cross-modal verification information refers to pieces of evidence in different modalities that corroborate one another and thereby enhance the reliability of the knowledge event. For example, a knowledge event may contain a text description "someone delivered an important speech in a certain place," a video recording the scene of the speech, and an audio recording of the specific content of the speech. These items from different modalities corroborate each other, enhancing the reliability of the knowledge event. Based on the cross-modal verification information, the structure of the preset knowledge graph is optimized to obtain the target knowledge graph. The preset knowledge graph refers to a structured knowledge base, constructed before retrieval, that contains knowledge of a certain domain. Using the verification information, the structure of the preset knowledge graph can be optimized, for example, by correcting inaccurate entity relationships, supplementing missing entities or attributes, or discovering new entity relationships. For example, if the preset knowledge graph contains no explicit relationship between "someone" and "a certain place," but the cross-modal verification information shows that the person delivered a speech in that place, then the relationship "delivered the speech at" can be added to the knowledge graph. This enables the knowledge graph to adapt to a continuously growing volume of multimodal data and improves its accuracy and practicality compared with a statically constructed knowledge graph.
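The example of adding a "delivered the speech at" relationship can be sketched as follows. The graph is modeled as a nested dictionary and a relation is added only when evidence from at least two distinct modalities supports it; the data layout and the two-modality rule are assumptions made for illustration.

```python
def add_corroborated_relation(graph, subject, relation, obj, evidence, min_modalities=2):
    """Add (subject, relation, obj) to the graph if enough distinct modalities support it.

    graph:    subject -> relation -> set of objects
    evidence: list of dicts, each with a "modality" key
    Returns True if the relation was added.
    """
    modalities = {item["modality"] for item in evidence}
    if len(modalities) < min_modalities:
        return False
    graph.setdefault(subject, {}).setdefault(relation, set()).add(obj)
    return True


graph = {}
added = add_corroborated_relation(
    graph, "someone", "delivered the speech at", "a certain place",
    evidence=[{"modality": "text"}, {"modality": "video"}, {"modality": "audio"}],
)
rejected = add_corroborated_relation(
    graph, "someone", "visited", "another place",
    evidence=[{"modality": "text"}],
)
```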

[0109] S207: Obtain the query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent.

[0110] In the specific implementation of this invention, the system acquires user-input query information, which can be in the form of text, voice, or images. Based on this query information and the user's historical query habits, the system determines the query intent. For example, if a user inputs "searching for information about the Long March," the system analyzes this query information and combines it with the user's historical query records to determine the user's true query intent: if the user's historical query habits indicate a greater interest in military historical events, the query intent for "Long March" leans more towards "the Red Army's Long March" than towards a geographical interpretation of the term. Semantic extension information is then extracted based on the target knowledge graph. Semantic extension information refers to information extracted from the knowledge graph that is related to the query intent and enriches the query semantics; for "the Red Army's Long March," for instance, relevant participants, important locations, and time points can be extracted from the knowledge graph. Based on this semantic extension information, the query intent is disambiguated to obtain a disambiguated query intent. For example, if the user only inputs "apple," the semantic extension information helps distinguish between the fruit and "Apple Inc.", and, combined with the user's historical query habits, the query can be disambiguated to "Apple Inc." This improves the clarity of the query intent and addresses the low relevance of traditional retrieval models when handling fuzzy or ambiguous queries. Through disambiguation, the system can more accurately understand user needs and thereby provide search results that better meet user expectations.
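The "apple" example can be sketched as a simple overlap test: each candidate sense is associated with related terms drawn from the knowledge graph (the semantic extension information), and the sense whose related terms best match the user's query history is selected. The scoring rule, the related-term sets, and the query history below are hypothetical.

```python
def disambiguate(candidate_senses, history_terms):
    """Pick the sense whose knowledge-graph neighbors overlap most with the query history.

    candidate_senses: sense name -> set of related terms (semantic extension information)
    history_terms:    terms taken from the user's earlier queries
    """
    history = set(history_terms)

    def overlap(sense):
        related = candidate_senses[sense]
        return len(related & history) / len(related) if related else 0.0

    return max(candidate_senses, key=overlap)


senses = {
    "Apple Inc.": {"iphone", "mac", "stock"},
    "apple (fruit)": {"orchard", "juice", "pie"},
}
intent = disambiguate(senses, ["iphone", "laptop", "stock"])
```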

[0111] S208: Based on the disambiguation-processed query intent, several retrieval results are determined using the target knowledge graph, and the semantic paths of the disambiguation-processed query intent and each retrieval result are analyzed;

[0112] In the specific implementation of this invention, the analysis of the semantic path between the disambiguated query intent and each search result includes: identifying the historical evolution concept information of the disambiguated query intent and each search result; generating a first contextualized semantic vector of the disambiguated query intent and a second contextualized semantic vector of each search result based on the historical evolution concept information; performing a time-series consistency assessment on the first contextualized semantic vector and each second contextualized semantic vector based on a time-series matching factor to obtain a time-series consistency assessment result; identifying historical period terms and colloquial expressions in the search results, performing standard semantic entity mapping based on a multi-dimensional mapping model using the historical period terms and colloquial expressions to obtain standard semantic entity mapping information, and determining the confidence level of the standard semantic entity mapping information; and analyzing the semantic path between the disambiguated query intent and each search result based on the historical evolution concept information, the time-series consistency assessment result, the standard semantic entity mapping information, and the confidence level of the standard semantic entity mapping information.

[0113] Specifically, based on the disambiguated query intent, the system uses the target knowledge graph to determine several search results. When determining the search results, the system searches the target knowledge graph according to the disambiguated query intent to find entities and knowledge events related to the query intent. For example, for the disambiguated query intent "Zunyi Conference during the Long March of the Red Army", the system will retrieve archival data related to "Zunyi Conference" from the knowledge graph.

[0114] Identifying the historical evolution concept information of the disambiguated query intent and of each search result refers to extracting, through in-depth analysis of their content, the entities, events, or topics that may have undergone semantic or conceptual changes over time. For example, timestamps, version information, or historical document databases can be used to track the meaning and relevance of specific concepts at different historical stages.

[0115] Based on the historical evolution concept information, a first contextualized semantic vector of the disambiguated query intent and a second contextualized semantic vector of each search result are generated. This can be understood as integrating the historical evolution concept information of the query intent and of the search results into their respective contexts, and generating context-rich vector representations through a natural language processing model. The first contextualized semantic vector is generated for the query intent, while a second contextualized semantic vector is generated for each search result, with the aim of more accurately capturing the true semantics of the query and of the search results in a specific context.

[0116] Performing the time-series consistency assessment on the first contextualized semantic vector and each second contextualized semantic vector based on the time-series matching factor involves introducing that factor to measure the relevance or consistency between the query intent and the search results over time. For example, if the query intent relates to a historical event, and a search result also describes relevant content from the same period, the time-series consistency assessment result will be high. This assessment can be implemented using time series analysis algorithms or similarity calculation methods based on temporal distance.
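A similarity based on temporal distance, one of the options named above, can be sketched with an exponential decay over the gap between the period of the query and that of a search result. The decay form and the `scale_years` parameter are assumptions; the source does not define the matching factor.

```python
import math


def temporal_consistency(query_year, result_year, scale_years=10.0):
    """Score in (0, 1]: 1.0 for the same year, decaying as the periods move apart."""
    return math.exp(-abs(query_year - result_year) / scale_years)


same_period = temporal_consistency(1935, 1935)
decade_apart = temporal_consistency(1935, 1945)
```

Such a factor could, for instance, be multiplied with the cosine similarity of the two contextualized semantic vectors, although the source leaves the exact combination open.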

[0117] Identifying historical terms and colloquial expressions in search results involves paying special attention to terms that may have different meanings in different historical periods or are no longer commonly used, as well as informal colloquial expressions, when processing search results. A multi-dimensional mapping model is used to map these historical terms and colloquial expressions to standard semantic entities, obtaining standard semantic entity mapping information. This multi-dimensional mapping model is used to map these non-standardized linguistic expressions to their corresponding standardized semantic entities in the target knowledge graph. For example, a word used in ancient literature may need to be mapped to its modern equivalent concept. The confidence level of the standard semantic entity mapping information is determined during the mapping process to evaluate the accuracy and reliability of the mapping.

[0118] Based on the historical evolution concept information, the time-series consistency assessment result, the standard semantic entity mapping information, and the confidence level of the standard semantic entity mapping information, the semantic path between the disambiguated query intent and each retrieval result is analyzed. These results are integrated to analyze the semantic path comprehensively, which includes evaluating how well the concepts along the path match in their evolution, their consistency over time, and the semantic association strength once non-standard language expressions have been correctly mapped. Because this multi-dimensional information is used together, the semantic path analysis reflects the true relationship between the query intent and the retrieval results more completely and more accurately.

[0119] Furthermore, the step of identifying historical period terms and colloquial expressions in the retrieval results, and performing standard semantic entity mapping based on a multi-dimensional mapping model using the historical period terms and colloquial expressions to obtain standard semantic entity mapping information, includes: constructing a multi-source heterogeneous vocabulary library, and identifying historical period terms and colloquial expressions in the retrieval results based on the multi-source heterogeneous vocabulary library combined with contextual semantics; constructing a multi-dimensional mapping model based on a polysemous mapping disambiguation decision tree, and mapping the historical period terms and colloquial expressions to the corresponding standard semantic entities in the target knowledge graph based on the multi-dimensional mapping model to obtain standard semantic entity mapping information.

[0120] Specifically, a multi-source heterogeneous vocabulary library is constructed. This library is a collection of words from different sources and formats, such as historical documents, dialects, internet slang, and professional terminology. Its purpose is to comprehensively cover the various non-standard or context-specific expressions that may appear in search results. Based on this multi-source heterogeneous vocabulary library combined with contextual semantics, historical period terms and colloquial expressions in the search results are identified. This means using the library as a reference, together with the context surrounding the search result text, to accurately identify words and expressions used in different historical periods or in specific colloquial settings. Taking contextual semantics into account helps distinguish polysemous words and improves the accuracy of identification.

[0121] A multi-dimensional mapping model is constructed based on a polysemous mapping disambiguation decision tree. This model designs a decision tree structure capable of handling situations where a word or expression may have multiple meanings in different contexts, and disambiguates based on contextual information. Based on this multi-dimensional mapping model, historical terms and colloquial expressions are mapped to corresponding standard semantic entities in the target knowledge graph, obtaining standard semantic entity mapping information. Using the constructed multi-dimensional mapping model, the identified historical terms and colloquial expressions are transformed or associated with predefined, standardized semantic entities in the target knowledge graph. For example, "old object" is mapped to the "cultural relic" entity in the knowledge graph. This achieves semantic standardization and unification, laying a solid foundation for subsequent semantic path analysis and association score calculation.
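A polysemous mapping disambiguation decision tree can be sketched as a nested dictionary in which each internal node tests whether a context word is present and each leaf names a standard semantic entity. The mapping of "old object" to "cultural relic" follows the example in the text; the "wireless" entry, the node layout, and the function name are hypothetical.

```python
def map_to_standard_entity(term, context_tokens, tree):
    """Walk the decision tree for `term` and return the standard entity, or None if unknown."""
    node = tree.get(term)
    while isinstance(node, dict):
        branch = "yes" if node["context_word"] in context_tokens else "no"
        node = node[branch]
    return node


mapping_tree = {
    # Unambiguous term: maps directly to one entity.
    "old object": "cultural relic",
    # Hypothetical polysemous term: the context decides the mapping.
    "wireless": {
        "context_word": "broadcast",
        "yes": "radio receiver",
        "no": "wireless network",
    },
}

entity = map_to_standard_entity("wireless", {"evening", "broadcast"}, mapping_tree)
```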

[0122] Furthermore, the construction of a multi-dimensional mapping model based on the polysemous mapping disambiguation decision tree includes: performing semantic recognition impact analysis on the historical terms and spoken expressions to obtain semantic recognition impact information; constructing an initial polysemous mapping disambiguation decision tree based on the semantic recognition impact information combined with time context matching factors and expert annotation information; obtaining the user's mapping option correction history and mapping option feedback history, and adjusting the initial polysemous mapping disambiguation decision tree based on the mapping option correction history and mapping option feedback history to obtain a target polysemous mapping disambiguation decision tree, and constructing a multi-dimensional mapping model based on the target polysemous mapping disambiguation decision tree.

[0123] Specifically, semantic recognition impact analysis is performed on the historical terms and colloquial expressions to obtain semantic recognition impact information. Semantic recognition impact analysis assesses how easily these linguistic elements are correctly identified and understood in different contexts, and how strongly they affect the overall semantics. For example, statistical analysis, machine learning models, or linguistic rules can be used to quantify the ambiguity, context dependence, and semantic drift of a specific term or colloquial expression over its historical evolution. This information serves as input for the subsequent construction of the decision tree, so that the model can prioritize, or process more precisely, the linguistic elements that have a greater impact on semantic recognition.
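One statistical way to quantify ambiguity and semantic drift is sketched below. It assumes that sense-annotated usage counts are available for an earlier and a later period; the choice of normalized entropy for ambiguity and total variation distance for drift is an assumption of this example, not a formula given in the specification.

```python
import math
from collections import Counter


def sense_entropy(sense_counts):
    """Normalized Shannon entropy in [0, 1]; 0 means a single observed sense."""
    total = sum(sense_counts.values())
    k = len(sense_counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in sense_counts.values() if c > 0)
    return h / math.log(k)


def semantic_drift(early, late):
    """Total variation distance between two sense distributions, in [0, 1]."""
    total_early, total_late = sum(early.values()), sum(late.values())
    if total_early == 0 or total_late == 0:
        return 0.0
    senses = set(early) | set(late)
    return 0.5 * sum(
        abs(early.get(s, 0) / total_early - late.get(s, 0) / total_late) for s in senses
    )


def recognition_impact(early, late):
    """Bundle both measures as illustrative 'semantic recognition impact information'."""
    combined = Counter(early) + Counter(late)
    return {"ambiguity": sense_entropy(combined), "drift": semantic_drift(early, late)}
```

A term whose senses are evenly split scores high on ambiguity, and a term whose dominant sense changed between periods scores high on drift; either could be used to prioritize terms when building the tree.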

[0124] Based on the semantic recognition impact information, combined with the temporal context matching factor and expert annotation information, an initial polysemous mapping disambiguation decision tree is constructed. The temporal context matching factor is an indicator of the semantic stability and relevance of a specific term or expression across different historical periods; it can be determined, for example, by analyzing the term's frequency of occurrence and co-occurrence patterns in documents from different periods. Expert annotation information is data in which domain experts have manually annotated and confirmed the semantics of terms or colloquial expressions; it provides high-quality supervision signals for the initial structure and rules of the decision tree. Integrating this information yields a preliminary decision tree model that can perform initial semantic disambiguation and mapping of historical terms and colloquial expressions.
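The co-occurrence analysis mentioned above could, under one reading, be sketched as a cosine similarity between the term's co-occurrence profiles in two periods. The function names, the window size, and the restriction to single-token terms are assumptions of this example.

```python
import math
from collections import Counter


def cooccurrence_profile(documents, term, window=3):
    """Count the words appearing within `window` tokens of a single-token `term`."""
    profile = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                profile.update(t for t in tokens[lo:hi] if t != term)
    return profile


def temporal_match_factor(profile_a, profile_b):
    """Cosine similarity of two co-occurrence profiles; 0.0 if either is empty."""
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A value near 1 suggests the term keeps similar company in both periods and is therefore semantically stable; a value near 0 suggests its usage has changed.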

[0125] The system acquires the user's mapping option correction history and mapping option feedback history. The mapping option correction history records the modifications the user made to mapping results provided by the system during actual use, and thus indicates where the model output was unsatisfactory. The mapping option feedback history is the evaluation information that the user actively provides about the correctness of, or preference for, a mapping result. Based on these two histories, the initial polysemous mapping disambiguation decision tree is adjusted: through mechanisms such as reinforcement learning, online learning, or rule updates, the branch conditions, weights, or mapping rules of the decision tree are optimized, yielding a more accurate and robust target polysemous mapping disambiguation decision tree. A multi-dimensional mapping model is constructed based on this target decision tree to map historical terms and colloquial expressions accurately to standard semantic entities. This dynamic adjustment mechanism allows the accuracy of the multi-dimensional mapping model to improve over time as user interaction accumulates.
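A very simple online-update scheme of the kind described above is sketched here. The `AdaptiveLeaf` class, the additive update rule, and the step sizes are assumptions chosen for illustration; the specification leaves the actual mechanism open (reinforcement learning, online learning, or rule updates).

```python
from collections import defaultdict


class AdaptiveLeaf:
    """Leaf of a disambiguation tree holding weights over candidate entities."""

    def __init__(self, prior):
        # `prior` maps candidate entity -> initial weight (e.g. from expert annotation).
        self.weights = defaultdict(float, prior)

    def predict(self):
        """Return the candidate entity with the highest current weight."""
        return max(self.weights, key=self.weights.get)

    def apply_correction(self, shown, chosen, step=1.0):
        # The user replaced `shown` with `chosen`: treated as a strong signal.
        self.weights[shown] -= step
        self.weights[chosen] += step

    def apply_feedback(self, shown, approved, step=0.25):
        # The user rated a mapping as right or wrong: treated as a weaker signal.
        self.weights[shown] += step if approved else -step
```

Weighting corrections more heavily than feedback reflects the assumption that an explicit replacement says more about the right answer than a rating does.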

[0126] S209: Calculate the relevance score of each search result based on the semantic path, and sort the search results based on the relevance score.

[0127] In the specific implementation of this invention, calculating the relevance score of each search result based on the semantic path includes: obtaining the length and relation importance of the semantic path, and calculating the strength of the semantic path based on the length and relation importance; and obtaining the semantic cluster information of the semantic path, and calculating the relevance score of each search result based on the semantic cluster information and the strength of the semantic path.

[0128] Specifically, the length and relation importance of the semantic path are obtained. The length of the semantic path refers to the number of nodes or edges contained in the shortest path from the disambiguated query intent to a certain retrieval result in the target knowledge graph. The shorter the path, the more direct the correlation between the query intent and the retrieval result usually is. Relation importance refers to the preset weight of each relation in the semantic path in the knowledge graph or the importance learned through a machine learning model. For example, some relations may have higher importance than others. The strength of the semantic path is calculated based on the length and relation importance. The strength of the semantic path can be understood as a measure that comprehensively considers the path length and the importance of each relation on the path. Its purpose is to quantify the tightness of the semantic connection between the query intent and the retrieval result. For example, it can be calculated by accumulating or multiplying the relation importance and normalizing it in combination with the path length.
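The path-strength calculation can be illustrated with a small sketch. The toy `GRAPH`, its node and relation names, and the particular formula (geometric mean of the relation importances divided by the number of edges) are assumptions for this example; the text only states that importances are accumulated or multiplied and normalized with the path length.

```python
import math
from collections import deque

# Hypothetical graph: node -> list of (neighbor, relation, importance in (0, 1]).
GRAPH = {
    "query:Tang poetry": [("Li Bai", "mentions", 0.9), ("Tang dynasty", "in_period", 0.6)],
    "Li Bai": [("doc:manuscript_17", "wrote", 0.8)],
    "Tang dynasty": [("doc:map_03", "depicts", 0.5)],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the importances along a fewest-edge path, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, weights = queue.popleft()
        if node == goal:
            return weights
        for neighbor, _relation, importance in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, weights + [importance]))
    return None


def path_strength(weights):
    """Geometric mean of relation importances, divided by the number of edges."""
    if weights is None:   # no path between query intent and result
        return 0.0
    if not weights:       # query intent and result are the same node
        return 1.0
    n = len(weights)
    return math.prod(weights) ** (1.0 / n) / n
```

With this formula a one-edge path of importance 1.0 has strength 1.0, and strength falls both as the path grows longer and as its relations become less important.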

[0129] Semantic cluster information of semantic paths is obtained. Semantic cluster information refers to the locally densely connected subgraphs or concept groups formed by entities and relationships related to semantic paths in the target knowledge graph. These clusters reflect the knowledge structure of a specific topic or domain. For example, when both the query intent and the search results involve "historical events," they may share a semantic cluster containing entities such as "people," "places," and "times." Obtaining semantic cluster information can help assess the context of the semantic path, thereby more accurately determining relevance. Based on the semantic cluster information and the strength of the semantic path, the relevance score of each search result is calculated, which can quantify the direct relevance between the query intent and the search results at both structural and semantic levels. The shorter the path and the more important the relationship, the closer the semantic connection between the two. Based on this, by calculating the strength of the semantic path, these factors can be comprehensively considered to obtain a more accurate relevance measure. Furthermore, by obtaining the semantic cluster information of the semantic path, relevance can be assessed from a more macro-level knowledge background perspective, that is, to determine whether the query intent and the search results are within similar knowledge domains or topic categories. It is precisely because the structural features, semantic features, and contextual features of the path are comprehensively considered that the calculation of the relevance score is more comprehensive and accurate.
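As a sketch of how cluster information and path strength might be combined, the example below represents each semantic cluster as a set of entity identifiers and blends their Jaccard overlap with the path strength. The Jaccard measure, the linear blend, and the weight `alpha` are assumptions of this example rather than the formula of the invention.

```python
def cluster_overlap(query_cluster, result_cluster):
    """Jaccard similarity between two sets of entity identifiers."""
    union = query_cluster | result_cluster
    return len(query_cluster & result_cluster) / len(union) if union else 0.0


def relevance_score(strength, query_cluster, result_cluster, alpha=0.7):
    """Blend path strength (structural) with cluster overlap (contextual)."""
    overlap = cluster_overlap(query_cluster, result_cluster)
    return alpha * strength + (1.0 - alpha) * overlap
```

For instance, a query cluster of {"people", "places", "times"} and a result cluster of {"people", "places"} overlap by two thirds, which raises the score of a result even when its path to the query is relatively long.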

[0130] All search results are sorted by the calculated relevance score, with the highest-scoring results displayed first, so that users see the most relevant information first. This semantic path-based relevance assessment reveals deeper connections between the query intent and the search results than simple keyword matching can. Compared with existing methods based on keyword matching or simple semantic similarity, this method provides more in-depth and comprehensive search results, significantly improving the relevance and coverage of the results.
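The ordering step itself is straightforward; a minimal sketch, assuming results arrive as hypothetical `(result_id, score)` pairs, is:

```python
def rank_results(scored):
    """Sort (result_id, score) pairs by descending score, breaking ties by id."""
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Breaking ties by identifier is an arbitrary choice made here so that the ordering is deterministic.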

[0131] In this embodiment of the invention, entity information in the multimodal archive data is identified, and cross-modal consistency scores are calculated based on the entity information, effectively solving the semantic dispersion problem caused by the heterogeneity of multimodal data. Semantic anchors are constructed based on the cross-modal consistency scores, and knowledge events are generated based on these anchors, achieving structured integration of multimodal information. Several pieces of cross-modal verification information for the knowledge events are obtained, and the structure of the preset knowledge graph is optimized based on this information to obtain the target knowledge graph, enabling the knowledge graph to adapt dynamically to the growth and change of multimodal data. The query intent is determined from the query information combined with the user's historical query habits; semantic extension information is extracted based on the target knowledge graph, and the query intent is disambiguated based on this semantic extension information, significantly improving the ability to handle fuzzy or ambiguous user queries and thereby the relevance of the search results. Based on the disambiguated query intent, several search results are determined using the target knowledge graph, and the semantic path between the disambiguated query intent and each search result is analyzed. The relevance score of each search result is calculated based on the semantic path, and the search results are ranked by relevance score. This provides more in-depth and comprehensive search results, significantly improving their relevance and coverage.

[0132] Example 3

[0133] Referring to Figure 3, Figure 3 is a schematic diagram of the structure of a knowledge graph-based multimodal retrieval system according to an embodiment of the present invention. The system includes:

[0134] Score calculation module 31: used to acquire multimodal archive data, identify entity information of the multimodal archive data, and calculate cross-modal consistency score based on the entity information;

[0135] Event generation module 32: used to construct semantic anchors based on the cross-modal consistency score, and generate knowledge events based on the semantic anchors;

[0136] Knowledge graph optimization module 33: used to acquire several pieces of cross-modal verification information of the knowledge events, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph;

[0137] Intent disambiguation module 34: used to obtain query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent;

[0138] Semantic path analysis module 35: used to determine several search results based on the disambiguated query intent using the target knowledge graph, and analyze the semantic path between the disambiguated query intent and each search result;

[0139] Search and ranking module 36: used to calculate the relevance score of each search result based on the semantic path, and to rank each search result based on the relevance score.

[0140] In the specific implementation of this invention, for the specific implementation of each item of the system, reference may be made to the implementation of the corresponding item of the method described above, which is not repeated here.
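The way the six modules hand data to one another can be sketched as a simple composition. The `RetrievalSystem` class, its field names, and the shape of the data passed between modules are assumptions of this example; each callable is a placeholder for the corresponding module's real logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class RetrievalSystem:
    # Each field stands in for one module of Figure 3.
    score_calculation: Callable[[Any], Any]                  # module 31
    event_generation: Callable[[Any], Any]                   # module 32
    graph_optimization: Callable[[Any], Any]                 # module 33
    intent_disambiguation: Callable[[str, Any, Any], Any]    # module 34
    semantic_path_analysis: Callable[[Any, Any], List[Tuple[str, Any]]]  # module 35
    search_and_ranking: Callable[[List[Tuple[str, Any]]], List[str]]     # module 36

    def build_graph(self, archive_data):
        """Modules 31 to 33: archive data -> target knowledge graph."""
        scored = self.score_calculation(archive_data)
        events = self.event_generation(scored)
        return self.graph_optimization(events)

    def search(self, query, history, graph):
        """Modules 34 to 36: query plus user history -> ranked result ids."""
        intent = self.intent_disambiguation(query, history, graph)
        results_with_paths = self.semantic_path_analysis(intent, graph)
        return self.search_and_ranking(results_with_paths)
```

Separating `build_graph` from `search` mirrors the split between graph construction (modules 31 to 33) and query-time processing (modules 34 to 36).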

[0141] In this embodiment of the invention, entity information in the multimodal archive data is identified, and cross-modal consistency scores are calculated based on the entity information, effectively solving the semantic dispersion problem caused by the heterogeneity of multimodal data. Semantic anchors are constructed based on the cross-modal consistency scores, and knowledge events are generated based on these anchors, achieving structured integration of multimodal information. Several pieces of cross-modal verification information for the knowledge events are obtained, and the structure of the preset knowledge graph is optimized based on this information to obtain the target knowledge graph, enabling the knowledge graph to adapt dynamically to the growth and change of multimodal data. The query intent is determined from the query information combined with the user's historical query habits; semantic extension information is extracted based on the target knowledge graph, and the query intent is disambiguated based on this semantic extension information, significantly improving the ability to handle fuzzy or ambiguous user queries and thereby the relevance of the search results. Based on the disambiguated query intent, several search results are determined using the target knowledge graph, and the semantic path between the disambiguated query intent and each search result is analyzed. The relevance score of each search result is calculated based on the semantic path, and the search results are ranked by relevance score. This provides more in-depth and comprehensive search results, significantly improving their relevance and coverage.

[0142] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, etc.

[0143] The above is a detailed description of the knowledge graph-based multimodal retrieval method and system provided by the embodiments of the present invention. Specific examples have been used to illustrate the principles and implementation of the present invention, and the descriptions of the above embodiments are intended only to help in understanding the method and core ideas of the present invention. Those skilled in the art may, based on the ideas of the present invention, make changes to the specific implementation and scope of application. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A multimodal retrieval method based on knowledge graphs, characterized in that the method includes: acquiring multimodal archive data, identifying entity information in the multimodal archive data, and calculating cross-modal consistency scores based on the entity information; constructing semantic anchors based on the cross-modal consistency scores, and generating knowledge events based on the semantic anchors; acquiring several pieces of cross-modal verification information of the knowledge events, and optimizing the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph; obtaining the query information input by the user, determining the query intent based on the query information and the user's historical query habits, extracting semantic extension information based on the target knowledge graph, and performing disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent; determining several retrieval results based on the disambiguated query intent using the target knowledge graph, and analyzing the semantic path between the disambiguated query intent and each retrieval result; and calculating the relevance score of each search result based on the semantic path, and sorting the search results based on the relevance score;
The step of identifying entity information in the multimodal archival data and calculating cross-modal consistency scores based on the entity information includes: determining the terminology of the text data in the multimodal archival data, and calculating the co-occurrence frequency and semantic association strength of the terminology with other words in a preset vocabulary list; calculating the contextual uniqueness score of the terminology based on the co-occurrence frequency and semantic association strength; performing contrast enhancement processing on the visual data in the multimodal archival data to obtain contrast-enhanced visual data; determining the edge detection information of the contrast-enhanced visual data, and calculating the visual sharpness score based on the edge detection information; and further processing the... Speaker discrimination analysis is performed on the audio data in the multimodal archive data to obtain speaker discrimination scores. Transformed text confidence analysis is then performed on the audio data in the multimodal archive data to obtain transformed text confidence scores. Confidence cues for the multimodal archive data are generated based on the contextual uniqueness score, visual clarity score, speaker discrimination score, and transformed text confidence scores. Co-occurrence information analysis is performed based on the confidence cues to obtain target co-occurrence information, and entity information is determined based on the target co-occurrence information. The semantic alignment degree of the entity information is analyzed, and a cross-modal consistency score is calculated based on the semantic alignment degree combined with a historical context adaptation factor. 
The step of constructing semantic anchors based on the cross-modal consistency score and generating knowledge events based on the semantic anchors includes: comparing the cross-modal consistency score with a preset threshold and constructing semantic anchors based on the comparison results; identifying information fragments in the multimodal archive data, and using a projection network to perform semantic alignment between the information fragments and abstract concept information based on the semantic anchors to obtain semantic alignment results; determining auxiliary calibration information for the information fragments, and determining the influence weight of the auxiliary calibration information based on a heterogeneous modal feature calibration layer; and generating knowledge events based on the semantic alignment results, the auxiliary calibration information, and the influence weight of the auxiliary calibration information.

2. The knowledge graph-based multimodal retrieval method according to claim 1, characterized in that performing co-occurrence information analysis based on the confidence cues to obtain target co-occurrence information includes: determining target words for a preset historical period based on the confidence cues, and analyzing detailed light and shadow changes of visual cues based on the confidence cues to obtain detailed light and shadow change information; analyzing audio cues based on the confidence cues to obtain spoken keywords, and constructing a multi-dimensional association graph based on the target words, the detailed light and shadow change information, and the spoken keywords; and determining a densely connected subgraph based on the multi-dimensional association graph, and analyzing co-occurrence information based on the densely connected subgraph to obtain the target co-occurrence information.

3. The multimodal retrieval method based on knowledge graphs according to claim 1, characterized in that using a projection network to perform semantic alignment between the information fragments and abstract concept information based on the semantic anchors to obtain semantic alignment results includes: analyzing, with a concept evolution trajectory tracker and based on the semantic anchors, changes in the semantic expression of the abstract concept information to obtain semantic expression change information; determining a loss function based on a semantic evolution adaptation term, and constructing the projection network based on the loss function; and using the projection network to perform semantic alignment between the information fragments and the abstract concept information based on the semantic expression change information to obtain the semantic alignment results.

4. The multimodal retrieval method based on knowledge graphs according to claim 1, characterized in that analyzing the semantic path between the disambiguated query intent and each search result includes: identifying historical evolution concept information of the disambiguated query intent and of each search result; generating, based on context and using the historical evolution concept information, a first contextualized semantic vector of the disambiguated query intent and a second contextualized semantic vector of each search result; evaluating, based on a temporal context matching factor, the time-series consistency between the first contextualized semantic vector and each second contextualized semantic vector to obtain time-series consistency evaluation results; identifying historical terms and colloquial expressions in the search results, performing standard semantic entity mapping on the historical terms and colloquial expressions based on a multi-dimensional mapping model to obtain standard semantic entity mapping information, and determining the confidence level of the standard semantic entity mapping information; and determining the semantic path between the disambiguated query intent and each search result based on the historical evolution concept information, the time-series consistency evaluation results, the standard semantic entity mapping information, and the confidence level of the standard semantic entity mapping information.

5. The knowledge graph-based multimodal retrieval method according to claim 4, characterized in that identifying historical terms and colloquial expressions in the search results and performing standard semantic entity mapping on the historical terms and colloquial expressions based on a multi-dimensional mapping model to obtain standard semantic entity mapping information includes: constructing a multi-source heterogeneous vocabulary corpus, and identifying the historical terms and colloquial expressions in the search results based on the multi-source heterogeneous vocabulary corpus combined with contextual semantic recognition; constructing the multi-dimensional mapping model based on a polysemous mapping disambiguation decision tree; and mapping, based on the multi-dimensional mapping model, the historical terms and colloquial expressions to the corresponding standard semantic entities in the target knowledge graph to obtain the standard semantic entity mapping information.

6. The multimodal retrieval method based on knowledge graphs according to claim 5, characterized in that constructing the multi-dimensional mapping model based on a polysemous mapping disambiguation decision tree includes: performing semantic recognition impact analysis on the historical terms and colloquial expressions to obtain semantic recognition impact information; constructing an initial polysemous mapping disambiguation decision tree based on the semantic recognition impact information combined with a temporal context matching factor and expert annotation information; and obtaining the user's mapping option correction history and mapping option feedback history, adjusting the initial polysemous mapping disambiguation decision tree based on the mapping option correction history and mapping option feedback history to obtain a target polysemous mapping disambiguation decision tree, and constructing the multi-dimensional mapping model based on the target polysemous mapping disambiguation decision tree.

7. The multimodal retrieval method based on knowledge graphs according to claim 1, characterized in that calculating the relevance score of each search result based on the semantic path includes: obtaining the length and relation importance of the semantic path, and calculating the strength of the semantic path based on the length and relation importance; and obtaining semantic cluster information of the semantic path, and calculating the relevance score of each search result based on the semantic cluster information and the strength of the semantic path.

8. A multimodal retrieval system based on knowledge graphs, characterized in that the system includes: Score calculation module: used to acquire multimodal archive data, identify entity information of the multimodal archive data, and calculate cross-modal consistency score based on the entity information; Event generation module: used to construct semantic anchors based on the cross-modal consistency score, and generate knowledge events based on the semantic anchors; Knowledge graph optimization module: used to acquire several cross-modal verification information of the knowledge event, and optimize the structure of the preset knowledge graph based on the cross-modal verification information to obtain the target knowledge graph; Intent disambiguation module: used to obtain query information input by the user, determine the query intent based on the query information and the user's historical query habits, extract semantic extension information based on the target knowledge graph, and perform disambiguation processing on the query intent based on the semantic extension information to obtain the disambiguated query intent; Semantic path analysis module: used to determine several search results based on the disambiguated query intent using the target knowledge graph, and to analyze the semantic path between the disambiguated query intent and each search result; Search and ranking module: used to calculate the relevance score of each search result based on the semantic path, and to rank each search result based on the relevance score;
The step of identifying entity information in the multimodal archival data and calculating cross-modal consistency scores based on the entity information includes: determining the terminology of the text data in the multimodal archival data, and calculating the co-occurrence frequency and semantic association strength of the terminology with other words in a preset vocabulary list; calculating the contextual uniqueness score of the terminology based on the co-occurrence frequency and semantic association strength; performing contrast enhancement processing on the visual data in the multimodal archival data to obtain contrast-enhanced visual data; determining the edge detection information of the contrast-enhanced visual data, and calculating the visual sharpness score based on the edge detection information; and further processing the... Speaker discrimination analysis is performed on the audio data in the multimodal archive data to obtain speaker discrimination scores. Transformed text confidence analysis is then performed on the audio data in the multimodal archive data to obtain transformed text confidence scores. Confidence cues for the multimodal archive data are generated based on the contextual uniqueness score, visual clarity score, speaker discrimination score, and transformed text confidence scores. Co-occurrence information analysis is performed based on the confidence cues to obtain target co-occurrence information, and entity information is determined based on the target co-occurrence information. The semantic alignment degree of the entity information is analyzed, and a cross-modal consistency score is calculated based on the semantic alignment degree combined with a historical context adaptation factor.
The step of constructing semantic anchors based on the cross-modal consistency score and generating knowledge events based on the semantic anchors includes: comparing the cross-modal consistency score with a preset threshold and constructing semantic anchors based on the comparison results; identifying information fragments in the multimodal archive data, and using a projection network to perform semantic alignment between the information fragments and abstract concept information based on the semantic anchors to obtain semantic alignment results; determining auxiliary calibration information for the information fragments, and determining the influence weight of the auxiliary calibration information based on a heterogeneous modal feature calibration layer; and generating knowledge events based on the semantic alignment results, the auxiliary calibration information, and the influence weight of the auxiliary calibration information.