Unified multimodal data retrieval method and system for the cultural field
By constructing a cross-modal semantic representation space and a cultural semantic alignment mechanism, combined with cultural knowledge structure verification, the problems of insufficient semantic matching and missing information in multimodal cultural resource retrieval are solved, and more accurate and complete cultural resource retrieval is achieved.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: SHAANXI YUNCHUANG NETWORK TECH CO LTD
- Filing Date: 2026-04-15
- Publication Date: 2026-06-19
Figure CN122045265B_ABST (abstract drawing)
Description
Technical Field
[0001] This application relates to the field of information retrieval technology, and in particular to a unified method and system for multimodal data retrieval in the cultural field.
Background Technology
[0002] With the development of digital technology, a large amount of cultural resources are gradually being stored and disseminated digitally. Cultural resource databases not only contain textual information such as documents and historical archives, but also data in various forms, including images of cultural relics, artworks, and related visual materials. To improve users' efficiency in retrieving and utilizing cultural resources, cultural information service systems are gradually evolving from traditional keyword search methods to intelligent search methods based on semantic analysis and machine learning technologies.
[0003] In the course of technological development, cultural resource retrieval technology has roughly gone through the following stages: Early systems mainly relied on text retrieval methods based on keyword matching, and achieved document retrieval by establishing an inverted index structure. However, this method could only process text information and was difficult to effectively utilize the image information in cultural resources. Subsequently, with the development of computer vision technology, image retrieval methods based on image feature extraction emerged. By analyzing the color, texture, or depth features of images, similarity retrieval between images was achieved. In recent years, with the development of deep learning and cross-modal representation learning technologies, multimodal retrieval technology has gradually emerged. By constructing a unified semantic space, different modal data such as text and images are mapped to the same feature space, thereby achieving semantic matching and retrieval between cross-modal data.
[0004] However, existing multimodal retrieval technologies still have certain limitations in cultural applications. First, cultural resources often possess distinct historical contexts and cultural semantic features, such as dynastic backgrounds, regional cultural characteristics, and artistic styles. This semantic information is difficult to accurately express using only general visual features or general semantic vectors, easily leading to biases in retrieval results at the cultural semantic level. Second, existing cross-modal retrieval methods typically rely solely on general semantic features when calculating vector similarity, lacking semantic alignment mechanisms for cultural attribute features. This results in some retrieval results being inconsistent with the user's search intent in terms of temporal context or regional attributes. Third, in the retrieval result output stage, existing methods usually directly return similarity ranking results, lacking further verification of the semantic connections between results and the completeness of cultural knowledge structures. This easily leads to duplicate resources or missing cultural attribute information, thus affecting the user's overall understanding of cultural resources.
[0005] Therefore, how to introduce a cultural semantic alignment mechanism into the multimodal retrieval process, and combine the spatiotemporal attribute information of cultural resources with the cultural knowledge structure to perform unified semantic matching and structured screening of retrieval results, thereby improving the semantic accuracy and cultural information integrity of cultural resource retrieval results, has become a key technical problem that urgently needs to be solved in this field.
Summary of the Invention
[0006] To address the problems of insufficient semantic matching accuracy, limited ability to express cultural semantic features, and lack of structured verification of search results in the multimodal cultural resource retrieval process of existing technologies, this application provides a unified multimodal data retrieval method and system for the cultural field. By constructing a unified cross-modal semantic representation space and combining a cultural semantic alignment mechanism and a cultural knowledge structure verification mechanism, it realizes unified cross-modal retrieval of cultural resources, thereby improving the semantic accuracy and cultural information integrity of cultural resource retrieval results.
[0007] Firstly, this application provides a unified retrieval method for multimodal data in the cultural field, the method comprising:
[0008] S1. Obtain multimodal query data input by the user, and use a preset contrastive learning model to extract semantics from it to obtain the query semantic vector;
[0009] S2. Based on the query semantic vector, perform retrieval and matching in the pre-set multimodal vector index of the cultural resource database to obtain a candidate vector set;
[0010] S3. Project the query semantic vector and each candidate vector to the preset cultural semantic alignment space, use the preset weight matrix to reweight the dimensional components of different cultural attributes in the space, calculate the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and select the preliminary result set.
[0011] S4. Compare the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set with the historical intent information extracted from the multimodal query data to obtain a refined result set.
[0012] S5. Sort the refined result set based on the comprehensive similarity to determine the initial output sequence;
[0013] S6. Based on the semantic features of each resource in the initial output sequence, cluster and group them to construct a candidate resource pool containing multiple topic clusters. For any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, local deduplication is performed.
[0014] S7. Based on the preset integrity indicators, perform logical verification on the resource entities and their associated cultural knowledge chains in each topic cluster, and output the verified cultural resource retrieval results.
[0015] Secondly, this application provides a unified multimodal data retrieval system for the cultural field, the system comprising:
[0016] The semantic encoding module is used to acquire multimodal query data input by the user, extract semantics from it using a pre-set contrastive learning model, and obtain the query semantic vector.
[0017] The vector retrieval module is used to perform retrieval and matching in the pre-set multimodal vector index of the cultural resource database based on the query semantic vector to obtain a candidate vector set;
[0018] The cultural alignment module projects the query semantic vector and each candidate vector to a preset cultural semantic alignment space. It uses a preset weight matrix to reweight the dimensional components of different cultural attributes in the space, calculates the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and selects a preliminary result set.
[0019] The intent refinement module is used to perform semantic consistency comparison between the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set and the historical intent information extracted from the multimodal query data to obtain a refined result set.
[0020] The similarity sorting module is used to sort the refined result set according to the comprehensive similarity and determine the initial output sequence;
[0021] The clustering and deduplication module is used to cluster and group resources according to the semantic features of each resource in the initial output sequence, and construct a candidate resource pool containing multiple topic clusters. For any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, local deduplication is performed.
[0022] The verification output module is used to perform logical verification on resource entities and their associated cultural knowledge chains in each topic cluster by combining preset integrity indicators, and output the cultural resource retrieval results that have passed the verification.
[0023] Compared with the prior art, the beneficial effects of the technical solution of this application are at least as follows:
[0024] 1. By introducing a cross-modal joint embedding mechanism based on a contrastive learning model, textual description information and image visual features are mapped to a unified semantic space, enabling cultural resources of different modalities to be matched and retrieved under the same semantic dimension. This effectively reduces the semantic deviation between textual and image information and improves the semantic matching ability between multimodal cultural resources.
[0025] 2. By constructing a cultural semantic alignment space and using a weight matrix that integrates culturally unique concepts to reweight the different dimensional components of the semantic vector, the retrieval process can more accurately reflect the specific cultural attributes and characteristics of cultural resources, such as historical dynasties, regional cultural characteristics, and artistic styles, thereby enhancing the expressive power of cultural semantic information in the retrieval process and improving the semantic accuracy of cultural resource retrieval.
[0026] 3. By introducing a consistency comparison mechanism between cross-modal spatiotemporal metadata and historical intent information, candidate cultural resources are not only related to the query content in terms of semantic similarity, but also consistent in cultural attribute dimensions such as time background and spatial region. This reduces the problem of cultural resources that are inconsistent with the user's search intent being mistakenly retrieved, and improves the reliability of cultural resource retrieval results.
[0027] 4. By performing semantic clustering on the search results and constructing topic clusters, and performing local deduplication on resources within the clusters while ensuring semantic relevance, the interference of duplicate resources on the search results can be effectively reduced, thereby improving the structuring of the search results and the browsing efficiency of users.
[0028] 5. By combining cultural knowledge graphs, the integrity of the cultural knowledge chains associated with each cultural resource entity is verified, thereby identifying and eliminating resources with missing cultural attribute nodes or historical spatiotemporal logical conflicts, improving the integrity of the search results at the cultural knowledge structure level, and making the final output cultural resources more systematic and interpretable.
[0029] In summary, this invention achieves unified retrieval of multimodal data in the cultural field by integrating cross-modal semantic representation technology, cultural semantic alignment mechanism, and cultural knowledge structure verification mechanism. This not only improves the semantic matching accuracy of cultural resource retrieval results but also enhances the cultural semantic integrity and knowledge structure consistency of the retrieval results, thereby effectively improving the retrieval quality and user experience in the digital service system for cultural resources.
Attached Figure Description
[0030] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0031] Figure 1 is a flowchart of the unified multimodal data retrieval method for the cultural field proposed in this application;
[0032] Figure 2 is a schematic diagram of the resource vector distribution in the general joint semantic space of existing technologies;
[0033] Figure 3 is a schematic diagram of the vector distribution after cultural semantic alignment and reweighting in an embodiment of this application;
[0034] Figure 4 is a schematic diagram of the structure of the unified multimodal data retrieval system for the cultural field proposed in this application.
Detailed Implementation
[0035] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms “comprising” or “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0036] For ease of understanding, the specific process of the embodiments of this application is described below. Figure 1 shows a flowchart of the unified multimodal data retrieval method for the cultural field provided by this invention. The flowchart specifically includes the following steps:
[0037] S1. Obtain multimodal query data input by the user, and extract its semantics using a pre-defined contrastive learning model to obtain the query semantic vector.
[0038] In one specific embodiment, the process of performing step S1 may specifically include the following steps:
[0039] The multimodal query data is parsed to obtain text description features and image visual features, which are then input into the text encoder and image encoder in the contrastive learning model to extract unimodal feature vectors.
[0040] The single-modal feature vector is projected onto the unified joint semantic space constructed by the contrastive learning model to obtain cross-modal embedding representations. The unified joint semantic space is generated by pre-training and aligning culturally specific concept samples through a contrastive learning strategy that brings the positive sample pairs of different modalities under the same cultural concept closer and pushes the negative sample pairs of different modalities under different cultural concepts further apart.
[0041] Based on preset modality fusion weights, multimodal feature fusion processing is performed on the cross-modal embedding representation to output a query semantic vector.
[0042] Specifically, the system acquires multimodal query data entered by users on the cultural resource retrieval platform. This multimodal query data includes text descriptions, image content, or a combination of text and images. For example, a user might enter the text "flying apsaras image in Tang Dynasty Dunhuang murals" and upload a photo of a flying apsaras mural; or a user might only enter the text "Yuan Dynasty blue and white dragon-patterned plum vase"; or a user might only upload a partial photo of a bronze artifact.
[0043] The system parses the multimodal query data. For the text portion, it receives the user-input search query, extracts the character sequence, and performs word segmentation. During word segmentation, it retains culture-specific expressions, including dynasty names, vessel types, pattern names, theme names, craft names, and regional names. Stop words, repeated punctuation, and non-search symbols are removed. Preferably, the text length is truncated to no more than 76 tokens to fit the input length of the text encoder. For the image portion, it receives a single image uploaded by the user and preprocesses it: the image is scaled to 224×224 pixels, read in three-channel RGB format, and its pixel values are normalized to the [0, 1] range. If the image contains a background border, watermark, or camera information, the main subject area is extracted using the object detection module and used as input for subsequent processing.
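The parsing rules above can be illustrated with a minimal sketch. It is not the applicant's implementation: the term list `CULTURAL_TERMS`, the stop-word set, and both function names are hypothetical, and resizing to 224×224 is left out because it would need an image library.

```python
import re

# Hypothetical vocabulary of culture-specific expressions to keep intact.
# The text names the categories (dynasty, vessel type, pattern, ...) but not the terms.
CULTURAL_TERMS = ["tang dynasty", "yuan dynasty", "dunhuang", "flying apsaras", "plum vase"]
STOP_WORDS = {"in", "of", "the", "a", "an"}  # illustrative only
MAX_TOKENS = 76  # preferred upper bound stated in the text


def tokenize_query(text: str) -> list[str]:
    """Segment a query, keeping cultural terms whole and dropping stop words."""
    lowered = text.lower()
    # Longest terms first so multi-word expressions win over their parts.
    terms = sorted(CULTURAL_TERMS, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b|\w+"
    tokens = re.findall(pattern, lowered)  # punctuation and symbols never match
    kept = [t for t in tokens if t not in STOP_WORDS]
    return kept[:MAX_TOKENS]


def normalize_pixels(rgb_rows):
    """Map 8-bit RGB values (0..255) to floats in [0, 1]."""
    return [[tuple(c / 255.0 for c in px) for px in row] for row in rgb_rows]


tokens = tokenize_query("Flying Apsaras image in Tang Dynasty Dunhuang murals")
```

Matching multi-word terms before single words is one simple way to keep expressions such as "tang dynasty" from being split during segmentation.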
[0044] The preprocessed text and image data are input into the contrastive learning model, respectively. The contrastive learning model consists of a text encoder, an image encoder, and two projectors. For example, the text encoder uses a Transformer-based architecture with 12 Transformer blocks, each containing a multi-head self-attention sub-layer and a feedforward sub-layer. The hidden layer dimension is 768, and the number of self-attention heads is 12. The image encoder uses a residual network structure with 50 convolutional layers. The final output feature map is then subjected to global average pooling to obtain a 2048-dimensional feature vector. The text encoder outputs a unimodal text feature vector with a dimension of 768. The image encoder outputs a unimodal image feature vector with a dimension of 2048. Each encoder is followed by a projector, which consists of two fully connected layers. The first layer maps the feature vector to 512 dimensions, and the second layer maps it to 128 dimensions. After output, L2 normalization is performed to obtain the final unimodal feature vector.
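As a rough illustration of the projector described above, the sketch below maps a 768-dimensional text feature and a 2048-dimensional image feature through two fully connected layers (512, then 128 dimensions) and applies L2 normalization. The weights are random placeholders for trained parameters, and the ReLU between the two layers is an assumption, since the text does not name an activation.

```python
import math
import random


def linear(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


def make_layer(in_dim, out_dim, rng):
    """Randomly initialised layer; stands in for trained parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def project(features, layer1, layer2):
    hidden = linear(features, *layer1)
    hidden = [max(0.0, h) for h in hidden]  # ReLU: an assumption, not stated in the text
    return l2_normalize(linear(hidden, *layer2))


rng = random.Random(0)
text_projector = (make_layer(768, 512, rng), make_layer(512, 128, rng))
image_projector = (make_layer(2048, 512, rng), make_layer(512, 128, rng))

# Random stand-ins for the encoder outputs.
text_vec = project([rng.gauss(0, 1) for _ in range(768)], *text_projector)
image_vec = project([rng.gauss(0, 1) for _ in range(2048)], *image_projector)
```

Because both projectors end in 128 dimensions with unit norm, the dot product of `text_vec` and `image_vec` is directly their cosine similarity.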
[0045] The contrastive learning model requires pre-training to construct a unified joint semantic space. The training process uses culture-specific concept samples to build the dataset. Culture-specific concept samples refer to text-image pairs under the same cultural concept; for example, a "Tang Dynasty gilt bronze Buddha statue" includes a photograph of the Buddha statue and its text description, while a "Song Dynasty celadon lotus petal bowl" includes an image of the object and its descriptive text. Each paired sample is labeled with a cultural concept tag, including era tags, category tags, pattern tags, and material tags. Text-image pairs under the same cultural concept are used as positive sample pairs, and text-image pairs under different cultural concepts are used as negative sample pairs. For example, during training, 256 positive sample pairs are sampled per batch, and negative samples are generated through random combination. The InfoNCE contrastive loss function is used to calculate the similarity score between each text encoding result and the image encoding results in the same batch; diagonal positions correspond to positive samples, and other positions correspond to negative samples. When calculating the loss function, the embedding vector of each sample is normalized, and the cosine similarity is calculated and then scaled using a temperature parameter set to 0.07. The optimizer is Adam, with an initial learning rate of 0.001 and a weight decay coefficient of 0.01. Training runs for 60 epochs. After training, the text encoder, image encoder, and projectors jointly define a mapping from the raw data to a unified joint semantic space, where text and image embeddings of the same cultural concept are close together, while those of different cultural concepts are far apart.
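The batch loss described above can be sketched as follows. This is a minimal text-to-image InfoNCE under the stated temperature of 0.07. The function name is illustrative, and some implementations also average in the image-to-text direction, which the text does not specify.

```python
import math


def info_nce(text_embs, image_embs, temperature=0.07):
    """Text-to-image InfoNCE over a batch; pair i is (text_embs[i], image_embs[i])."""

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    texts = [normalize(v) for v in text_embs]
    images = [normalize(v) for v in image_embs]
    total = 0.0
    for i, t in enumerate(texts):
        # Cosine similarity of text i against every image in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(t, img)) / temperature for img in images]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # diagonal entry is the positive pair
    return total / len(texts)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)
loss_swapped = info_nce(aligned, swapped)
```

With matching pairs on the diagonal the loss is close to zero. Swapping the images makes every positive pair orthogonal and the loss rises to roughly 1/0.07.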
[0046] In the inference phase of S1, the parsed text and image data are input into the trained contrastive learning model. The text is processed by a text encoder and a projector to obtain a 128-dimensional text unimodal feature vector, which has been L2 normalized. The image is processed by an image encoder and a projector to obtain a 128-dimensional image unimodal feature vector, which has also been L2 normalized. Both vectors are now in a unified joint semantic space, called the cross-modal embedding representation. If the query contains only text, the image cross-modal embedding representation is set to zero; if the query contains only images, the text cross-modal embedding representation is set to zero; if the query contains both text and images, both vectors are non-zero.
[0047] After obtaining the cross-modal embedding representation, multimodal feature fusion processing is performed. The fusion processing is based on preset modal fusion weights, which are set according to the query scenario. When the user input is primarily text description with accompanying reference images, the text weight is set to 0.6 and the image weight to 0.4; when the user input is primarily image retrieval with accompanying textual constraints, the text weight is set to 0.4 and the image weight to 0.6; when only a single modality is input, the corresponding modality weight is set to 1, and the other modality weight is set to 0. The fusion processing uses a weighted average method, where the fused vector equals the text weight multiplied by the text cross-modal embedding representation plus the image weight multiplied by the image cross-modal embedding representation. The fused vector is then subjected to L2 normalization again to obtain the final query semantic vector. This query semantic vector simultaneously contains both the text-defined cultural concept information and the visual semantic information carried by the image.
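The fusion rule above amounts to a weighted sum followed by renormalization. A minimal sketch follows. The `text_led` flag is a hypothetical way to choose between the two preset scenarios, because the text does not say how the scenario is detected.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fusion_weights(has_text, has_image, text_led=True):
    """Return (text weight, image weight) using the preset values from the text."""
    if has_text and has_image:
        return (0.6, 0.4) if text_led else (0.4, 0.6)
    if has_text:
        return (1.0, 0.0)
    if has_image:
        return (0.0, 1.0)
    raise ValueError("query must contain text, an image, or both")


def fuse(text_emb=None, image_emb=None, text_led=True):
    """Fuse cross-modal embeddings; pass None for a modality that is absent."""
    w_text, w_image = fusion_weights(text_emb is not None, image_emb is not None, text_led)
    dim = len(text_emb if text_emb is not None else image_emb)
    text_emb = text_emb if text_emb is not None else [0.0] * dim    # absent modality -> zero vector
    image_emb = image_emb if image_emb is not None else [0.0] * dim
    fused = [w_text * t + w_image * i for t, i in zip(text_emb, image_emb)]
    return l2_normalize(fused)


query_vec = fuse([1.0, 0.0], [0.0, 1.0])   # text-led query with a reference image
text_only = fuse(text_emb=[1.0, 0.0])      # single-modality query
```

The second normalization matters because a weighted sum of two unit vectors generally has a norm below 1.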
[0048] For example, a user enters the text "Flying Apsaras image in Tang Dynasty Dunhuang murals" into the platform and uploads a photo of a flying apsaras mural. After text parsing, cultural semantic fragments such as "Tang Dynasty," "Dunhuang," "mural," and "flying apsara" are preserved. After image preprocessing, the outline and clothing folds of the flying apsaras are preserved. The text is encoded and projected to obtain a cross-modal embedding representation, which is close to regions such as "Tang Dynasty figure clothing" and "Dunhuang mural image" in the unified joint semantic space. The image is encoded and projected to obtain a cross-modal embedding representation, which is close to regions such as "detail of clothing folds in the mural" and "Tang Dynasty painting style." These are fused with a text weight of 0.6 and an image weight of 0.4 to obtain a query semantic vector. This vector can be used in subsequent steps to retrieve text and image resources related to "Flying Apsaras in Tang Dynasty Dunhuang murals" from the cultural resource database.
[0049] S2. Based on the query semantic vector, perform retrieval and matching in the pre-set multimodal vector index of the cultural resource database to obtain a candidate vector set.
[0050] In one specific embodiment, the multimodal vector index is a high-dimensional spatial index with a unified dimension. It is constructed by pre-mapping the text and image data in the cultural resource database using an embedding extraction model and adjusting the vector dimension through a preset dimensional transformation strategy to reduce semantic bias between modalities. In this case, the process of performing step S2 may specifically include the following steps:
[0051] A dimension transformation strategy is used to perform dimension alignment on the query semantic vector to obtain the mapped query vector;
[0052] Based on the mapped query vector, an approximate nearest neighbor search is performed in the multimodal vector index. The spatial metric distance between the mapped query vector and each data vector in the high-dimensional spatial index is calculated, and a subset of data whose spatial metric distance meets the preset conditions is selected as a candidate vector set.
[0053] Specifically, the cultural resource database pre-stores massive amounts of cultural resources, including textual descriptions, image data, and corresponding semantic vectors. This step aims to quickly retrieve, from a large-scale dataset containing millions or even tens of millions of cultural resources, the resources whose vectors are semantically similar to the query.
[0054] To achieve the above objectives, a multimodal vector index needs to be pre-constructed. The construction process is as follows: First, an embedding extraction model is used to perform feature mapping on the text and image data in the cultural resource database. This model structure is consistent with the contrastive learning model of S1, including a text encoder, an image encoder, and a projection head. The text encoder is a 12-layer Transformer with a hidden dimension of 768, the image encoder is a ResNet-50, and the projection head has two fully connected layers. The final output is a 128-dimensional vector, which is then normalized. For each resource, if it contains text, a text vector is extracted; if it contains an image, an image vector is extracted; if it contains both text and an image, they are weighted and fused according to modality fusion weights to obtain a unified 128-dimensional semantic vector. Because text and image modalities differ in their original feature distributions, even after contrastive learning training, subtle distribution shifts may still exist between text and image semantic vectors in the joint space. To reduce the bias between text and image modalities and unify vector scale, a dimensionality transformation strategy is applied to all resource vectors. Principal component analysis is used to calculate the covariance matrix and perform eigenvalue decomposition. The top k eigenvectors with a cumulative variance contribution rate exceeding 90% are selected to form the transformation matrix. The original vectors are multiplied by the transformation matrix to obtain k-dimensional reduced vectors, preserving key semantic information and eliminating noise, thus ensuring consistent vector distribution across different modalities. After dimensionality transformation, the reduced vectors are used to construct a high-dimensional spatial index. For example, the index structure employs a hierarchical navigable small-world graph, which supports efficient approximate nearest neighbor search. When constructing the index, the maximum number of connections in the graph is set to 32, and the candidate set size is set to 100. Each vector is inserted as a node in the index, and connections are established between nodes based on Euclidean distance, forming a navigable graph structure. After the index is constructed, a unified-dimensional multimodal vector index is obtained, which stores the dimensionality-reduced vectors of all resources in the cultural resource library and their corresponding resource identifiers.
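The component selection in the dimensionality transformation can be sketched as below, assuming the eigenvalues and eigenvectors of the covariance matrix have already been computed elsewhere. Both function names are illustrative. Mean-centering, which principal component analysis normally includes, is omitted because the text only describes multiplying by the transformation matrix.

```python
def select_num_components(eigenvalues, threshold=0.90):
    """Smallest k whose top-k eigenvalues explain more than `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return k
    return len(ordered)


def reduce_vector(vector, components):
    """Project `vector` onto k principal axes; each row of `components` is one eigenvector."""
    return [sum(c * x for c, x in zip(axis, vector)) for axis in components]


k = select_num_components([7.0, 2.5, 0.3, 0.2])          # cumulative share: 0.70, 0.95, ...
reduced = reduce_vector([3, 4, 5], [[1, 0, 0], [0, 1, 0]])
```

The same `components` matrix has to be stored and reused at query time, otherwise query vectors and indexed vectors would not share a coordinate system.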
[0055] In the online retrieval phase of S2, the same dimensional transformation strategy is first used to align the dimensions of the query semantic vector. For example, the query semantic vector output by S1 has 128 dimensions, which needs to be reduced to 64 dimensions using the same transformation matrix as during index construction. The transformation matrix has already been saved during the index construction phase. Multiplying the query semantic vector by the transformation matrix yields the mapped query vector, which has 64 dimensions, consistent with the dimensions of the data vectors in the index.
[0056] The approximate nearest neighbor search uses the search algorithm of the hierarchical navigable small-world graph. During the search, the candidate set size is set, for example to 100. The search starts from the entry node of the index, calculates the spatial metric distance between the mapped query vector and the current node vector, and moves along the graph structure towards the nearest neighbor nodes. The spatial metric distance is the Euclidean distance. A dynamic list whose size equals the candidate set size is maintained during the search, storing the vectors with the smallest distances among the nodes visited so far together with their distance values. When no closer node can be found, the search terminates, and the vectors in the list are the approximate nearest neighbor search results.
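The search loop described above can be sketched on a single graph layer. This is a simplification for illustration: a full hierarchical navigable small-world index first descends through sparser upper layers to pick the entry point. The dictionaries `vectors` and `neighbors` are hypothetical stand-ins for the stored index, and `ef` plays the role of the candidate set size.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_graph_search(query, vectors, neighbors, entry, ef=100):
    """Best-first search on one graph layer.

    vectors:   node id -> vector
    neighbors: node id -> list of adjacent node ids
    Returns up to `ef` (distance, node id) pairs, nearest first.
    """
    d0 = euclidean(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negated distance: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) >= ef and dist > -results[0][0]:
            break  # no remaining candidate can improve the result list
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = euclidean(query, vectors[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg, node) for neg, node in results)


# Toy index: ten points on a line, each linked to its immediate neighbours.
vectors = {i: [float(i)] for i in range(10)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
hits = greedy_graph_search([7.2], vectors, neighbors, entry=0, ef=3)
```

Starting from node 0, the walk moves along the chain towards the query and stops once the closest unexplored node is farther away than the worst entry in the full result list.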
[0057] After the search is complete, the spatial metric distance between the mapped query vector and each data vector in the index is obtained, and the top N vectors with the smallest distance are selected as candidate results. A preset condition can be set to retrieve the top 500 vectors with the smallest distance. Based on this condition, the 500 data vectors with the smallest spatial metric distance are selected from the search results to form a data subset. Each data vector records the identifier of its corresponding original cultural resource when the index is built; therefore, the resource identifier corresponding to each vector can be extracted from the data subset, and these resource identifiers can be output as a candidate vector set.
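The final selection step can be sketched as a top-N cut over the search output followed by a lookup of resource identifiers. The function name and the `resource_ids` mapping are hypothetical.

```python
import heapq


def select_candidates(search_results, resource_ids, top_n=500):
    """search_results: iterable of (distance, node id); returns resource ids, nearest first."""
    nearest = heapq.nsmallest(top_n, search_results)
    return [resource_ids[node] for _, node in nearest]


candidate_ids = select_candidates(
    [(0.8, 1), (0.2, 0), (1.2, 2)],
    {0: "res-a", 1: "res-b", 2: "res-c"},
    top_n=2,
)
```

A search list capped at 100 entries can return at most 100 results, so retrieving 500 vectors would presumably require a candidate set size of at least 500 at query time. The text does not reconcile these two example values.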
[0058] For example, the cultural resource database contains 2 million cultural resources, including images and descriptions of "Cave 45 of the Tang Dynasty Dunhuang murals," "Cup of Lotus Petal Pattern from the Song Dynasty," and "Plum Vase with Dragon Pattern from the Yuan Dynasty," etc. The query semantic vector output by S1 corresponds to the query "Flying Apsaras Images in the Tang Dynasty Dunhuang Murals." Multiplying this query semantic vector by the transformation matrix yields a 64-dimensional mapped query vector. An approximate nearest neighbor search is performed in the multimodal vector index to retrieve the 500 data vectors with the smallest distance. The cultural resources corresponding to these vectors include "Cave 45 of the Tang Dynasty Dunhuang murals," "Cave 57 of the Tang Dynasty Dunhuang murals," "Tang Dynasty Dunhuang silk paintings," "Catalogue of Tang Dynasty Mural Costumes," "Research Literature on Flying Apsaras Images in Dunhuang," and "Analysis Report on Pigments in Tang Dynasty Murals," etc. These resources constitute the candidate vector set. Through a pre-constructed index and approximate nearest neighbor search, the retrieval range is rapidly reduced from two million to five hundred, significantly reducing the computational load of subsequent processing while ensuring that the resources most relevant to the query semantic vector are effectively retrieved.
[0059] S3. Project the query semantic vector and each candidate vector to the preset cultural semantic alignment space, use the preset weight matrix to reweight the dimensional components of different cultural attributes in the space, calculate the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and select the preliminary result set.
[0060] In one specific embodiment, the process of performing step S3 may specifically include the following steps:
[0061] Project the query semantic vector and each candidate vector in the candidate vector set onto the cultural semantic alignment space to obtain the corresponding spatial projection vector;
[0062] Using a pre-defined weight matrix that incorporates culturally unique concepts, the dimensional components representing different cultural attributes in each spatial projection vector are weighted and adjusted, and the reweighted query semantic vector and the reweighted candidate vector are output.
[0063] The cosine similarity between the reweighted query semantic vector and each reweighted candidate vector in the cultural semantic alignment space is defined as the comprehensive similarity.
[0064] Determine whether the comprehensive similarity of each candidate vector meets the preset screening conditions, and construct a preliminary result set based on the cultural resources mapped by the candidate vectors that meet the preset screening conditions.
[0065] Specifically, general similarity calculations fail to reflect the importance of cultural attributes such as dynasty, region, pattern, and material in cultural resources. This leads to resources that match the query in terms of cultural connotation being missed due to differences in visual features, while resources that are visually similar but have different cultural meanings may be falsely detected. Therefore, guiding the retrieval process to the cultural semantic dimension allows cultural attributes to receive higher weight in similarity calculations.
[0066] A cultural semantic alignment space and a cultural weight matrix are pre-constructed. The cultural semantic alignment space is a semantic space further optimized for cultural attribute dimensions based on the unified joint semantic space of S1. The construction method of this space is as follows: Training samples with cultural attribute annotations are collected. Each sample includes a cultural resource image and its text description, and is labeled with tags such as dynasty, region, material, pattern, and artifact type. For example, for the sample "Flying Apsaras from Cave 45 of the Dunhuang Mural in the Tang Dynasty," the dynasty label is "Tang Dynasty," the region label is "Dunhuang," the material label is "mural," the pattern label is "Flying Apsaras," and the artifact type label is "grotto statue." A multi-task contrastive learning framework is used for training. The training model includes a text encoder, an image encoder, and a projection head, with a structure consistent with S1. A cultural attribute classification branch is added to the training loss, and a fully connected layer is added to the projected vectors to predict the cultural attribute categories such as dynasty, region, material, pattern, and artifact type. The loss function consists of a weighted average of contrastive loss and attribute classification loss. Contrastive loss shortens the distance between positive samples representing the same cultural concept, while attribute classification loss allows the attribute-related dimensions of the vector to carry richer attribute information. During training, the batch size is set to 256, the contrastive loss parameter is 0.07, and the attribute classification loss weight is 0.3. 
Model fine-tuning through the loss function reduces the spatial metric distance between different modalities representing the same culturally unique concept in the high-dimensional vector space, while widening the distance between modalities representing different cultural concepts. This results in a high-dimensional joint embedding feature space that eliminates the modality gap and possesses cultural attribute representation capabilities. After training, the output vector space of the cultural semantic mapping layer, further trained based on the unified joint semantic space, becomes the cultural semantic alignment space. In this space, the sensitivity of each dimension of the vector to cultural attributes varies; some dimensions respond more strongly to dynastic information, while others respond more strongly to pattern information.
[0067] The cultural weight matrix is used to perform weighted adjustments on the components of each dimension, giving greater attention to specific cultural attributes during retrieval. Its construction is based on statistical analysis of the response intensity of cultural attributes across different dimensions. A batch of validation samples with clearly defined cultural attributes is selected, including cultural resources from different dynasties, regions, and materials. Each sample is projected into the cultural semantic alignment space to obtain its spatial projection vector, and the value of that vector in each dimension is recorded. For each cultural attribute category, such as "Tang Dynasty," the average value of all sample vectors belonging to that category in each dimension is calculated to obtain the dimension response mean vector; simultaneously, the average value of sample vectors not belonging to that category in each dimension is calculated to obtain the dimension response baseline vector. The importance weight of each dimension to the attribute "Tang Dynasty" is determined by the degree of difference between the category mean and the baseline mean in that dimension; the greater the difference, the higher the weight. The above statistics are performed on all cultural attribute categories to obtain the weight vector corresponding to each attribute category. The weight vectors of multiple attributes are then combined dimension by dimension, by taking either the maximum or the sum, to obtain the comprehensive cultural weight matrix. In actual calculations, this matrix is represented as a weight vector whose length equals the dimension of the projection vector. Each element represents the amplification factor of the corresponding dimension when reweighted. 
For example, if dimension 32 responds strongly to dynastic information, it is assigned a higher weight value (e.g., 1.8); if dimension 64 responds strongly to pattern information, it is assigned a higher weight value (e.g., 1.6); and for dimensions such as dimension 15 and dimension 82, which do not respond to cultural attributes, a basic weight value of 1.0 is assigned.
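The statistical construction of the weight vector can be sketched as follows. The description states only that a larger difference between the category mean and the baseline mean yields a higher weight; the specific mapping used here (`base + scale * |difference|`, clipped to the range 0.5 to 2.0) is an assumption for illustration, and the function names are hypothetical.

```python
def attribute_weights(samples, labels, category,
                      base=1.0, scale=1.0, lo=0.5, hi=2.0):
    """Per-dimension weights for one attribute category.

    samples: list of projection vectors; labels: list of label sets.
    Assumption: weight = base + scale * |category mean - baseline mean|,
    clipped to [lo, hi].
    """
    in_cat = [v for v, l in zip(samples, labels) if category in l]
    out_cat = [v for v, l in zip(samples, labels) if category not in l]
    if not in_cat or not out_cat:
        raise ValueError("need samples both inside and outside the category")
    weights = []
    for d in range(len(samples[0])):
        mean_in = sum(v[d] for v in in_cat) / len(in_cat)     # response mean
        mean_out = sum(v[d] for v in out_cat) / len(out_cat)  # baseline
        w = base + scale * abs(mean_in - mean_out)
        weights.append(min(hi, max(lo, w)))
    return weights

def combine_by_max(weight_vectors):
    """Merge per-attribute weight vectors by the dimension-wise maximum."""
    return [max(column) for column in zip(*weight_vectors)]
```

Dimensions that do not distinguish any category keep the base weight of 1.0, matching the behavior described for non-responsive dimensions.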
[0068] During online processing in step S3, the candidate resource identifier set output in step S2 is first obtained. This set contains several resource identifiers obtained through approximate nearest neighbor search. Based on these resource identifiers, the corresponding original semantic feature vectors are read from the cultural resource database. These original semantic feature vectors are 128-dimensional vectors stored when the resources were added to the database, and have already undergone L2 normalization.
[0069] The query semantic vector and the retrieved original semantic feature vectors are projected onto the cultural semantic alignment space. The projection operation is implemented through a pre-trained cultural semantic mapping layer. This layer has the same structure as the projection head in S1 but different parameters: it is obtained by multi-task fine-tuning of the S1 model with cultural attribute labels such as dynasty, region, and material. The query semantic vector is input into the cultural semantic mapping layer, and the output is the spatial projection vector in the cultural semantic alignment space, with a dimension of 512, on which L2 normalization is performed. Each original semantic feature vector is also input into the cultural semantic mapping layer to obtain a corresponding 512-dimensional spatial projection vector. All projection vectors have the same dimension and have been normalized.
[0070] The cultural weight matrix is a one-dimensional vector with the same length as the projection vector (e.g., 512 dimensions), and each element's value ranges from 0.5 to 2.0. Weighting adjustment uses element-wise multiplication, multiplying each dimensional component of the query projection vector by the corresponding element of the weight matrix to obtain the reweighted query vector. The same element-wise multiplication is performed on each candidate projection vector to obtain the reweighted candidate vector. After weighting adjustment, dimensions with strong cultural attribute responses are amplified, while those with weak responses are suppressed. After reweighting, L2 normalization is performed again on the reweighted vector to maintain consistent vector magnitude.
[0071] The cosine similarity between the reweighted query vector and each reweighted candidate vector is calculated. The result is defined as the comprehensive similarity, ranging from -1 to 1. A higher value indicates that the two vectors are closer in direction in the cultural semantic alignment space, i.e., the higher the cultural semantic relevance. Preset filtering conditions can be set to a comprehensive similarity greater than a threshold, such as 0.7; or to select the top N (e.g., 200) candidate vectors with the highest comprehensive similarity. Based on the filtering conditions, vectors that meet the conditions are selected from the currently participating candidate vectors. Each vector corresponds to a resource identifier. The cultural resources mapped by the candidate vectors that meet the conditions are collected to construct a preliminary result set. The preliminary result set contains multiple cultural resources, each resource including a resource identifier and its corresponding text description, image file path, metadata, etc.
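The reweighting, renormalization, and screening described above can be sketched in a few lines. This is a minimal illustration that assumes the projection vectors are already available as Python lists; the function names and the dictionary layout of `candidates` are hypothetical.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def reweight(vec, weights):
    """Element-wise multiplication by the weight vector, then L2 normalization."""
    return l2_normalize([x * w for x, w in zip(vec, weights)])

def comprehensive_similarity(query_proj, cand_proj, weights):
    """Cosine similarity of the reweighted vectors (dot product of unit vectors)."""
    q = reweight(query_proj, weights)
    c = reweight(cand_proj, weights)
    return sum(a * b for a, b in zip(q, c))

def preliminary_result_set(query_proj, candidates, weights,
                           threshold=None, top_n=None):
    """candidates: resource id -> projection vector.

    Keeps candidates whose similarity exceeds `threshold` and/or the
    `top_n` highest-scoring ones, returned as (id, similarity) pairs.
    """
    scored = [(comprehensive_similarity(query_proj, v, weights), rid)
              for rid, v in candidates.items()]
    scored.sort(reverse=True)
    if threshold is not None:
        scored = [(s, rid) for s, rid in scored if s > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [(rid, s) for s, rid in scored]
```

In a two-dimensional toy case with weights `[2.0, 1.0]`, a candidate `[0.6, 0.8]` scores about 0.83 against the query `[1.0, 0.0]`, whereas with uniform weights it scores 0.6, which shows how amplifying an attribute-sensitive dimension can move a candidate across a 0.7 threshold.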
[0072] For example, the input query for S1 is "flying apsaras images in Tang Dynasty Dunhuang murals," and the output of S2 is 500 candidate vectors with their resource identifiers. Based on these identifiers, the corresponding 500 original 128-dimensional semantic feature vectors are read and mapped to the 512-dimensional cultural semantic alignment space via the cultural semantic mapping layer. The vectors are then reweighted, and the cosine similarity is calculated to obtain the comprehensive similarity between each candidate vector and the query vector. The filtering condition is set to a comprehensive similarity greater than 0.75, resulting in 42 candidate vectors. The cultural resources corresponding to these vectors include "flying apsaras images from Cave 45 of the Tang Dynasty Dunhuang murals," "flying apsaras paintings on silk from the Tang Dynasty Dunhuang," "Catalogue of flying apsaras costumes from Tang Dynasty murals," and "Research literature on the image of flying apsaras in Dunhuang," etc. These resources are then compiled to form a preliminary result set.
[0073] To visually illustrate the role of spatial transformation and weight adjustment in the semantic representation of cultural resources in this application, please refer to Figure 2 and Figure 3. Figure 2 is a schematic diagram of existing technology, showing the two-dimensional visualization distribution of mural resource vectors from different dynasties in a general joint semantic space. In it, the resource vectors of Tang Dynasty murals, Song Dynasty murals, and Yuan Dynasty murals overlap heavily, and the boundaries between classes are not clear. Figure 3 illustrates the two-dimensional visualization of the distribution of query vectors and candidate resource vectors after they are projected onto the cultural semantic alignment space and reweighted using the weight matrix. It shows that resource vectors from the same dynasty exhibit increased clustering, while the separation between resource vectors from different dynasties increases. Query vectors represented by pentagrams (e.g., "Tang Dynasty Dunhuang murals") are closer to the Tang Dynasty mural resource cluster area than to resource clusters from other dynasties. This diagram demonstrates that cultural semantic alignment and reweighting help strengthen the representation of cultural attributes such as dynasty in the vector space, reducing confusion between different cultural categories.
[0074] S4. Compare the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set with the historical intent information extracted from the multimodal query data to obtain a refined result set.
[0075] In one specific embodiment, the process of performing step S4 may specifically include the following steps:
[0076] Obtain cross-modal spatiotemporal metadata pre-bound to each cultural resource in the preliminary results set;
[0077] Analyze multimodal query data to extract historical intent information that represents the user's search needs. The historical intent information includes time-dimensional intent and spatial-dimensional intent.
[0078] The cross-modal spatiotemporal metadata and historical intent information are mapped to a preset historical spatiotemporal knowledge graph. The temporal period overlap between the temporal dimension intent and the cross-modal spatiotemporal metadata, as well as the spatial region overlap between the spatial dimension intent and the cross-modal spatiotemporal metadata, are calculated respectively.
[0079] Determine whether the temporal overlap and spatial overlap of each cultural resource meet the preset consistency verification threshold. If so, retain the corresponding cultural resource and construct a refined result set.
[0080] Specifically, the preliminary results contain multiple cultural resources, each of which has been filtered for cultural semantic similarity. However, it has not yet been verified whether their temporal and spatial characteristics match the user's query intent. Cultural resource retrieval not only requires semantic similarity but also needs to be consistent with the user's query intent in terms of temporal context and spatial location. For example, when a user queries "Tang Dynasty Dunhuang murals," the search results should be limited to the Tang Dynasty period and the Dunhuang region to avoid including mural resources from other dynasties or other regions.
[0081] Each resource in the cultural resource database is pre-assigned cross-modal spatiotemporal metadata. This metadata includes temporal and spatial information. Temporal information can be further subdivided into specific dates, dynasties, periods, or time ranges, while spatial information can be subdivided into excavation location, collection location, cultural sphere, geographical coordinates, or region name. For example, for the resource "Tang Dynasty Dunhuang Mural Cave 45," the associated temporal metadata would be "Tang Dynasty (618-907 AD)," and the spatial metadata would be "Dunhuang (94.8°E, 40.1°N)." This metadata can be annotated by experts or imported from existing cultural relic databases when the resource is added to the database.
[0082] Temporal and spatial metadata are retrieved from the storage records of each cultural resource. Temporal metadata is stored in a structured format, such as the start year, end year, and dynasty name; spatial metadata is stored in the form of geographic coordinates or standard place names. Temporal metadata is uniformly formatted as a continuous time interval, such as mapping "the High Tang period" to 650-755 AD; spatial metadata is uniformly converted into geographic coordinates or administrative region codes, such as converting "Dunhuang" into coordinates of 94.8 degrees east longitude and 40.1 degrees north latitude and the corresponding administrative region code.
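The normalization of temporal metadata into a continuous interval can be sketched as a lookup. The record fields `start_year`, `end_year`, and `dynasty` are hypothetical names for the structured fields mentioned above, and the table holds only the two intervals given in this description.

```python
# Period name -> (start year, end year); both entries come from the description.
PERIOD_INTERVALS = {
    "Tang Dynasty": (618, 907),
    "High Tang": (650, 755),
}

def normalize_temporal(record):
    """Return a (start, end) interval for a resource record, or None.

    Explicit years take priority; otherwise the dynasty or period name
    is looked up in the table. Field names are illustrative assumptions.
    """
    start, end = record.get("start_year"), record.get("end_year")
    if start is not None and end is not None:
        return (start, end)
    return PERIOD_INTERVALS.get(record.get("dynasty"))
```

A record that carries neither explicit years nor a known period name yields `None`, leaving the decision on how to treat such resources to the caller.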
[0083] Historical intent information can originate from query text, query images, or user search settings, and represents search intent information related to cultural history and time. The parsing process is as follows: For the text portion of the query, natural language processing techniques are used to identify temporal and spatial expressions. Temporal expression identification employs a dictionary- and rule-based approach; the dictionary contains a list of dynasty names and temporal phrase patterns. Spatial expression identification uses a place name entity recognition model trained on the BERT architecture, which takes the query text as input and outputs the text fragments that are place names together with their standardized place name codes. For the image portion of the query, image classification models are used to predict the era and region to which the image most likely belongs. The image era classification model is trained on the ResNet-50 architecture. Its training data includes cultural images labeled with dynasty information, and its output is the probability distribution of an image belonging to each dynasty. The dynasty with the highest probability is taken as the image's temporal intent. The image region classification model is also trained on ResNet-50, with training data including cultural images labeled with region information. Its output is the probability distribution of an image belonging to each cultural region, and the region with the highest probability is taken as the image's spatial intent. When an image contains identifiable text, such as rubbings of inscriptions or inscriptions on paintings and calligraphy, optical character recognition (OCR) can be used to extract the text and supplement the temporal and spatial information. When a query contains both text and images, the text recognition results and image recognition results are fused. 
The fusion rule can be set to prioritize the temporal intent of text recognition and supplement it with the temporal intent of image recognition; when the two conflict, text recognition takes precedence. For spatial intent, text recognition also takes precedence. When a query contains only images, the image recognition results are used as historical intent information; when a query contains only text, the text recognition results are used as historical intent information. In addition, historical intent information can also be obtained from the user's search settings, such as the time range dropdown options or regional filters selected by the user in the search interface. The final extracted historical intent information includes temporal and spatial intent. The temporal intent represents a specific dynasty or time interval; the spatial intent represents a specific region name or geographical coordinate range.
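The text-first fusion rule can be expressed compactly. In this sketch the `Intent` type and the `fuse_intent` function are hypothetical; a value of `None` stands for an intent that the corresponding recognizer did not produce. User search settings are left out because the description does not state their priority relative to the recognition results.

```python
from typing import NamedTuple, Optional

class Intent(NamedTuple):
    temporal: Optional[str]  # e.g. a dynasty name or time interval label
    spatial: Optional[str]   # e.g. a region name

def fuse_intent(text: Intent, image: Intent) -> Intent:
    """Text recognition takes precedence; image recognition fills the gaps."""
    return Intent(
        temporal=text.temporal if text.temporal is not None else image.temporal,
        spatial=text.spatial if text.spatial is not None else image.spatial,
    )
```

For a text-only or image-only query, passing `Intent(None, None)` for the missing modality reduces the rule to using the available recognition result directly.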
[0084] The historical spatiotemporal knowledge graph is pre-constructed, consisting of nodes and edges. Nodes represent time periods, geographical locations, and cultural resource entities, while edges represent the relationships between resources and time, space, or events. The knowledge graph stores the start and end years of dynasties, the geographical locations and changes of historical place names, and the boundaries of cultural regions. For example, the knowledge graph stores the start year of "Tang Dynasty" as 618 and the end year as 907; the geographical coordinates of "Dunhuang" as 94.8 degrees east longitude and 40.1 degrees north latitude, as well as the historical boundary polygon of Dunhuang Prefecture. The historical spatiotemporal knowledge graph provides a standardized framework for spatiotemporal knowledge representation and reasoning computation, mapping spatiotemporal information from diverse sources into a unified knowledge structure, facilitating the quantification of temporal overlap and spatial coincidence. During the mapping process, the temporal metadata of resources is mapped to time nodes, the spatial metadata of resources is mapped to geographical nodes, and the temporal and spatial intent of queries are also mapped to nodes in the graph.
[0085] The temporal overlap between resources and queries can be calculated using the graph structure. This overlap is defined as the ratio of the intersection length to the union length of the resource's time interval and the query intent's time interval. For example, if the time interval of the user's query intent is the Tang Dynasty (618-907 AD), and the cultural resource's time metadata is the High Tang period (650-755 AD), then the intersection is 105 years, the union is 289 years, and the temporal overlap is 105 / 289 = 0.363. Similarly, the spatial overlap can be calculated as the ratio of the intersection area to the union area of the resource's coverage area and the query intent's area. For example, the area of the user's query intent is 7,854 km², the area of the cultural resource's spatial metadata is 31,200 km², and the area of their intersection is assumed to be 5,000 km². The union area is 7,854 + 31,200 - 5,000 = 34,054 km², so the spatial overlap is 5,000 / 34,054 = 0.147. Preferably, for point resources, the spatial overlap is defined as follows: if the point is within the region, the overlap is 1; if the point is not within the region, the overlap is 0. Both the temporal overlap and the spatial overlap take values between 0 and 1.
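Both ratios are intersection-over-union measures and can be checked with a short sketch. The function names are hypothetical, and the point test uses an axis-aligned bounding box as a simplifying assumption in place of the boundary polygons stored in the knowledge graph.

```python
def temporal_overlap(a, b):
    """Intersection length over union length of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_overlap_from_areas(area_a, area_b, area_inter):
    """Intersection area over union area, given the three areas in km²."""
    union = area_a + area_b - area_inter
    return area_inter / union if union > 0 else 0.0

def point_overlap(point, bbox):
    """1.0 if (lon, lat) lies inside bbox = (lon_min, lat_min, lon_max, lat_max).

    Assumption: a bounding box approximates the region boundary.
    """
    lon, lat = point
    lon_min, lat_min, lon_max, lat_max = bbox
    return 1.0 if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max else 0.0

# Values from the examples in the description.
print(round(temporal_overlap((618, 907), (650, 755)), 3))         # 0.363
print(round(spatial_overlap_from_areas(7854, 31200, 5000), 3))    # 0.147
```

Two intervals that do not intersect, such as the Tang Dynasty and any later period, give a temporal overlap of exactly 0.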
[0086] After calculating the temporal overlap and spatial overlap, each resource is assessed to determine if it simultaneously meets a preset consistency threshold. This threshold can be set according to search requirements, such as a temporal overlap greater than 0.7 and a spatial overlap greater than 0.6. Resources meeting these criteria are retained, while those not are removed. This filtering ensures that the retained resources match the query in both cultural semantics and historical spatiotemporal attributes, thus forming a refined result set. Each resource in the refined result set retains its resource identifier, text description, image file path, and cross-modal spatiotemporal metadata.
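The consistency verification is a conjunction of two threshold tests. A minimal sketch, assuming each resource is a dictionary that already carries its computed `temporal_overlap` and `spatial_overlap` values (hypothetical field names), with the example thresholds 0.7 and 0.6 as defaults:

```python
def refine(resources, temporal_min=0.7, spatial_min=0.6):
    """Keep resources whose temporal and spatial overlaps both exceed the thresholds."""
    return [
        r for r in resources
        if r["temporal_overlap"] > temporal_min
        and r["spatial_overlap"] > spatial_min
    ]
```

Because the input order is preserved, the refined result set can be passed on to the sorting step without re-indexing.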
[0087] For example, the initial result set output by S3 contains 42 cultural resources, including "Flying Apsaras Image from Cave 45 of the Dunhuang Mural in the Tang Dynasty," "Flying Apsaras Painting on Silk from the Dunhuang Mural in the Tang Dynasty," "Catalogue of Flying Apsaras Costumes from the Tang Dynasty Mural," "Research Literature on Flying Apsaras Images in Dunhuang," and "Song Dynasty Celadon Bowl," etc. S4 extracts historical intent information in the time dimension of "Tang Dynasty" and the spatial dimension of "Dunhuang." After obtaining the cross-modal spatiotemporal metadata for each resource, the temporal overlap and spatial coincidence are calculated. For the "Song Dynasty Celadon Bowl," its temporal metadata is "Song Dynasty," which has no overlap with "Tang Dynasty," resulting in a temporal overlap of 0, and it fails verification. For the "Research Literature on Flying Apsaras Images in Dunhuang," its spatial metadata is "Beijing," which has a spatial coincidence of 0 with "Dunhuang," and it also fails verification. Finally, 38 resources are retained that pass verification; all of these are cultural resources that belong to the Tang Dynasty in time and are spatially related to Dunhuang. These 38 resources constitute the refined result set.
[0088] S5. Sort the refined result set based on the comprehensive similarity to determine the initial output sequence.
[0089] In one specific embodiment, the process of performing step S5 may specifically include the following steps:
[0090] Extract each cultural resource from the refined result set and obtain the comprehensive similarity corresponding to the candidate vector mapped by each cultural resource;
[0091] Using a pre-defined sorting algorithm, the cultural resources in the refined result set are sorted in descending order of their corresponding comprehensive similarity, which determines their priority.
[0092] The initial output sequence is generated based on the results sorted in descending order of priority.
[0093] Specifically, each candidate vector undergoes reweighting and cosine similarity calculation to obtain a comprehensive similarity value. This value corresponds one-to-one with the candidate vector, and the candidate vector has a mapping relationship with the cultural resource, which is established through the resource identifier. Therefore, the comprehensive similarity can be obtained from the calculation results of S3 based on the identifier of the cultural resource.
[0094] The sorting algorithm can be quicksort or mergesort. In its implementation, cultural resources and their comprehensive similarity scores are grouped into key-value pairs, which are then sorted in descending order using the comprehensive similarity score as the key. Since the number of resources in the refined result set is usually no more than a few hundred, the sorting process can be completed quickly in memory. After sorting, a list of cultural resources in descending order of similarity is obtained. The first resource in the list has the highest comprehensive similarity, and the last resource has the lowest.
[0095] The initial output sequence is a sorted list of cultural resources. Each element in the list contains information such as the resource's identifier, text description, image file path, metadata, and corresponding comprehensive similarity. This sequence is maintained in descending order, reflecting the semantic relevance between the resources and the query.
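The sorting in S5 can be sketched with the built-in sort. Python's `sorted` uses Timsort, a stable merge-sort hybrid, rather than the quicksort or mergesort named in the description; the resulting order is the same. The function name and the `similarity_by_id` mapping are hypothetical.

```python
def initial_output_sequence(refined_ids, similarity_by_id):
    """Order resource ids by comprehensive similarity, highest first.

    similarity_by_id: resource id -> comprehensive similarity from S3.
    """
    return sorted(refined_ids, key=lambda rid: similarity_by_id[rid], reverse=True)
```

Because the sort is stable, resources with equal similarity keep their relative order from the refined result set.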
[0096] S6. Based on the semantic features of each resource in the initial output sequence, cluster and group them to construct a candidate resource pool containing multiple topic clusters. For any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, local deduplication is performed.
[0097] In one specific embodiment, the process of performing step S6 may specifically include the following steps:
[0098] Extract the semantic feature vectors corresponding to each cultural resource in the initial output sequence, use a clustering algorithm in the cultural semantic alignment space to calculate the spatial distance between each semantic feature vector, and divide the initial output sequence into multiple topic clusters based on the proximity of the spatial distance to construct a candidate resource pool;
[0099] Calculate the average clustering distance of each semantic feature vector within each topic cluster in the cultural semantic alignment space, or the number of vectors within a preset neighborhood radius, and quantify the intra-cluster distribution density of each topic cluster.
[0100] Determine whether the intra-cluster distribution density of any topic cluster exceeds a preset redundancy threshold. If so, within any topic cluster, based on the priority in the initial output sequence, retain a preset number of target cultural resources that rank at the top and remove the remaining cultural resources to complete the local deduplication process for any topic cluster.
[0101] Specifically, the initial output sequence contains a list of cultural resources sorted in descending order of comprehensive similarity, with each resource corresponding to a semantic feature vector in the cultural semantic alignment space. This sequence may contain a large number of semantically similar or repetitive cultural resources, such as different image copies of the same topic, multiple photos of the same artifact, or multiple documents describing the same cultural concept. Directly presenting these to the user would cause information redundancy and reduce browsing efficiency. Therefore, it is necessary to group and aggregate semantically similar resources, remove redundancy within each group, and retain the most representative resources.
[0102] Upon input of each cultural resource, semantic mapping is first performed using a pre-trained contrastive learning model. The projection head maps the single-modal vector to a pre-constructed cultural semantic alignment space through a fully connected layer. For cultural resources containing both text and images, the text and image vectors are weighted and fused according to preset weights to obtain a fused vector. For cultural resources containing only text or only images, the vector of the corresponding modality is directly taken. The fused vector or single-modal vector is then mapped by the projection head to finally generate a 512-dimensional semantic feature vector, which is then subjected to L2 normalization to make the vector magnitude 1. The generated semantic feature vector, along with the corresponding cultural resource identifier, is persistently stored in the database for subsequent cross-modal retrieval, clustering, and ranking operations.
[0103] The clustering algorithm can employ density-based spatial clustering of applications with noise (DBSCAN), which does not require pre-specifying the number of clusters, can automatically identify clusters of arbitrary shapes, and can identify noise points. For example, during clustering, the neighborhood radius parameter is set to 0.3, and the minimum number of samples within the neighborhood of a core point is set to 3. The algorithm execution process is as follows: For each semantic feature vector corresponding to a resource in the initial output sequence, its Euclidean distance to the other vectors is calculated. If a vector has at least 3 other vectors within a neighborhood radius of 0.3, that vector is marked as a core point, and all vectors within its neighborhood are grouped into the same cluster. The neighborhood of the core point is iteratively expanded, grouping all density-connected vectors into the same cluster. Vectors not grouped into any cluster are marked as noise points. Noise points do not participate in subsequent topic cluster construction but remain in the candidate resource pool as independent resources. After clustering, multiple topic clusters are obtained, each containing a set of cultural resources with similar semantic feature vectors. All topic clusters and un-clustered noise points are merged to form the candidate resource pool.
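The clustering procedure described above can be sketched in pure Python. This is a simplified illustration with quadratic neighbor search, not an optimized implementation; it follows the wording of the description by counting "other vectors" within the radius, which means the core-point test excludes the point itself. The names `dbscan`, `eps`, and `min_neighbors` are hypothetical.

```python
import math

NOISE = -1

def dbscan(vectors, eps=0.3, min_neighbors=3):
    """Return one cluster label per vector; NOISE (-1) marks noise points."""
    n = len(vectors)
    # Indices of the other vectors within distance eps of each vector.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster              # i is a core point: start a cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point joins the cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_neighbors:
                frontier.extend(neighbors[j])  # expand through core points
        cluster += 1
    return labels
```

Vectors labeled `NOISE` stay in the candidate resource pool as independent resources, as described above.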
[0104] The average cluster distance is calculated as follows. For a topic cluster containing m vectors, calculate the pairwise Euclidean distances between all vectors within the cluster, sum them, and divide by the number of vector pairs to obtain the average cluster distance. A smaller average cluster distance indicates a denser cluster of vectors, meaning the resources are semantically closer; a larger average cluster distance indicates a more dispersed cluster of vectors, meaning the resources have greater semantic differences. For example, topic cluster A contains ten vectors with an average pairwise distance of 0.15; topic cluster B contains eight vectors with an average pairwise distance of 0.42. The intra-cluster distribution density of topic cluster A is higher than that of topic cluster B. Following the inverse relationship between distance and density, a smaller average distance between data points indicates a denser data distribution; conversely, a larger average distance indicates a sparser distribution. For example, the mapping formula between the average cluster distance and the intra-cluster distribution density is: Intra-cluster distribution density = 1 / (1 + average cluster distance), using (1 + average cluster distance) to prevent division by zero. As another quantification method, density can also be measured by the number of vectors within a preset neighborhood radius. That is, for each topic cluster, with the cluster center as the center of a sphere, the number of vectors within the preset radius is counted; a higher number indicates a higher density. For example, the mapping formula between the number of vectors within the preset neighborhood radius and the cluster distribution density is: Cluster distribution density = Number of vectors. Either method can be used, or a combination thereof.
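Both density measures can be sketched directly from the definitions above. Taking the cluster center as the component-wise mean of the member vectors is an assumption, since the description does not define the center; the function names are hypothetical.

```python
import math
from itertools import combinations

def average_cluster_distance(vectors):
    """Mean Euclidean distance over all pairs of vectors in a cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def density_from_distance(avg_distance):
    """Intra-cluster distribution density = 1 / (1 + average cluster distance)."""
    return 1.0 / (1.0 + avg_distance)

def density_by_count(vectors, radius):
    """Number of vectors within `radius` of the cluster center.

    Assumption: the center is the component-wise mean of the vectors.
    """
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sum(1 for v in vectors if math.dist(v, center) <= radius)
```

For the example clusters, an average distance of 0.15 maps to a density of about 0.87 and 0.42 maps to about 0.70, so cluster A is the denser of the two under this formula.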
[0105] The preset redundancy threshold is a quantification threshold for the distribution density within a cluster, which can be set according to actual needs. For example, when using the average cluster distance as the density quantification index, an upper limit threshold for the average cluster distance can be set (e.g., 0.25). This upper limit threshold corresponds inversely to the preset redundancy threshold: the smaller the average cluster distance, the higher the distribution density within the cluster. Therefore, when the average cluster distance within a cluster is less than 0.25, it indicates that the distribution density within the cluster exceeds the preset redundancy threshold, and the vector distribution within the cluster is considered too dense, indicating redundancy. For topic clusters where the distribution density within the cluster exceeds the threshold, local deduplication is performed. Within the topic cluster, a preset number of target cultural resources are retained based on the priority in the initial output sequence, and the remaining cultural resources are removed. The preset number can be set to 1 or 2, meaning that each dense cluster retains only one or two resources with the highest similarity as representatives. After deduplication, only representative resources are retained for the topic cluster, and the remaining resources are removed from the candidate resource pool. For topic clusters where the distribution density within the cluster does not exceed the threshold, all resources are retained. Noise points are directly retained.
[0106] After the above processing, the candidate resource pool contains multiple topic clusters. The resources in each cluster are subjected to density judgment and deduplication, which maintains semantic relevance and structural clarity, while eliminating redundancy.
[0107] For example, the initial output sequence contains thirty-eight cultural resources, all related to "Tang Dynasty Dunhuang murals," including twenty images of "flying apsaras" from different angles, ten images of "Jataka tales," and eight images of "patrons." In the cultural semantic alignment space, a density-based clustering algorithm groups them into three topic clusters: the flying apsaras cluster contains eighteen resources, the Jataka tales cluster contains nine resources, and the patrons cluster contains seven resources, with the four remaining resources designated as noise points. The average cluster distance is calculated: 0.12 for the flying apsaras cluster, 0.28 for the Jataka tales cluster, and 0.31 for the patrons cluster. The upper limit threshold for the average cluster distance is set to 0.25. Because the average cluster distance of the flying apsaras cluster (0.12) is below 0.25, its density exceeds the preset redundancy threshold and deduplication is triggered, retaining the two flying apsaras resources with the highest similarity and discarding the remaining sixteen. Because the average cluster distances of the Jataka tales cluster (0.28) and the patrons cluster (0.31) are above 0.25, their densities do not exceed the threshold, and all of their resources are retained. The final candidate resource pool contains: 2 resources from the flying apsaras cluster, 9 resources from the Jataka tales cluster, 7 resources from the patrons cluster, and 4 noise point resources, for a total of 22 resources. This candidate resource pool maintains the diversity of topics while eliminating redundancy within dense topics, providing a well-structured and non-redundant input for subsequent steps.
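The local deduplication in this example can be reproduced with a short Python sketch. The cluster sizes, average cluster distances, and the 0.25 upper limit come from the example; the resource identifiers, the similarity scores, and the function names are hypothetical placeholders.

```python
def local_deduplicate(clusters, avg_distances, noise, max_avg_distance=0.25, keep=2):
    """Keep only the top `keep` resources of each overly dense cluster.

    clusters: cluster name -> list of (resource_id, comprehensive_similarity)
    avg_distances: cluster name -> average cluster distance
    A cluster is treated as redundant when its average cluster distance is
    below `max_avg_distance`, i.e. its density exceeds the redundancy threshold.
    """
    pool = {}
    for name, members in clusters.items():
        # Priority follows the initial output sequence: descending similarity.
        ranked = sorted(members, key=lambda item: item[1], reverse=True)
        if avg_distances[name] < max_avg_distance:
            ranked = ranked[:keep]
        pool[name] = ranked
    pool["noise"] = list(noise)  # noise points are retained directly
    return pool


def make_resources(prefix, count, top_score):
    """Generate placeholder resources with hypothetical similarity scores."""
    return [(f"{prefix}-{i}", round(top_score - 0.01 * i, 2)) for i in range(count)]


clusters = {
    "flying_apsaras": make_resources("fa", 18, 0.95),
    "jataka_tales": make_resources("jt", 9, 0.90),
    "patrons": make_resources("pt", 7, 0.85),
}
avg_distances = {"flying_apsaras": 0.12, "jataka_tales": 0.28, "patrons": 0.31}
noise = make_resources("nz", 4, 0.70)

pool = local_deduplicate(clusters, avg_distances, noise)
sizes = {name: len(members) for name, members in pool.items()}
total = sum(sizes.values())
```

Under these assumptions, `sizes` holds 2, 9, 7 and 4 entries for the three clusters and the noise points, and `total` is 22, in line with the example.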
[0108] S7. Based on the preset integrity indicators, perform logical verification on the resource entities and their associated cultural knowledge chains in each theme cluster, and output the verified cultural resource retrieval results.
[0109] In one specific embodiment, the process of performing step S7 may specifically include the following steps:
[0110] For each theme cluster after local deduplication, a preset cultural domain knowledge graph is queried to extract the cultural knowledge chain containing multi-dimensional cultural attribute node association information corresponding to each resource entity in the cluster. The multi-dimensional cultural attribute nodes include at least one selected from dynasty, creator, physical material and geographical space.
[0111] Based on the preset integrity indicators, check whether there are missing attribute nodes or spatiotemporal logical conflicts in the cultural knowledge chain of each resource entity.
[0112] Remove resource entities that fail the logical validation, combine the validated resource entities, and output the final cultural resource retrieval results.
[0113] Specifically, a knowledge graph in the cultural domain is a pre-constructed knowledge base that stores cultural resources and their relationships in a graph structure. Nodes in the knowledge graph include cultural resource entities, dynasty nodes, creator nodes, geographic nodes, and physical material nodes, etc. Edges represent relationships between entities, such as "created at," "unearthed at," "collected at," "created by," "material is," etc. The knowledge graph is constructed as follows: Basic information about the resources is extracted from the cultural resource base, including fields such as dynasty, creator, excavation location, collection location, and material. These field values are mapped to nodes in the knowledge graph, and association edges are established between resource entities and each attribute node. For example, for the resource "Tang Dynasty Dunhuang Mural Cave 45," the knowledge graph stores the following information: the resource node "Tang Dynasty Dunhuang Mural Cave 45" is connected to the dynasty node "Tang Dynasty" via the edge "belonging to the dynasty," to the geographic node "Dunhuang" via the edge "excavation location," and to the material node "mural" via the edge "material." If the creator information is known, an edge "creator" is also established with the creator node.
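As an illustration of the construction step described above, the following Python sketch maps the fields of one resource record to (resource, edge, attribute node) triples. The record field names and the `FIELD_TO_EDGE` mapping are hypothetical; only the edge labels and the example values are taken from the text.

```python
# Hypothetical mapping from record field names to knowledge graph edge labels.
FIELD_TO_EDGE = {
    "dynasty": "belonging to the dynasty",
    "excavation_location": "excavation location",
    "material": "material",
    "creator": "creator",
}


def record_to_triples(record):
    """Turn one resource record into (resource, edge, attribute node) triples."""
    triples = []
    for field, edge in FIELD_TO_EDGE.items():
        value = record.get(field)
        if value:  # unknown fields (e.g. an unknown creator) produce no edge
            triples.append((record["name"], edge, value))
    return triples


record = {
    "name": "Tang Dynasty Dunhuang Mural Cave 45",
    "dynasty": "Tang Dynasty",
    "excavation_location": "Dunhuang",
    "material": "mural",
    "creator": None,  # creator information is not known in this example
}
triples = record_to_triples(record)
```

Because the creator is unknown here, only three edges are produced; a record with a known creator would yield a fourth "creator" edge.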
[0114] When extracting cultural knowledge chains, each resource entity within a cluster is used as a starting point, and a breadth-first traversal is performed in the knowledge graph to obtain multi-dimensional cultural attribute nodes directly or indirectly related to that entity and their relationships. The traversal depth can be preset to two or three layers. For example, information such as the dynasty to which the resource belongs, its place of origin, material, the start and end years of the dynasty, and the historical evolution of the region can be obtained. The extraction result forms a chain-like or network-like structure containing multi-dimensional cultural attribute node relationship information, which is the cultural knowledge chain of the resource entity. For example, a resource entity in the Flying Apsaras theme cluster is "Tang Dynasty Dunhuang Mural Cave 45". Its cultural knowledge chain can include: the resource itself, the dynasty to which it belongs, "Tang Dynasty" (with start and end years 618-907), the place of origin, "Dunhuang" (with geographical coordinates 94.8 degrees east longitude and 40.1 degrees north latitude), the material, "mural", the related "Tang Dynasty Mural Art Style" node, and the related "Dunhuang Grottoes" node, etc.
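The depth-limited breadth-first traversal described above might look like the following Python sketch. The adjacency-list representation, the function name, and the edge labels other than those named in the text are assumptions for illustration; the "Gansu" node is added only to show where a two-layer traversal stops.

```python
from collections import deque


def extract_knowledge_chain(graph, start, max_depth=2):
    """Collect (source, edge, target) triples reachable within max_depth layers.

    graph: node -> list of (edge_label, neighbor_node)
    """
    visited = {start}
    queue = deque([(start, 0)])
    chain = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:  # do not expand nodes at the depth limit
            continue
        for edge, neighbor in graph.get(node, []):
            chain.append((node, edge, neighbor))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return chain


resource = "Tang Dynasty Dunhuang Mural Cave 45"
graph = {
    resource: [
        ("belonging to the dynasty", "Tang Dynasty"),
        ("excavation location", "Dunhuang"),
        ("material", "mural"),
    ],
    "Tang Dynasty": [
        ("start and end years", "618-907"),
        ("related to", "Tang Dynasty Mural Art Style"),
    ],
    "Dunhuang": [
        ("coordinates", "94.8E, 40.1N"),
        ("related to", "Dunhuang Grottoes"),
    ],
    "Dunhuang Grottoes": [("located in", "Gansu")],  # third layer, illustrative
}

chain = extract_knowledge_chain(graph, resource, max_depth=2)
targets = {target for _, _, target in chain}
```

With a depth of two layers the chain contains the seven first- and second-layer relationships; raising `max_depth` to three also pulls in the third-layer edge.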
[0115] The integrity indicators can be pre-defined as a set of rules, including mandatory attribute rules and logical consistency rules. Mandatory attribute rules specify which core attribute nodes each resource entity should be associated with, such as at least two of dynasty, geographic space, and material. If these core nodes are missing from a resource entity's cultural knowledge chain, the entity is treated as having a missing attribute node. For example, if a resource has only a resource node without any associated attribute nodes, or is associated only with a material node and not with a dynasty or geographic space, it is treated as having a missing attribute node and fails the verification. Logical consistency rules are used to detect contradictory information in the cultural knowledge chain. Detection of spatiotemporal logical conflicts between levels includes: whether the dynasty and time interval match (e.g., the resource is associated with the dynasty node "Tang Dynasty," but the associated time node points to the start and end years of the "Song Dynasty"); whether the region and spatial coordinates match (e.g., the resource is associated with the region node "Dunhuang," but the associated geographic coordinates point to "Beijing"); and whether the dynasty and region match (e.g., a "Tang Dynasty" resource is associated with a place name that only appears in the "Song Dynasty"). The rule engine traverses the knowledge chain and reasons about the relationships between its nodes. If any of the above contradictions is detected, a spatiotemporal logical conflict between the upper and lower levels is determined to exist, and the verification fails.
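A minimal Python sketch of two of these rules, the mandatory attribute rule and the dynasty/time-interval consistency rule, is given below. The flat dictionary representation of a knowledge chain, the attribute key names, and the `DYNASTY_YEARS` lookup table are assumptions for illustration; the region/coordinate and dynasty/region checks are omitted for brevity.

```python
# Illustrative lookup of dynasty start and end years (assumed reference data).
DYNASTY_YEARS = {"Tang Dynasty": (618, 907), "Song Dynasty": (960, 1279)}
CORE_ATTRIBUTES = ("dynasty", "geographic_space", "material")


def validate_chain(chain, min_core=2):
    """Return (passed, reason) for one resource entity's knowledge chain.

    chain: attribute name -> value, a simplified flat view of the chain.
    """
    # Mandatory attribute rule: at least `min_core` core attributes present.
    present = [name for name in CORE_ATTRIBUTES if chain.get(name)]
    if len(present) < min_core:
        return False, "missing attribute node"

    # Logical consistency rule: the time interval must fall inside the dynasty.
    dynasty = chain.get("dynasty")
    interval = chain.get("time_interval")
    if dynasty in DYNASTY_YEARS and interval is not None:
        start, end = DYNASTY_YEARS[dynasty]
        if interval[0] < start or interval[1] > end:
            return False, "dynasty/time-interval conflict"

    return True, "ok"


complete = {"dynasty": "Tang Dynasty", "geographic_space": "Dunhuang",
            "material": "mural", "time_interval": (618, 907)}
material_only = {"material": "mural"}
conflicting = {"dynasty": "Tang Dynasty", "geographic_space": "Dunhuang",
               "material": "mural", "time_interval": (960, 1279)}

results = [validate_chain(c) for c in (complete, material_only, conflicting)]
```

In this sketch the first chain passes, the second fails the mandatory attribute rule, and the third fails because a Tang Dynasty node is paired with the Song Dynasty interval.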
[0116] The verified resource entities from all theme clusters are sorted in descending order of comprehensive similarity to form the final result list. Noise point resources that pass verification are also included in the final result list. The final output includes each cultural resource's identifier, text description, image file path, spatiotemporal metadata, and associated cultural knowledge chain information, for users to browse and conduct further research.
[0117] For example, the candidate resource pool output by S6 contains three topic clusters (2 resources from the flying apsaras cluster, 9 resources from the Jataka tales cluster, and 7 resources from the patrons cluster) and four noise point resources. After the cultural domain knowledge graph is queried and verification is performed, the knowledge chains of both resources in the flying apsaras cluster contain the dynasty "Tang Dynasty", the location "Dunhuang", and the material "mural", and both pass the verification. Among the nine resources in the Jataka tales cluster, the knowledge chains of eight resources contain the dynasty "Tang Dynasty", the location "Dunhuang", and the material "mural" with no logical conflicts, so they pass the verification; one resource is removed due to a missing geospatial node. Among the seven resources in the patrons cluster, six pass the verification, and one is removed because its dynasty node does not match the time interval. Among the four noise point resources, three pass the verification, and one is removed because of a missing attribute node. The final output includes: 2 resources from the flying apsaras cluster, 8 resources from the Jataka tales cluster, 6 resources from the patrons cluster, and 3 noise point resources, for a total of 19 resources, which are output in descending order of comprehensive similarity.
[0118] The above describes the unified multimodal data retrieval method for the cultural field in the embodiments of this application. The following describes the unified multimodal data retrieval system for the cultural field in the embodiments of this application. Referring to Figure 4, a schematic structural diagram of the system, the unified multimodal data retrieval system for the cultural field provided in this application includes:
[0119] The semantic encoding module 10 is used to acquire multimodal query data input by the user, extract semantics from it using a preset contrastive learning model, and obtain the query semantic vector.
[0120] The vector retrieval module 20 is used to perform retrieval and matching in the pre-set multimodal vector index of the cultural resource database based on the query semantic vector to obtain a candidate vector set.
[0121] The cultural alignment module 30 is used to project the query semantic vector and each candidate vector to a preset cultural semantic alignment space, reweight the dimensional components of different cultural attributes in the space using a preset weight matrix, calculate the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and select a preliminary result set.
[0122] The intent refinement module 40 is used to perform semantic consistency comparison between the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set and the historical intent information extracted from the multimodal query data to obtain a refined result set.
[0123] The similarity sorting module 50 is used to sort the refined result set according to the comprehensive similarity and determine the initial output sequence.
[0124] The clustering and deduplication module 60 is used to cluster and group resources according to the semantic features of each resource in the initial output sequence, and construct a candidate resource pool containing multiple topic clusters. For any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, local deduplication processing is performed.
[0125] The verification output module 70 is used to perform logical verification on the resource entities and their associated cultural knowledge chains in each theme cluster by combining preset integrity indicators, and output the cultural resource retrieval results that have passed the verification.
[0126] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for unified search of multi-modal data in the field of culture, characterized in that the method includes: S1. Obtain multimodal query data input by the user, and use a preset contrastive learning model to extract semantics from it to obtain the query semantic vector; S2. Based on the query semantic vector, perform retrieval and matching in the preset multimodal vector index of the cultural resource database to obtain a candidate vector set; S3. Project the query semantic vector and each candidate vector to a preset cultural semantic alignment space, reweight the dimensional components of different cultural attributes in the space using a preset weight matrix, calculate the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and select a preliminary result set; S4. Perform a semantic consistency comparison between the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set and the historical intent information extracted from the multimodal query data to obtain a refined result set; S5. Sort the refined result set based on the comprehensive similarity to determine the initial output sequence; S6. Based on the semantic features of each resource in the initial output sequence, cluster and group them to construct a candidate resource pool containing multiple topic clusters, and for any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, perform local deduplication; S7. Based on the preset integrity indicators, perform logical verification on the resource entities and their associated cultural knowledge chains in each theme cluster, and output the verified cultural resource retrieval results.
S4 includes: Obtain the cross-modal spatiotemporal metadata of each cultural resource pre-bound in the preliminary result set; The multimodal query data is parsed to extract the historical intent information that represents the user's search request. The historical intent information includes time-dimensional intent and spatial-dimensional intent. The cross-modal spatiotemporal metadata and the historical intent information are mapped to a preset historical spatiotemporal knowledge graph. The temporal period overlap degree of the time dimension intent and the cross-modal spatiotemporal metadata, as well as the spatial region overlap degree of the spatial dimension intent and the cross-modal spatiotemporal metadata, are calculated respectively. Determine whether the temporal overlap and spatial overlap of each cultural resource meet the preset consistency verification threshold. If so, retain the corresponding cultural resource and construct the refined result set.
2. The method according to claim 1, characterized in that, S1 includes: The multimodal query data is parsed to obtain text description features and image visual features, which are then input into the text encoder and image encoder in the contrastive learning model, respectively, to extract unimodal feature vectors. The single-modal feature vector is projected onto the unified joint semantic space constructed by the contrastive learning model to obtain a cross-modal embedding representation. The unified joint semantic space is generated by pre-training and aligning culturally specific concept samples through a contrastive learning strategy that brings positive sample pairs of different modalities under the same cultural concept closer and pushes away negative sample pairs of different modalities under different cultural concepts. The cross-modal embedding representation is subjected to multimodal feature fusion processing based on preset modal fusion weights, and the query semantic vector is output.
3. The method of claim 1, wherein the multimodal vector index is a high-dimensional spatial index with a unified dimension, constructed by pre-mapping the text and image data in the cultural resource database using an embedding extraction model and adjusting the vector dimension through a preset dimension transformation strategy to reduce semantic bias between modalities, and S2 includes: The dimension transformation strategy is used to perform dimension alignment processing on the query semantic vector to obtain the mapped query vector; Based on the mapped query vector, an approximate nearest neighbor search is performed in the multimodal vector index. The spatial metric distance between the mapped query vector and each data vector in the high-dimensional spatial index is calculated, and a subset of data whose spatial metric distance meets the preset conditions is selected as the candidate vector set.
4. The method of claim 1, wherein, S3 includes: Project the query semantic vector and each candidate vector in the candidate vector set onto the cultural semantic alignment space to obtain the corresponding spatial projection vector; Using the preset weight matrix that incorporates culturally unique concepts, the dimensional components representing different cultural attributes in each spatial projection vector are weighted and adjusted, and the reweighted query semantic vector and the reweighted candidate vector are output. The cosine similarity between the reweighted query semantic vector and each reweighted candidate vector in the cultural semantic alignment space is defined as the comprehensive similarity. Determine whether the comprehensive similarity corresponding to each candidate vector meets the preset screening conditions, and construct the preliminary result set based on the cultural resources mapped by the candidate vectors that meet the preset screening conditions.
5. The method according to claim 1, characterized in that S5 includes: Extract each cultural resource from the refined result set, and obtain the comprehensive similarity corresponding to the candidate vector mapped by each cultural resource; Using a preset sorting algorithm, the cultural resources in the refined result set are sorted in descending order of priority according to their corresponding comprehensive similarity values, from largest to smallest. The initial output sequence is generated based on the descending priority order.
6. The method of claim 1, wherein S6 includes: Extract the semantic feature vectors corresponding to each cultural resource in the initial output sequence, use a clustering algorithm to calculate the spatial distance between each semantic feature vector in the cultural semantic alignment space, divide the initial output sequence into multiple topic clusters based on the proximity of the spatial distance, and construct the candidate resource pool; The average cluster distance of the semantic feature vectors within each topic cluster in the cultural semantic alignment space, or the number of vectors within a preset neighborhood radius, is calculated to quantify the intra-cluster distribution density of each topic cluster. If the intra-cluster distribution density of any topic cluster exceeds the preset redundancy threshold, then within that topic cluster, based on the priority in the initial output sequence, a preset number of top-ranked target cultural resources are retained, and the remaining cultural resources are removed, thus completing the local deduplication process for that topic cluster.
7. The method of claim 1, wherein S7 includes: For each theme cluster after local deduplication, a preset cultural domain knowledge graph is queried to extract the cultural knowledge chain containing multi-dimensional cultural attribute node association information corresponding to each resource entity in the cluster. The multi-dimensional cultural attribute nodes include at least one selected from dynasty, creator, physical material and geographic space. Based on the preset integrity indicators, check whether there are missing attribute nodes or spatiotemporal logic conflicts between different levels in the cultural knowledge chain of each resource entity. Remove resource entities that fail the logical validation, combine the validated resource entities, and output the final cultural resource retrieval results.
8. A multi-modal data unified search system for cultural domain, for implementing the method according to any one of claims 1 to 7, characterized in that, The system includes: The semantic encoding module is used to acquire multimodal query data input by the user, extract semantics from it using a pre-set contrastive learning model, and obtain the query semantic vector. The vector retrieval module is used to perform retrieval and matching in the preset multimodal vector index of the cultural resource database based on the query semantic vector to obtain a candidate vector set; The cultural alignment module is used to project the query semantic vector and each candidate vector to a preset cultural semantic alignment space, reweight the dimensional components of different cultural attributes in the space using a preset weight matrix, calculate the comprehensive similarity between the reweighted query semantic vector and each candidate vector, and select a preliminary result set. The intent refinement module is used to perform semantic consistency comparison between the cross-modal spatiotemporal metadata associated with each resource in the preliminary result set and the historical intent information extracted from the multimodal query data to obtain a refined result set. The similarity sorting module is used to sort the refined result set according to the comprehensive similarity to determine the initial output sequence; The clustering and deduplication module is used to cluster and group resources according to the semantic features of each resource in the initial output sequence, and construct a candidate resource pool containing multiple topic clusters. For any topic cluster, if the intra-cluster distribution density of resources within the cluster exceeds a preset redundancy threshold, local deduplication processing is performed. 
The verification output module is used to perform logical verification on resource entities and their associated cultural knowledge chains in each theme cluster by combining preset integrity indicators, and output the cultural resource retrieval results that have passed the verification.