Cross-modal retrieval method, device and computer equipment for cultural heritage

By combining a multimodal large language model and a hypergraph convolutional neural network, the image, text and cultural symbol features of cultural heritage are explicitly extracted, solving the problems of complex many-to-many interactions and semantic alignment in cross-modal retrieval and improving retrieval accuracy.

CN122196221A · Pending · Publication Date: 2026-06-12 · CHANGSHU INSTITUTE OF TECHNOLOGY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHANGSHU INSTITUTE OF TECHNOLOGY
Filing Date: 2026-05-13
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing cross-modal retrieval technologies struggle to effectively capture complex many-to-many interactions and lack explicit cultural semantic extraction and alignment in cultural heritage resources, resulting in low retrieval accuracy.

Method used

A pre-trained multimodal large language model is used to extract features from images, texts and cultural symbols. A cross-modal retrieval model is constructed through multi-head cross-attention and hypergraph convolutional neural networks to explicitly extract cultural symbols and perform high-order relationship modeling and global information aggregation.

Benefits of technology

It improves the accuracy of cross-modal retrieval of cultural heritage, and realizes explicit extraction and precise matching of the core value of cultural heritage.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cross-modal retrieval method, device and computer equipment for cultural heritage, which comprises the following steps: obtaining image information, text information and cultural symbol information from cultural asset resources; using a pre-trained multi-modal large model to extract initial image features, initial text features and initial cultural symbol features from the obtained image information, text information and cultural symbol information; inputting the three initial features into a multi-head cross attention mechanism respectively to obtain target image features, target text features, image interaction features and text interaction features; inputting the four features into a hypergraph convolutional neural network respectively to obtain first features, second features, third features and fourth features; combining all the extracted features to determine a target function to construct a cross-modal retrieval model; and inputting to-be-retrieved information into the trained cross-modal retrieval model to output a retrieval result, so as to solve the problem of low cross-modal retrieval precision for cultural heritage.

Description

Technical Field

[0001] This application relates to the field of information retrieval technology, and in particular to a cross-modal retrieval method, apparatus and computer equipment for cultural heritage.

Background Technology

[0002] With the rapid development of digital technology, the digital protection and display of cultural heritage resources has become increasingly widespread, generating massive amounts of image and text data. To achieve efficient management and utilization of this multimodal data, cross-modal retrieval technology has emerged. Cross-modal retrieval aims to bridge the "semantic gap" between images and text, enabling image-to-text or text-to-image searches. However, because cultural heritage resources often contain profound historical connotations, their data not only includes superficial visual elements and literal descriptions but also harbors deep cultural symbols and complex semantic relationships. This places extremely high demands on the feature representation capabilities of retrieval models.

[0003] Most existing cross-modal retrieval techniques face significant technical bottlenecks in practical applications. On the one hand, existing methods typically rely on simple pairwise similarity (i.e., only considering the direct connection between two samples) when constructing relationships between samples, making it difficult to effectively capture the complex "many-to-many" interactions prevalent in real-world datasets. On the other hand, general retrieval models lack explicit extraction and cross-modal alignment mechanisms for cultural semantics, so they struggle to match features accurately when faced with semantically complex cultural resources, which leads to the technical problem of low accuracy in cross-modal retrieval of cultural heritage.

Summary of the Invention

[0004] In view of this, embodiments of this application provide a method, apparatus, and computer device for cross-modal retrieval of cultural heritage, so as to at least solve the technical problem of low accuracy in cross-modal retrieval of cultural heritage.

[0005] According to one aspect of this application, a cross-modal retrieval method for cultural heritage is provided. The method includes: acquiring image information, text information, and cultural symbol information from cultural asset resources; using a pre-trained multimodal large language model, acquiring initial image features, initial text features, and initial cultural symbol features from the image information, text information, and cultural symbol information, respectively; inputting the initial image features, initial text features, and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model, respectively, to obtain target image features, target text features, image interaction features, and text interaction features; inputting the target image features, target text features, image interaction features, and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model, respectively, to obtain first features, second features, third features, and fourth features; determining the objective function of the cross-modal retrieval model based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, first features, second features, third features, and fourth features, to construct the cross-modal retrieval model; acquiring information to be retrieved, and inputting the information to be retrieved into the pre-trained cross-modal retrieval model for analysis to obtain retrieval results.

[0006] Optionally, the target image features, target text features, image interaction features, and text interaction features are respectively input into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature, including: constructing the first association matrix, the second association matrix, the third association matrix, and the fourth association matrix based on the target image features, target text features, image interaction features, and text interaction features, respectively; and inputting the first association matrix, the second association matrix, the third association matrix, and the fourth association matrix into the hypergraph convolutional neural network to obtain the first feature, the second feature, the third feature, and the fourth feature.
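The paragraph above does not say how each association matrix is built. As a minimal sketch, assuming the association matrix is a node-by-hyperedge incidence matrix and that each hyperedge groups one sample with its k most cosine-similar samples (a common choice in hypergraph learning, not something the source states), the construction could look like this. The function name `knn_incidence` and the toy sizes are hypothetical.

```python
import numpy as np

def knn_incidence(features: np.ndarray, k: int) -> np.ndarray:
    """Build a node-by-hyperedge incidence matrix from sample features.

    Assumption (not stated in the source): hyperedge j groups sample j with
    its k most cosine-similar samples, so H[i, j] = 1 when sample i belongs
    to hyperedge j.
    """
    n = features.shape[0]
    # L2-normalise rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    incidence = np.zeros((n, n), dtype=np.float64)
    for j in range(n):
        # Indices of the k + 1 most similar samples (sample j itself included).
        members = np.argsort(-sim[j])[: k + 1]
        incidence[members, j] = 1.0
    return incidence

rng = np.random.default_rng(0)
target_image_features = rng.normal(size=(6, 8))  # 6 samples, 8 dims (toy sizes)
H1 = knn_incidence(target_image_features, k=2)   # "first association matrix"
```

Under the same assumption, the second, third, and fourth association matrices would be built by calling the function on the target text features, image interaction features, and text interaction features.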

[0007] Optionally, based on initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, a first feature, a second feature, a third feature, and a fourth feature, an objective function is determined to construct the cross-modal retrieval model. This includes: determining image hash coding features and binary image features based on the first and third features, and determining text hash coding features and binary text features based on the second and fourth features; and determining the objective function based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features to construct the cross-modal retrieval model.

[0008] Optionally, based on the first feature and the third feature, determining the image hash coding feature and the binary image feature includes: fusing the first feature and the third feature to obtain a first fused feature; mapping the first fused feature to obtain the image hash coding feature; and converting the image hash coding feature using a sign function to obtain the binary image feature. Based on the second feature and the fourth feature, determining the text hash coding feature and the binary text feature includes: fusing the second feature and the fourth feature to obtain a second fused feature; mapping the second fused feature to obtain the text hash coding feature; and converting the text hash coding feature using a sign function to obtain the binary text feature.
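The fuse, map, and binarize steps can be illustrated with a short sketch. The source does not specify the fusion operator or the mapping, so concatenation and a linear layer followed by tanh are assumptions here, as are the function name `hash_codes`, the code length of 32, and the random weights.

```python
import numpy as np

def hash_codes(first, third, weight, bias):
    """Fuse two feature sets, map them to hash codes, and binarize.

    Assumptions (not from the source): fusion is concatenation, the mapping
    is a linear layer followed by tanh, and sign(0) is mapped to +1.
    """
    fused = np.concatenate([first, third], axis=1)  # first fused feature
    relaxed = np.tanh(fused @ weight + bias)        # hash coding feature in [-1, 1]
    binary = np.where(relaxed >= 0, 1, -1)          # sign function -> {-1, +1}
    return relaxed, binary

rng = np.random.default_rng(0)
first_feature = rng.normal(size=(5, 8))
third_feature = rng.normal(size=(5, 8))
weight = rng.normal(size=(16, 32)) * 0.1   # 16 fused dims -> 32-bit code
bias = np.zeros(32)
image_hash, binary_image = hash_codes(first_feature, third_feature, weight, bias)
```

The text branch would follow the same pattern with the second and fourth features.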

[0009] Optionally, based on initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features, an objective function is determined to construct a cross-modal retrieval model. This includes: constructing a triplet loss function based on initial image features, initial text features, and initial cultural symbol features; constructing a first quadruple loss function based on target image features, target text features, image interaction features, and text interaction features; constructing a second quadruple loss function based on image hash coding features, binary image features, text hash coding features, and binary text features; and determining the objective function based on the triplet loss function, the first quadruple loss function, and the second quadruple loss function to construct the cross-modal retrieval model.

[0010] Optionally, a triplet loss function is constructed based on the initial image features, initial text features, and initial cultural symbol features, including: determining a first cross-entropy loss function, a second cross-entropy loss function, and a third cross-entropy loss function based on the initial image features, initial text features, and initial cultural symbol features; and constructing a triplet loss function based on a first preset coefficient, the first cross-entropy loss function, the second cross-entropy loss function, and the third cross-entropy loss function.

[0011] Optionally, a first quadruple loss function is constructed based on target image features, target text features, image interaction features, and text interaction features, including: determining a fourth cross-entropy loss function, a fifth cross-entropy loss function, a sixth cross-entropy loss function, and a seventh cross-entropy loss function based on target image features, target text features, image interaction features, and text interaction features; and constructing the first quadruple loss function based on a second preset coefficient, the fourth cross-entropy loss function, the fifth cross-entropy loss function, the sixth cross-entropy loss function, and the seventh cross-entropy loss function.

[0012] Optionally, a second quadruple loss function is constructed based on image hash coding features, binary image features, text hash coding features, and binary text features, including: determining an eighth cross-entropy loss function, a ninth cross-entropy loss function, a tenth cross-entropy loss function, and an eleventh cross-entropy loss function based on image hash coding features, binary image features, text hash coding features, and binary text features; and constructing the second quadruple loss function based on the second preset coefficient, the eighth cross-entropy loss function, the ninth cross-entropy loss function, the tenth cross-entropy loss function, and the eleventh cross-entropy loss function.
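The three loss groups described above can be sketched as coefficient-weighted sums of cross-entropy terms. The source does not give the form of each cross-entropy term or how the preset coefficients enter, so everything below is an assumption for illustration: a contrastive cross-entropy that treats row i of two feature batches as a matching pair, a temperature of 0.07, and coefficients `alpha` and `beta` that scale the sum of each group.

```python
import numpy as np

def pairwise_cross_entropy(a, b, temperature=0.07):
    """Cross-entropy that treats row i of `a` and row i of `b` as a match.

    Assumed form (not from the source): softmax over cosine similarities
    with the matching pair as the target class.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def weighted_group(losses, coefficient):
    """Scale the sum of one group of cross-entropy terms by its coefficient."""
    return coefficient * float(sum(losses))

def objective(triplet_terms, quad1_terms, quad2_terms, alpha, beta):
    """Triplet loss (3 terms) plus two quadruple losses (4 terms each)."""
    return (weighted_group(triplet_terms, alpha)
            + weighted_group(quad1_terms, beta)
            + weighted_group(quad2_terms, beta))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
matched = pairwise_cross_entropy(x, x)                         # aligned pairs
mismatched = pairwise_cross_entropy(x, np.roll(x, 1, axis=0))  # shifted pairs
total = objective([1.0, 2.0, 3.0], [1.0] * 4, [2.0] * 4, alpha=0.5, beta=0.25)
```

Aligned batches give a lower loss than misaligned ones, which is the behaviour a retrieval objective needs from each term.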

[0013] According to another aspect of this application, a cross-modal retrieval device for cultural heritage is provided. The device includes: a first acquisition unit for acquiring image information, text information, and cultural symbol information from cultural asset resources; a second acquisition unit for acquiring initial image features, initial text features, and initial cultural symbol features from the image information, text information, and cultural symbol information respectively using a pre-trained multimodal large language model; a third acquisition unit for inputting the initial image features, initial text features, and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model respectively to obtain target image features, target text features, image interaction features, and text interaction features; a fourth acquisition unit for inputting the target image features, target text features, image interaction features, and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature; a unit for determining the objective function of the cross-modal retrieval model based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, the first feature, the second feature, the third feature, and the fourth feature, so as to construct the cross-modal retrieval model; and a unit for obtaining the information to be retrieved and inputting it into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

[0014] According to another aspect of this application, a storage medium is provided on which a computer program is stored, which, when executed by a processor, implements the aforementioned cross-modal retrieval method for cultural heritage.

[0015] According to another aspect of this application, a computer device is provided, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor executes the program to implement the aforementioned cross-modal retrieval method for cultural heritage.

[0016] Based on the above technical solution, this application provides a cross-modal retrieval method for cultural heritage. The method includes: acquiring image information, text information, and cultural symbol information from cultural asset resources; using a pre-trained multimodal large language model to acquire initial image features, initial text features, and initial cultural symbol features from the image information, text information, and cultural symbol information, respectively; inputting the initial image features, initial text features, and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model, respectively, to obtain target image features, target text features, image interaction features, and text interaction features; inputting the target image features, target text features, image interaction features, and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model, respectively, to obtain the first feature, second feature, third feature, and fourth feature; determining the objective function of the cross-modal retrieval model based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, first feature, second feature, third feature, and fourth feature, to construct the cross-modal retrieval model; and acquiring the information to be retrieved and inputting it into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results. In other words, in this embodiment, image information, text information, and cultural symbol information are first obtained from cultural asset resources. Then, using a pre-trained multimodal large model, the corresponding initial image features, initial text features, and initial cultural symbol features are extracted from the obtained image information, text information, and cultural symbol information. Next, the three initial features are input into a multi-head cross-attention mechanism to obtain target image features, target text features, image interaction features, and text interaction features. These four features are then input into a hypergraph convolutional neural network to obtain a first feature, a second feature, a third feature, and a fourth feature that additionally capture higher-order relationships. All the features extracted above are combined to determine the objective function and construct the cross-modal retrieval model. Finally, the information to be retrieved is input into the trained model to output the final retrieval result. Because multi-head cross-attention is used to obtain the target image features, target text features, image interaction features, and text interaction features, cultural symbols carrying the core value of cultural heritage are explicitly extracted and take part in cross-modal feature alignment. Because a hypergraph convolutional neural network is introduced to model high-order complex relationships and aggregate global information over the target image features, target text features, image interaction features, and text interaction features, the technical problem of low cross-modal retrieval accuracy for cultural heritage is solved and the technical effect of improving that accuracy is achieved.

[0017] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the following are specific embodiments of this application.

Attached Figure Description

[0018] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings: Figure 1 shows a flowchart of a cross-modal retrieval method for cultural heritage provided in an embodiment of this application; Figure 2 shows a flowchart of another cross-modal retrieval method for cultural heritage provided in an embodiment of this application; Figure 3 shows a structural schematic diagram of a cross-modal retrieval device for cultural heritage provided in an embodiment of this application; Figure 4 shows a schematic diagram of the device structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0019] The present application will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the embodiments of the present application can be combined with each other.

[0020] In this embodiment, Figure 1 shows a flowchart of a cross-modal retrieval method for cultural heritage provided in an embodiment of this application. As shown in Figure 1, the method includes the following steps: Step S101: Obtain image information, text information, and cultural symbol information from cultural asset resources.

[0021] In the technical solution provided in step S101 of this application, image information, text information, and cultural symbol information can be obtained from cultural asset resources. The image information, text information, and cultural symbol information are all sample information that can be used to train the cross-modal retrieval model. Cultural asset resources may also be referred to as cultural heritage image-text resources.

[0022] Step S102: Using a pre-trained multimodal large language model, initial image features, initial text features, and initial cultural symbol features are obtained from image information, text information, and cultural symbol information, respectively.

[0023] In the technical solution provided in step S102 of this application, a pre-trained multimodal large language model (CLIP model) is used to extract initial image features, initial text features, and initial cultural symbol features from image information, text information, and cultural symbol information, respectively.

[0024] Optionally, the initial image features may be referred to simply as image features, the initial text features as text features, and the initial cultural symbol features as cultural symbol features; each is denoted by its own symbol.

[0025] For example, suppose the training set of cultural heritage resources consists of N image-text pairs, where $I_1$ denotes the first image, $I_2$ the second image, and $I_N$ the Nth image, N being a positive integer, and $T_1$ denotes the first text, $T_2$ the second text, and $T_N$ the Nth text. For each sample, the image is input into the image encoder of the CLIP model to extract the image features; the text is input into the text encoder of the CLIP model to extract the text features; and the image, the text, and learnable prompt word parameters are input into the cultural symbol encoder of the CLIP model to extract the cultural symbol features. The learnable prompt word parameters are characterized by their length and their feature dimension.

[0026] Step S103: Input the initial image features, initial text features, and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model to obtain target image features, target text features, image interaction features, and text interaction features.

[0027] In the technical solution provided by step S103 of this application, after obtaining the initial image features, initial text features and initial cultural symbol features, the initial image features, initial text features and initial cultural symbol features can be input into the multi-head cross-attention of the cross-modal retrieval model, so as to obtain the target image features, target text features, image interaction features and text interaction features.

[0028] Optionally, the multi-head cross-attention network includes at least: a first multi-head cross-attention network, a second multi-head cross-attention network, a third multi-head cross-attention network, and a fourth multi-head cross-attention network, and the above four multi-head cross-attention networks are designed in parallel in the cross-modal retrieval model.

[0029] Optionally, the initial cultural symbol features and the initial image features are input into the first multi-head cross-attention to obtain the target image features, which are image features enhanced with cultural symbols. For example, the cultural symbol-enhanced image features of each sample are calculated as follows: a multi-head cross-attention is constructed from a query source and a target that supplies both the key and the value, and a residual term is then added to its output. The first multi-head function combines the outputs of the individual attention heads, and each head is computed using the weight matrices of the first multi-head cross-attention for the query source, the target key, the target value, and the connection of the multi-head projections. The result is the target image features.

[0030] Optionally, the initial text features and the initial cultural symbol features are input into the second multi-head cross-attention to obtain the target text features, which are text features enhanced with cultural symbols. For example, the cultural symbol-enhanced text features of each sample are calculated as follows: a multi-head cross-attention is constructed from a query source and a target that supplies both the key and the value, and a residual term is then added to its output. The second multi-head function combines the outputs of the individual attention heads.

[0031] Each head is computed using the weight matrices of the second multi-head cross-attention for the query source, the target key, the target value, and the connection of the multi-head projections. The result is the target text features.

[0032] Optionally, the initial image features and the target text features are input into the third multi-head cross-attention to obtain the image interaction features, which are image interaction features guided by text content. For example, the text content-guided image interaction features of each sample are calculated as follows: a multi-head cross-attention is constructed from a query source and a target that supplies both the key and the value, and a residual term is then added to its output. The third multi-head function combines the outputs of the individual attention heads, and each head is computed using the weight matrices of the third multi-head cross-attention for the query source, the target key, the target value, and the connection of the multi-head projections. The result is the image interaction features.

[0033] Optionally, the initial text features and the target image features are input into the fourth multi-head cross-attention to obtain the text interaction features, which are text interaction features guided by image content. For example, the image content-guided text interaction features of each sample are calculated as follows: a multi-head cross-attention is constructed from a query source and a target that supplies both the key and the value, and a residual term is then added to its output. The fourth multi-head function combines the outputs of the individual attention heads, and each head is computed using the weight matrices of the fourth multi-head cross-attention for the query source, the target key, the target value, and the connection of the multi-head projections. The result is the image content-guided text interaction features.

[0034] It should be noted that this is only a preferred embodiment for obtaining target image features, target text features, image interaction features, and text interaction features. The process and method for obtaining target image features, target text features, image interaction features, and text interaction features are not specifically limited. As long as the target image features, target text features, image interaction features, and text interaction features are obtained through multi-head cross-attention and based on the initial image features, initial text features, and initial cultural symbol features, they are all within the protection scope of this application and will not be listed here.
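One of the four parallel multi-head cross-attention branches can be sketched as follows. This is an illustration in the standard scaled dot-product form, not the formula of this application: it assumes that the query source also serves as the residual term, that image features act as the query source with cultural symbol features supplying keys and values, and that all projections are square matrices. The function names and toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_src, target, w_q, w_k, w_v, w_o, num_heads):
    """query_src: (n_q, d); target: (n_t, d); each weight matrix: (d, d)."""
    n_q, d = query_src.shape
    d_head = d // num_heads
    # Project, then split the feature axis into heads: (heads, tokens, d_head).
    q = (query_src @ w_q).reshape(n_q, num_heads, d_head).transpose(1, 0, 2)
    k = (target @ w_k).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    v = (target @ w_v).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_t)
    heads = softmax(scores) @ v                          # (heads, n_q, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n_q, d)    # concatenate heads
    # Output projection plus residual (assumed to be the query source).
    return concat @ w_o + query_src

rng = np.random.default_rng(0)
d, heads = 8, 2
image_tokens = rng.normal(size=(4, d))    # stand-in for initial image features
symbol_tokens = rng.normal(size=(3, d))   # stand-in for cultural symbol features
w_q, w_k, w_v, w_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
enhanced = multi_head_cross_attention(
    image_tokens, symbol_tokens, w_q, w_k, w_v, w_o, heads
)
```

The other three branches would reuse the same function with their own weight matrices and their own pair of inputs.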

[0035] Step S104: Input the target image features, target text features, image interaction features, and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature.

[0036] In the technical solution provided by step S104 of this application, the obtained target image features, target text features, image interaction features and text interaction features are respectively input into the hypergraph convolutional neural network of the cross-modal retrieval model in order to obtain the first feature, the second feature, the third feature and the fourth feature.

[0037] Optionally, the first feature, the second feature, the third feature, and the fourth feature are each denoted by their own symbol.

[0038] For example, the target image features, target text features, image interaction features, and text interaction features are fed into the hypergraph convolutional neural network to extract the first feature, the second feature, the third feature, and the fourth feature, respectively.

[0039] It is understood that this is only a preferred embodiment for obtaining the first feature, the second feature, the third feature, and the fourth feature, and the process and method for obtaining the first feature, the second feature, the third feature, and the fourth feature are not specifically limited.
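The source does not give the form of the hypergraph convolution layer. As a sketch under that caveat, the widely used normalized propagation rule X' = ReLU(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Θ) is assumed below, where H is a node-by-hyperedge incidence matrix, W holds the hyperedge weights, and Dv and De are the node and hyperedge degree matrices. The function name and the toy hypergraph are hypothetical.

```python
import numpy as np

def hypergraph_conv(x, incidence, theta, edge_weights=None):
    """One hypergraph convolution layer in the assumed normalized form.

    x: (n_nodes, d_in); incidence: (n_nodes, n_edges); theta: (d_in, d_out).
    """
    n_edges = incidence.shape[1]
    if edge_weights is None:
        edge_weights = np.ones(n_edges)
    node_degree = incidence @ edge_weights   # weighted count of edges per node
    edge_degree = incidence.sum(axis=0)      # number of nodes per hyperedge
    dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(node_degree, 1e-12, None)))
    de_inv = np.diag(1.0 / np.clip(edge_degree, 1e-12, None))
    propagate = (dv_inv_sqrt @ incidence @ np.diag(edge_weights)
                 @ de_inv @ incidence.T @ dv_inv_sqrt)
    return np.maximum(propagate @ x @ theta, 0.0)  # ReLU activation

# Toy case: one hyperedge containing all three nodes.
incidence = np.ones((3, 1))
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = hypergraph_conv(x, incidence, theta=np.eye(2))
```

In this toy case every node receives the mean of all node features, which shows how a single hyperedge aggregates information across many samples at once rather than across one pair at a time.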

[0040] Step S105: Based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, first feature, second feature, third feature, and fourth feature, determine the objective function of the cross-modal retrieval model to construct the cross-modal retrieval model.

[0041] In the technical solution provided in step S105 of this application, based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, first features, second features, third features, and fourth features obtained in the above steps, the objective function of the cross-modal retrieval model can be determined, and the cross-modal retrieval model can then be constructed based on the objective function. The objective function may also be called the objective loss function.

[0042] For example, based on the 11 features obtained from the above steps, an objective function can be constructed. Then, based on the above objective function, a cross-modal retrieval model can be constructed. That is, the model updates the parameters according to the backpropagation of the loss function to construct an accurate retrieval model (cross-modal retrieval model) that can simultaneously take into account shallow visual, deep cultural semantics and complex high-order relationships.

[0043] It should be noted that this is only a preferred implementation method for constructing a cross-modal retrieval model, and no specific limitations are made on the process and method of constructing a cross-modal retrieval model.

[0044] Step S106: Obtain the information to be retrieved and input it into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

[0045] In the technical solution provided in step S106 of this application, the gradient descent method is used to train the cross-modal retrieval model to obtain a pre-trained cross-modal retrieval model. Then, the information to be retrieved is obtained and input into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

[0046] For example, after training the cross-modal retrieval model, if the information to be retrieved is "Song Dynasty celadon with lotus petal pattern", inputting the information to be retrieved into the pre-trained cross-modal retrieval model will yield an image of Song Dynasty celadon with lotus petal pattern.

[0047] For another example, after the cross-modal retrieval model is trained, the Hamming distance between the information to be retrieved and every sample in the retrieval set is calculated. The samples are then sorted in ascending order of distance (i.e., from high to low similarity), and a preset number of samples are selected as the final retrieval results output to the user.
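The ranking step described here can be illustrated with a minimal sketch. It assumes the hash codes are stored as NumPy arrays with entries in {-1, +1}; the function name `hamming_rank`, the parameter `top_k` (standing in for the preset number), and the toy data are hypothetical and not taken from the application.

```python
import numpy as np

def hamming_rank(query_code, database_codes, top_k):
    """Return the indices and distances of the top_k codes closest to the query."""
    query_code = np.asarray(query_code)
    database_codes = np.asarray(database_codes)
    # Hamming distance: number of positions where two codes differ.
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    # Ascending distance means descending similarity; a stable sort keeps ties in input order.
    order = np.argsort(distances, kind="stable")[:top_k]
    return order, distances[order]

# Toy retrieval set of four 4-bit codes (illustrative values only).
database = np.array([
    [ 1, -1,  1,  1],
    [-1, -1,  1,  1],
    [-1,  1, -1, -1],
    [ 1, -1,  1, -1],
])
query = np.array([1, -1, 1, 1])
indices, dists = hamming_rank(query, database, top_k=2)
```

Because the codes are binary, each comparison reduces to counting differing positions, which is the main reason hash-based retrieval is cheap at query time.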

[0048] In the technical solution provided by steps S101 to S106 of this application, image information, text information, and cultural symbol information are first obtained from cultural asset resources. A pre-trained multimodal large model then extracts the corresponding initial image features, initial text features, and initial cultural symbol features from the obtained information. The three initial features are respectively input into a multi-head cross-attention mechanism to obtain target image features, target text features, image interaction features, and text interaction features. These four features are respectively input into a hypergraph convolutional neural network to obtain a first feature, a second feature, a third feature, and a fourth feature that further contain higher-order relationships. All the features extracted above are combined to determine the objective function and thereby construct a cross-modal retrieval model. Finally, the information to be retrieved is input into the trained model to output the final retrieval result. Multi-head cross-attention is used to acquire the target image features, target text features, image interaction features, and text interaction features, so that cultural symbols containing the core value of cultural heritage are explicitly extracted and participate in cross-modal feature alignment; the hypergraph convolutional neural network is then introduced to model high-order complex relationships and aggregate global information for these four features. This solves the technical problem of low cross-modal retrieval accuracy for cultural heritage and achieves the technical effect of improving that accuracy.

[0049] The method described in this embodiment will be further described below.

[0050] As an optional embodiment, the target image features, target text features, image interaction features, and text interaction features are respectively input into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature, including: constructing a first association matrix, a second association matrix, a third association matrix, and a fourth association matrix based on the target image features, target text features, image interaction features, and text interaction features, respectively; and inputting the first association matrix, the second association matrix, the third association matrix, and the fourth association matrix into the hypergraph convolutional neural network to obtain the first feature, the second feature, the third feature, and the fourth feature.

[0051] In this embodiment, a first association matrix is constructed based on the target image features; a second association matrix is constructed based on the target text features; a third association matrix is constructed based on the image interaction features; and a fourth association matrix is constructed based on the text interaction features.

[0052] For example, for each of the features obtained in the above steps (r=1,2,3,4), an association matrix between vertices and hyperedges is constructed, in which the rows correspond to the N feature samples and the columns correspond to the hyperedges. The vertex set consists of the N feature samples, and the initial hyperedge set is empty. Each sample is taken in turn as a center node, and the samples most similar to it in terms of features are found; these neighboring feature samples, together with the center node, form a hyperedge. If this hyperedge is not yet in the hyperedge set, it is inserted into the set. The association matrix is then constructed according to whether each of the N samples belongs to each hyperedge in the hyperedge set: if a sample belongs to a hyperedge, the element in that sample's row and that hyperedge's column is 1; otherwise it is 0.
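The construction above can be sketched as follows. This is an illustration under stated assumptions rather than the application's implementation: cosine similarity is assumed as the similarity measure, and the function name `knn_incidence_matrix` and the neighbor count `k` are hypothetical.

```python
import numpy as np

def knn_incidence_matrix(features, k):
    """Build a vertex-by-hyperedge 0/1 matrix from nearest-neighbor hyperedges."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Assumption: cosine similarity is used to find the most similar samples.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a center node is not its own neighbor

    hyperedges, seen = [], set()
    for center in range(n):
        neighbors = np.argsort(-sim[center], kind="stable")[:k]
        edge = frozenset([center, *neighbors.tolist()])
        if edge not in seen:  # insert only hyperedges not already in the set
            seen.add(edge)
            hyperedges.append(edge)

    H = np.zeros((n, len(hyperedges)), dtype=int)
    for j, edge in enumerate(hyperedges):
        H[list(edge), j] = 1  # sample belongs to hyperedge j
    return H

# Four toy samples forming two obvious groups (illustrative values only).
samples = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
H = knn_incidence_matrix(samples, k=1)
```

Deduplicating hyperedges means the number of columns can be smaller than the number of samples, as in this toy case where four center nodes yield only two distinct hyperedges.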

[0053] In this embodiment of the application, the first association matrix, the second association matrix, the third association matrix, and the fourth association matrix obtained above are respectively input into the hypergraph convolutional neural network to obtain the first feature, the second feature, the third feature, and the fourth feature.

[0054] Optionally, the hypergraph convolutional neural network includes at least a first hypergraph convolutional neural network, a second hypergraph convolutional neural network, a third hypergraph convolutional neural network, and a fourth hypergraph convolutional neural network, which are arranged in parallel in the cross-modal retrieval model.

[0055] For example, after the target image features, target text features, image interaction features, and text interaction features are respectively input into the corresponding first, second, third, and fourth hypergraph convolutional neural networks, each hypergraph convolutional neural network learns features of a set dimension that contain higher-order relationship information, where the parameters are those of the respective hypergraph convolutional neural network.
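The application does not reproduce the layer formula here, so the following is only a sketch of one widely used hypergraph convolution, the normalized propagation Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X Θ followed by ReLU. The choice of this operator, the unit hyperedge weights, the ReLU activation, and the names `hypergraph_conv`, `X`, and `Theta` are all assumptions made for illustration.

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One hypergraph convolution layer with unit hyperedge weights (assumed form)."""
    H = np.asarray(H, dtype=float)
    vertex_degree = H.sum(axis=1)  # number of hyperedges each vertex belongs to
    edge_degree = H.sum(axis=0)    # number of vertices in each hyperedge
    dv_inv_sqrt = 1.0 / np.sqrt(np.clip(vertex_degree, 1e-12, None))
    de_inv = 1.0 / np.clip(edge_degree, 1e-12, None)
    # Propagation operator: Dv^-1/2 H De^-1 H^T Dv^-1/2
    G = (dv_inv_sqrt[:, None] * H * de_inv[None, :]) @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(G @ X @ Theta, 0.0)  # ReLU is an assumed activation

rng = np.random.default_rng(0)
H = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # two disjoint hyperedges
X = rng.normal(size=(4, 3))                      # toy input features
Theta = rng.normal(size=(3, 2))                  # toy layer parameters
out = hypergraph_conv(X, H, Theta)
```

In this toy case the two vertices of each hyperedge receive identical outputs, because the operator averages features within a hyperedge before the linear map; this is the sense in which the layer aggregates information beyond pairwise links.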

[0056] As an optional implementation, the objective function of the cross-modal retrieval model is determined based on initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, a first feature, a second feature, a third feature, and a fourth feature to construct the cross-modal retrieval model. This includes: determining image hash coding features and binary image features based on the first and third features, and determining text hash coding features and binary text features based on the second and fourth features; and determining the objective function based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features to construct the cross-modal retrieval model.

[0057] In this embodiment, image hash coding features and binary image features are determined based on the first and third features, and text hash coding features and binary text features are determined based on the second and fourth features. Then, based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features, an objective function is determined in order to construct a cross-modal retrieval model. The image hash coding features can be called image continuous hash coding. The binary image features can be called image binary hash coding. The text hash coding features can be called text hash coding or text continuous hash coding. The binary text features can be called text binary hash coding.

[0058] For example, image hash coding features and binary image features are determined based on the first feature and the third feature, and text hash coding features and binary text features are determined based on the second feature and the fourth feature. The objective function is then determined based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features, so as to construct a cross-modal retrieval model.

[0059] As an optional embodiment, determining image hash coding features and binary image features based on a first feature and a third feature includes: fusing the first feature and the third feature to obtain a first fused feature; mapping the first fused feature to obtain an image hash coding feature; and converting the image hash coding feature using a sign function to obtain a binary image feature. Determining text hash coding features and binary text features based on a second feature and a fourth feature includes: fusing the second feature and the fourth feature to obtain a second fused feature; mapping the second fused feature to obtain a text hash coding feature; and converting the text hash coding feature using a sign function to obtain a binary text feature.

[0060] In this embodiment, determining image hash coding features and binary image features based on a first feature and a third feature includes: fusing the obtained first feature and third feature using an activation function to obtain a first fused feature; then mapping the first fused feature using an activation function to obtain image hash coding features; and finally converting the image hash coding features using a sign function to obtain binary image features. The activation function can be a ... function, and the sign function can be a ... function.

[0061] For example, the first feature and the third feature are fused through the activation function into the first fused feature. The first fused feature is then projected into an image continuous hash code of a set dimension. Finally, the sign function converts the image continuous hash code into binary image features. The fusion and the projection use weight matrices and bias vectors as their parameters.

[0062] Optionally, determining the text hash coding features and binary text features based on the second and fourth features includes: fusing the second and fourth features using an activation function to obtain a second fused feature; then mapping the second fused feature using an activation function to obtain the text hash coding features; and finally converting the text hash coding features using a sign function to obtain the binary text features.

[0063] For example, the second feature and the fourth feature are fused through the activation function into the second fused feature. The second fused feature is then projected into a text continuous hash code of a set dimension. Finally, the sign function converts the text continuous hash code into binary text features of that dimension. The fusion and the projection use weight matrices and bias vectors as their parameters.
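The fuse, project, and binarize sequence used for both the image branch and the text branch can be sketched as below. The text does not name the activation function or the fusion operator, so tanh and concatenation followed by a linear map are assumptions; the function name `hash_head` and all parameter names and shapes are hypothetical.

```python
import numpy as np

def hash_head(feat_a, feat_b, W_fuse, b_fuse, W_hash, b_hash):
    """Fuse two feature sets, project to a continuous hash code, then binarize."""
    # Assumed fusion: concatenate, apply a linear map, then tanh.
    fused = np.tanh(np.concatenate([feat_a, feat_b], axis=1) @ W_fuse + b_fuse)
    # Assumed projection: linear map plus tanh keeps the code in [-1, 1].
    continuous = np.tanh(fused @ W_hash + b_hash)
    # Sign step; exact zeros are mapped to +1 so every bit is -1 or +1.
    binary = np.where(continuous >= 0, 1, -1)
    return continuous, binary

rng = np.random.default_rng(0)
first_feature = rng.normal(size=(5, 8))   # e.g. hypergraph output for target image features
third_feature = rng.normal(size=(5, 8))   # e.g. hypergraph output for image interaction features
W_fuse, b_fuse = rng.normal(size=(16, 8)), np.zeros(8)
W_hash, b_hash = rng.normal(size=(8, 4)), np.zeros(4)
continuous, binary = hash_head(first_feature, third_feature, W_fuse, b_fuse, W_hash, b_hash)
```

The same function would serve the text branch by passing the second and fourth features with their own weight matrices and bias vectors. Keeping the continuous code alongside the binary one matters during training, since the sign step has zero gradient almost everywhere.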

[0064] As an optional implementation, an objective function is determined based on initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, image hash coding features, binary image features, text hash coding features, and binary text features to construct a cross-modal retrieval model. This includes: constructing a triplet loss function based on initial image features, initial text features, and initial cultural symbol features; constructing a first quadruple loss function based on target image features, target text features, image interaction features, and text interaction features; constructing a second quadruple loss function based on image hash coding features, binary image features, text hash coding features, and binary text features; and determining the objective function based on the triplet loss function, the first quadruple loss function, and the second quadruple loss function to construct the cross-modal retrieval model.

[0065] In this embodiment, a triplet loss function is constructed based on initial image features, initial text features, and initial cultural symbol features. Then, a first quadruple loss function is constructed based on target image features, target text features, image interaction features, and text interaction features. Subsequently, a second quadruple loss function is constructed based on image hash coding features, binary image features, text hash coding features, and binary text features. Finally, the objective function is determined based on the triplet loss function, the first quadruple loss function, and the second quadruple loss function obtained above, in order to construct a cross-modal retrieval model.

[0066] Optionally, the triplet loss function can be called the triplet cross-entropy loss mean function. The first quadruple loss function can be called the first quadruple cross-entropy contrastive loss mean function. The second quadruple loss function can be called the second quadruple cross-entropy contrastive loss mean function.

[0067] For example, after the triplet cross-entropy loss mean function, the first quadruple cross-entropy contrastive loss mean function, and the second quadruple cross-entropy contrastive loss mean function are constructed, the three are weighted and summed to determine the objective function, thereby constructing the cross-modal retrieval model.

[0068] As an optional implementation, a triplet loss function is constructed based on initial image features, initial text features, and initial cultural symbol features, including: determining a first cross-entropy loss function, a second cross-entropy loss function, and a third cross-entropy loss function based on the initial image features, initial text features, and initial cultural symbol features; and constructing a triplet loss function based on a first preset coefficient, the first cross-entropy loss function, the second cross-entropy loss function, and the third cross-entropy loss function.

[0069] In this embodiment, a first cross-entropy loss function, a second cross-entropy loss function, and a third cross-entropy loss function are determined based on initial image features, initial text features, and initial cultural symbol features. Furthermore, a triplet loss function is constructed based on a first preset coefficient, the first cross-entropy loss function, the second cross-entropy loss function, and the third cross-entropy loss function. The first preset coefficient can be one-third.

[0070] Optionally, the first cross-entropy loss function is determined based on the initial image features, initial text features, and initial cultural symbol features using the following formula:

[0071] In this formula, the first cross-entropy loss function, which can also be called the first cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses the cultural symbol features, image features, and text features of the i-th sample and the image features and text features of the j-th sample, compared by a cosine similarity function. A hyperparameter controls the smoothness of the loss function: the smaller its value, the more sensitive the model is to differences in similarity.

[0072] Optionally, the second cross-entropy loss function is determined based on the initial image features, initial text features, and initial cultural symbol features using the following formula:

[0073] In this formula, the second cross-entropy loss function, which can also be called the second cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses the cultural symbol features, image features, and text features of the i-th sample and the cultural symbol features and image features of the j-th sample.

[0074] Optionally, the third cross-entropy loss function is determined based on the initial image features, initial text features, and initial cultural symbol features using the following formula:

[0075] In this formula, the third cross-entropy loss function, which can also be called the third cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses the cultural symbol features, image features, and text features of the i-th sample and the image features and cultural symbol features of the j-th sample. Furthermore, the triplet loss function is obtained from the first preset coefficient and the first, second, and third cross-entropy loss functions.
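Since the formulas themselves are not reproduced in this text, the sketch below shows only one plausible form of such an anchor-based cross-entropy contrastive loss, built from the ingredients the text does name: cosine similarity, a smoothing hyperparameter (called `tau` here), same-sample features as positives, different-sample features as negatives, and a coefficient of one-third. The exact arrangement of terms, the function names, and the toy data are assumptions.

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A = A / np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
    B = B / np.clip(np.linalg.norm(B, axis=1, keepdims=True), 1e-12, None)
    return A @ B.T

def anchor_contrastive_loss(anchor, others, tau=0.1):
    """Assumed form: -log(sum over same-sample pairs / sum over all pairs), averaged."""
    n = anchor.shape[0]
    positive = np.zeros(n)
    total = np.zeros(n)
    for other in others:
        scores = np.exp(cosine_matrix(anchor, other) / tau)
        positive += np.diag(scores)   # anchor i against features of the same sample i
        total += scores.sum(axis=1)   # anchor i against features of every sample j
    return float(np.mean(-np.log(positive / total)))

def triplet_loss(c, v, t, tau=0.1):
    """Mean of three losses, each using one feature type as the anchor (coefficient 1/3)."""
    losses = [
        anchor_contrastive_loss(c, [v, t], tau),
        anchor_contrastive_loss(t, [c, v], tau),
        anchor_contrastive_loss(v, [c, t], tau),
    ]
    return sum(losses) / 3.0

c = v = t = np.eye(4)                                      # perfectly aligned toy features
aligned = triplet_loss(c, v, t)
shuffled = triplet_loss(c, np.roll(v, 1, axis=0), t)       # image features mismatched
```

Under this form, aligned features give a loss close to zero while mismatched image features raise it sharply, and a smaller `tau` widens that gap. The same `anchor_contrastive_loss` helper extends to four feature types by passing three entries in `others` and averaging with a coefficient of one-quarter.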

[0076] As an optional embodiment, a first quadruple loss function is constructed based on target image features, target text features, image interaction features, and text interaction features. This includes: determining a fourth cross-entropy loss function, a fifth cross-entropy loss function, a sixth cross-entropy loss function, and a seventh cross-entropy loss function based on the target image features, target text features, image interaction features, and text interaction features; and constructing the first quadruple loss function based on a second preset coefficient, the fourth cross-entropy loss function, the fifth cross-entropy loss function, the sixth cross-entropy loss function, and the seventh cross-entropy loss function.

[0077] In this embodiment, based on the target image features, target text features, image interaction features, and text interaction features, a fourth cross-entropy loss function, a fifth cross-entropy loss function, a sixth cross-entropy loss function, and a seventh cross-entropy loss function can be determined. Then, based on a second preset coefficient, the fourth cross-entropy loss function, the fifth cross-entropy loss function, the sixth cross-entropy loss function, and the seventh cross-entropy loss function, a first quadruple loss function is constructed. The second preset coefficient can be one-quarter.

[0078] Optionally, the fourth cross-entropy loss function is determined based on the target image features, target text features, image interaction features, and text interaction features using the following formula:

[0079] In this formula, the fourth cross-entropy loss function, which can also be called the fourth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content; for the j-th sample, it uses the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content.

[0080] Optionally, the fifth cross-entropy loss function is determined based on the target image features, target text features, image interaction features, and text interaction features using the following formula:

[0081] In this formula, the fifth cross-entropy loss function, which can also be called the fifth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content; for the j-th sample, it uses the target image features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content.

[0082] Optionally, the sixth cross-entropy loss function is determined based on the target image features, target text features, image interaction features, and text interaction features using the following formula:

[0083] In this formula, the sixth cross-entropy loss function, which can also be called the sixth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content; for the j-th sample, it uses the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, and the text interaction features guided by image content.

[0084] Optionally, the seventh cross-entropy loss function is determined based on the target image features, target text features, image interaction features, and text interaction features using the following formula:

[0085] In this formula, the seventh cross-entropy loss function, which can also be called the seventh cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content; for the j-th sample, it uses the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, and the image interaction features guided by text content. Furthermore, the first quadruple loss function is obtained from the second preset coefficient and the fourth, fifth, sixth, and seventh cross-entropy loss functions.

[0086] As an optional implementation, a second quadruple loss function is constructed based on image hash coding features, binary image features, text hash coding features, and binary text features. This includes: determining the eighth, ninth, tenth, and eleventh cross-entropy loss functions based on the image hash coding features, binary image features, text hash coding features, and binary text features; and constructing the second quadruple loss function based on the second preset coefficient and the eighth, ninth, tenth, and eleventh cross-entropy loss functions.

[0087] In this embodiment, based on the image hash coding features, binary image features, text hash coding features, and binary text features obtained in the above steps, the eighth cross-entropy loss function, the ninth cross-entropy loss function, the tenth cross-entropy loss function, and the eleventh cross-entropy loss function are determined. Then, based on the second preset coefficient, the eighth cross-entropy loss function, the ninth cross-entropy loss function, the tenth cross-entropy loss function, and the eleventh cross-entropy loss function, the second quadruple loss function is constructed.

[0088] Optionally, the eighth cross-entropy loss function is determined based on image hash coding features, binary image features, text hash coding features, and binary text features using the following formula:

[0089] In this formula, the eighth cross-entropy loss function, which can also be called the eighth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the image continuous hash code, image binary hash code, text continuous hash code, and text binary hash code; for the j-th sample, it uses the image binary hash code, text continuous hash code, and text binary hash code.

[0090] Optionally, the ninth cross-entropy loss function is determined based on image hash coding features, binary image features, text hash coding features, and binary text features using the following formula:

[0091] In this formula, the ninth cross-entropy loss function, which can also be called the ninth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the image continuous hash code, image binary hash code, text continuous hash code, and text binary hash code; for the j-th sample, it uses the image continuous hash code, text continuous hash code, and text binary hash code.

[0092] Optionally, the tenth cross-entropy loss function is determined based on image hash coding features, binary image features, text hash coding features, and binary text features using the following formula:

[0093] In this formula, the tenth cross-entropy loss function, which can also be called the tenth cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the image continuous hash code, image binary hash code, text continuous hash code, and text binary hash code; for the j-th sample, it uses the image continuous hash code, image binary hash code, and text binary hash code.

[0094] Optionally, the eleventh cross-entropy loss function is determined based on image hash coding features, binary image features, text hash coding features, and binary text features using the following formula:

[0095] In this formula, the eleventh cross-entropy loss function, which can also be called the eleventh cross-entropy contrastive loss function, is computed over the sample indices i and j. It uses, for the i-th sample, the image continuous hash code, image binary hash code, text continuous hash code, and text binary hash code; for the j-th sample, it uses the image continuous hash code, image binary hash code, and text continuous hash code. Furthermore, the second quadruple loss function is obtained from the second preset coefficient and the eighth, ninth, tenth, and eleventh cross-entropy loss functions.

[0096] Optionally, the objective function can be expressed by the following formula:

[0097] In this formula, the learnable parameters of the cross-modal retrieval model are the variables to be solved, and three weights balance the loss functions of the three learning objectives. The first learning objective is the mean of the triplet cross-entropy losses formed from the cultural symbol features, image features, and text features, where the mean is taken over the number of samples. It combines three cross-entropy contrastive loss functions. The first takes the cultural symbol features as the anchor, the image features and text features of the same sample as positive samples, and the image features and text features of different samples as negative samples. The second and third each take one of the remaining features as the anchor, the other two features of the same sample as positive samples, and the corresponding features of different samples as negative samples.

[0098] Furthermore, the second learning objective is the mean of the quadruple cross-entropy contrastive losses formed from the target image features enhanced with cultural symbols, the target text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content. It combines four cross-entropy contrastive loss functions, each of which takes one of these four features as the anchor, the other three features of the same sample as positive samples, and the corresponding three features of different samples as negative samples.

[0099] It should be noted that the third learning objective is the mean of the quadruple cross-entropy contrastive losses formed from the image continuous hash code, image binary hash code, text continuous hash code, and text binary hash code, which the hypergraph convolutional neural network learns and outputs based on the similarity among the four preceding features. It combines four cross-entropy contrastive loss functions, each of which takes one of these four hash codes as the anchor, the other three hash codes of the same sample as positive samples, and the corresponding three hash codes of different samples as negative samples.
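The weighted combination of the three learning objectives can be written compactly as below. This is a sketch of the structure the text describes, not a reproduction of the application's formula: all symbol names ($\theta$, $\alpha$, $\beta$, $\gamma$, $\mathcal{L}_{1}$ through $\mathcal{L}_{11}$, and the three grouped losses) are chosen here for illustration, and the coefficients one-third and one-quarter are the optional preset values mentioned in the text.

```latex
\begin{aligned}
\min_{\theta}\ \mathcal{L}(\theta)
  &= \alpha\,\mathcal{L}_{\mathrm{tri}}
   + \beta\,\mathcal{L}_{\mathrm{quad},1}
   + \gamma\,\mathcal{L}_{\mathrm{quad},2}, \\
\mathcal{L}_{\mathrm{tri}}
  &= \tfrac{1}{3}\left(\mathcal{L}_{1} + \mathcal{L}_{2} + \mathcal{L}_{3}\right), \\
\mathcal{L}_{\mathrm{quad},1}
  &= \tfrac{1}{4}\left(\mathcal{L}_{4} + \mathcal{L}_{5} + \mathcal{L}_{6} + \mathcal{L}_{7}\right), \\
\mathcal{L}_{\mathrm{quad},2}
  &= \tfrac{1}{4}\left(\mathcal{L}_{8} + \mathcal{L}_{9} + \mathcal{L}_{10} + \mathcal{L}_{11}\right).
\end{aligned}
```

Here $\mathcal{L}_{1}$ to $\mathcal{L}_{11}$ stand for the first to eleventh cross-entropy loss functions, and the three weights let the training trade off initial-feature alignment, attention-feature alignment, and hash-code alignment.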

[0100] By applying the technical solution of this embodiment, image information, text information, and cultural symbol information are first obtained from cultural asset resources. A pre-trained multimodal large model then extracts the corresponding initial image features, initial text features, and initial cultural symbol features from the obtained information. The three initial features are respectively input into a multi-head cross-attention mechanism to obtain target image features, target text features, image interaction features, and text interaction features. These four features are respectively input into a hypergraph convolutional neural network to obtain a first feature, a second feature, a third feature, and a fourth feature that further contain higher-order relationships. All the features extracted above are combined to determine the objective function and thereby construct a cross-modal retrieval model. Finally, the information to be retrieved is input into the trained model to output the final retrieval result. Multi-head cross-attention is used to acquire the target image features, target text features, image interaction features, and text interaction features, so that cultural symbols containing the core value of cultural heritage are explicitly extracted and participate in cross-modal feature alignment; the hypergraph convolutional neural network is then introduced to model high-order complex relationships and aggregate global information for these four features. This solves the technical problem of low cross-modal retrieval accuracy for cultural heritage and achieves the technical effect of improving that accuracy.

[0101] Furthermore, as a refinement and extension of the specific implementation methods of the above embodiments, and in order to fully explain the specific implementation process of this embodiment, Figure 2 illustrates a flowchart of another cross-modal retrieval method for cultural heritage provided in an embodiment of this application. As shown in Figure 2, the method includes: Step S201: Extract image features, text features, and cultural symbol features based on a pre-trained multimodal large language model.

[0102] Step S202: Calculate the image features enhanced with cultural symbols, the text features enhanced with cultural symbols, the image interaction features guided by text content, and the text interaction features guided by image content using multi-head cross-attention.
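The data flow of Step S202 can be illustrated with a minimal sketch. It assumes that "enhanced with cultural symbols" means the image or text features act as queries over the cultural symbol features, and that "guided by" means the guided modality queries the guiding one; the source does not spell out these roles. The learned query, key, value, and output projections of a real multi-head attention layer are omitted, and all function names and toy values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_cross_attention(queries, context, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.

    Assumption: learned projections are omitted (identity), so this shows
    only the data flow, not a trainable layer.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0 and len(context[0]) == dim
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        q = [row[sl] for row in queries]
        kv = [row[sl] for row in context]
        heads.append(cross_attention(q, kv, kv))
    # Concatenate the per-head outputs row by row.
    return [sum((head[i] for head in heads), []) for i in range(len(queries))]

# Toy features (hypothetical): 2 image tokens, 3 symbol tokens, 2 text tokens, dim 4.
image = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9]]
symbols = [[0.9, 0.1, 0.4, 0.0], [0.1, 0.8, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
text = [[0.2, 0.7, 0.3, 0.3], [0.8, 0.2, 0.6, 0.1]]

target_image = multi_head_cross_attention(image, symbols, num_heads=2)    # symbol-enhanced image
target_text = multi_head_cross_attention(text, symbols, num_heads=2)      # symbol-enhanced text
image_interaction = multi_head_cross_attention(image, text, num_heads=2)  # text-guided image
text_interaction = multi_head_cross_attention(text, image, num_heads=2)   # image-guided text
```

Each output row is a weighted average of the context rows, so the four results keep the shape of their query inputs while mixing in information from the other source.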

[0103] Step S203: Based on feature similarity, obtain the image continuous hash code and the image binary hash code, as well as the text continuous hash code and the text binary hash code.
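As an illustration of how a continuous hash code and a binary hash code relate, the following sketch follows the fuse, map, and sign sequence described in claim 4. The specific choices are assumptions not stated in the source: fusion is shown as concatenation, the mapping as a linear layer followed by tanh, and a zero entry is mapped to +1. The weights and feature values are hypothetical.

```python
import math

def fuse(feature_a, feature_b):
    """Assumed fusion: plain concatenation of two feature vectors."""
    return list(feature_a) + list(feature_b)

def continuous_hash(fused, weights):
    """Assumed mapping: linear layer followed by tanh, giving values in (-1, 1)."""
    return [math.tanh(sum(w * x for w, x in zip(row, fused))) for row in weights]

def binary_hash(continuous):
    """Sign function; zero is mapped to +1 here by convention (an assumption)."""
    return [1 if c >= 0 else -1 for c in continuous]

# Hypothetical first and third features (hypergraph outputs for the image side).
first_feature = [0.4, -0.2]
third_feature = [0.1, 0.9]
# Hypothetical 4-bit mapping weights (one row per hash bit).
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, -1.0],
]

fused = fuse(first_feature, third_feature)
image_continuous = continuous_hash(fused, weights)
image_binary = binary_hash(image_continuous)
print(image_binary)  # -> [1, -1, 1, -1]
```

The text side would follow the same path using the second and fourth features.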

[0104] Step S204: Construct a multivariate cross-entropy contrastive loss function based on different features.
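The exact form of the loss is not reproduced in this text. One common way to build a cross-entropy contrastive loss, shown here purely as an assumed sketch, treats the matching pair in a batch as the correct class among all candidates and combines several pairwise losses with a preset coefficient. The temperature, the coefficient, and the way the three terms are summed are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_cross_entropy(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # equals -log softmax(logits)[i]
    return total / len(anchors)

def three_way_loss(image, text, symbol, coeff=1.0):
    """Assumed combination: coefficient-weighted sum of three pairwise losses."""
    return coeff * (contrastive_cross_entropy(image, text)
                    + contrastive_cross_entropy(image, symbol)
                    + contrastive_cross_entropy(text, symbol))

# Toy batch of two samples with roughly aligned modalities (hypothetical values).
image = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
symbol = [[0.8, 0.2], [0.2, 0.8]]

aligned = three_way_loss(image, text, symbol)
shuffled = three_way_loss(image, text[::-1], symbol)  # text rows mismatched
```

With the text rows mismatched, the loss is larger than for the aligned batch, which is the behaviour a contrastive objective relies on.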

[0105] Step S205: Solve for the model parameters.

[0106] In this embodiment, the parameters of the cross-modal retrieval model for cultural asset resources are estimated by gradient descent, as follows: Step 1: set the number of iterations t = 0 and initialize the parameters. Step 2: randomly select a batch of samples from the training set and, for each sample, calculate the loss function of the model. Step 3: calculate the gradient of the loss function with respect to the parameters. Step 4: update the parameters according to the step size. Step 5: repeat Steps 2 to 4, incrementing t each time, until the objective function converges or t is greater than the iteration threshold, then output the model parameters.
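The five steps can be sketched as a generic mini-batch gradient descent loop. This is only an illustration of the control flow: the single scalar parameter, the squared-error loss, the learning rate, and the tolerance are stand-ins chosen for the example, not the model parameters or objective function of this application.

```python
import random

def train(samples, lr=0.1, batch_size=2, max_iters=500, tol=1e-10, seed=0):
    """Mini-batch gradient descent on a toy loss: mean of (theta - x)^2."""
    rng = random.Random(seed)
    theta = 0.0                      # Step 1: t = 0, initialize the parameter
    prev_loss = float("inf")
    for t in range(max_iters):       # stop once t reaches the iteration threshold
        batch = rng.sample(samples, batch_size)                      # Step 2: random batch
        loss = sum((theta - x) ** 2 for x in batch) / batch_size     # Step 2: batch loss
        grad = sum(2.0 * (theta - x) for x in batch) / batch_size    # Step 3: gradient
        theta -= lr * grad                                           # Step 4: update
        if abs(prev_loss - loss) < tol:                              # Step 5: convergence
            break
        prev_loss = loss
    return theta

samples = [1.0, 2.0, 3.0, 4.0]
theta_minibatch = train(samples)                           # noisy estimate near the mean
theta_fullbatch = train(samples, batch_size=len(samples))  # converges close to 2.5
```

With the full batch the loss decreases monotonically, so the convergence test in Step 5 ends the loop; with small random batches the loss fluctuates and the iteration threshold usually ends it instead.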

[0107] Step S206: Calculate the Hamming distance.

[0108] In this embodiment, the final retrieval result can be determined using the obtained Hamming distance, as follows: Step 1: calculate the binary hash codes of all texts and images in the retrieval set according to the model parameters. Step 2: if text is being retrieved with an image query, calculate the binary hash code of the query image; if images are being retrieved with a text query, calculate the binary hash code of the query text. Step 3: compare the binary code of the query object with each binary code in the retrieval set bit by bit, and count the number of identical bits as the similarity (equivalently, the code length minus the Hamming distance). Step 4: return the samples with the highest similarity as the search result.
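The retrieval procedure above can be sketched as follows. The number of results to return is not given in this text, so `top_k` is a hypothetical parameter, and the item names and 6-bit codes are made up for illustration.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def retrieve(query_code, retrieval_set, top_k=1):
    """Rank items by number of identical bits (code length minus Hamming distance)."""
    length = len(query_code)
    scored = [(length - hamming_distance(query_code, code), name)
              for name, code in retrieval_set.items()]
    # Highest similarity first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

# Step 1: binary hash codes of the retrieval set (here, texts).
text_codes = {
    "text_a": [1, -1, 1, 1, -1, -1],
    "text_b": [1, 1, 1, -1, -1, 1],
    "text_c": [-1, -1, -1, 1, 1, 1],
}
# Step 2: binary hash code of the query image.
query_image_code = [1, 1, 1, -1, -1, -1]

# Steps 3 and 4: bitwise comparison and ranking.
print(retrieve(query_image_code, text_codes, top_k=2))  # -> ['text_b', 'text_a']
```

Because the comparison only counts matching positions, the same function works whether the codes use values of -1 and +1 or 0 and 1.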

[0109] In this embodiment, image information, text information, and cultural symbol information are first obtained from cultural asset resources. A pre-trained multimodal large model then extracts the corresponding initial image features, initial text features, and initial cultural symbol features from this information. Next, the three initial features are respectively input into a multi-head cross-attention mechanism to obtain target image features, target text features, image interaction features, and text interaction features. These four features are then respectively input into a hypergraph convolutional neural network to obtain a first feature, a second feature, a third feature, and a fourth feature that additionally contain higher-order relationships. All of the features extracted above are combined to determine the objective function and thereby construct a cross-modal retrieval model. Finally, the information to be retrieved is input into the trained model, which outputs the final retrieval result. Multi-head cross-attention is used to acquire the target image features, target text features, image interaction features, and text interaction features, so that cultural symbols containing the core value of cultural heritage are explicitly extracted and participate in cross-modal feature alignment. The hypergraph convolutional neural network is introduced to model high-order complex relationships and aggregate global information for the target image features, the target text features, the image interaction features, and the text interaction features. Together, these measures solve the technical problem of low cross-modal retrieval accuracy for cultural heritage and achieve the technical effect of improving that accuracy.
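To make the idea of high-order relationship modeling concrete, the following sketch shows one propagation step of a widely used hypergraph convolution form, in which node features are normalized by node degree, averaged within each hyperedge, and scattered back to the nodes. It is an assumption that the application uses this variant. Building each hyperedge from a sample and its nearest neighbours is likewise only one possible way to obtain the correlation (incidence) matrix; unit hyperedge weights are used, and learned weights and the activation function are omitted.

```python
import math

def knn_incidence(features, k):
    """Assumed construction: hyperedge e contains node e and its k nearest neighbours."""
    n = len(features)
    incidence = [[0] * n for _ in range(n)]
    for e in range(n):
        others = sorted(
            (v for v in range(n) if v != e),
            key=lambda v: sum((a - b) ** 2 for a, b in zip(features[v], features[e])),
        )
        for v in [e] + others[:k]:
            incidence[v][e] = 1
    return incidence

def hypergraph_conv(features, incidence):
    """One step of Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X with unit hyperedge weights."""
    n, m, d = len(features), len(incidence[0]), len(features[0])
    node_deg = [sum(row) for row in incidence]
    edge_deg = [sum(incidence[v][e] for v in range(n)) for e in range(m)]
    # Scale node features by the inverse square root of the node degree.
    scaled = [[x / math.sqrt(node_deg[v]) for x in features[v]] for v in range(n)]
    # Gather nodes into hyperedges and average within each hyperedge.
    edge_feat = [[sum(incidence[v][e] * scaled[v][j] for v in range(n)) / edge_deg[e]
                  for j in range(d)] for e in range(m)]
    # Scatter hyperedge features back to nodes and rescale.
    return [[sum(incidence[v][e] * edge_feat[e][j] for e in range(m)) / math.sqrt(node_deg[v])
             for j in range(d)] for v in range(n)]

# Toy one-dimensional features forming two clusters (hypothetical values).
features = [[0.0], [0.1], [5.0], [5.1]]
incidence = knn_incidence(features, k=1)
smoothed = hypergraph_conv(features, incidence)
```

In this toy case each sample ends up with the mean of its cluster, which illustrates how a hyperedge aggregates information across all of its members at once rather than pair by pair.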

[0110] As a specific implementation of the method shown in Figure 1, Figure 3 provides a schematic diagram of the structure of a cross-modal retrieval device for cultural heritage according to an embodiment of this application. As shown in Figure 3, the cross-modal retrieval device 300 for cultural heritage includes: a first acquisition unit 301, a second acquisition unit 302, a third acquisition unit 303, a fourth acquisition unit 304, a determination unit 305, and an analysis unit 306.

[0111] The first acquisition unit 301 is used to acquire image information, text information and cultural symbol information from cultural asset resources.

[0112] The second acquisition unit 302 is used to acquire initial image features, initial text features, and initial cultural symbol features from image information, text information, and cultural symbol information respectively using a pre-trained multimodal large language model.

[0113] The third acquisition unit 303 is used to input the initial image features, initial text features and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model to obtain target image features, target text features, image interaction features and text interaction features.

[0114] The fourth acquisition unit 304 is used to input the target image features, target text features, image interaction features and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature and the fourth feature.

[0115] The determining unit 305 is used to determine the objective function of the cross-modal retrieval model based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, first features, second features, third features, and fourth features, so as to construct the cross-modal retrieval model.

[0116] The analysis unit 306 is used to acquire the information to be retrieved and input the information to be retrieved into a pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

[0117] In this embodiment, the first acquisition unit acquires image information, text information, and cultural symbol information from cultural asset resources; the second acquisition unit uses a pre-trained multimodal large language model to acquire initial image features, initial text features, and initial cultural symbol features from the image information, text information, and cultural symbol information, respectively; the third acquisition unit inputs the initial image features, initial text features, and initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model, respectively, to obtain target image features, target text features, image interaction features, and text interaction features; and the fourth acquisition unit inputs the target image features, target text features, image interaction features, and text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model, respectively, to obtain the first feature, the second feature, the third feature, and the fourth feature. The determining unit determines the objective function of the cross-modal retrieval model based on the initial image features, initial text features, initial cultural symbol features, target image features, target text features, image interaction features, text interaction features, the first feature, the second feature, the third feature, and the fourth feature, thereby constructing the cross-modal retrieval model. The analysis unit obtains the information to be retrieved and inputs it into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results. This solves the technical problem of low accuracy in cross-modal retrieval of cultural heritage and achieves the technical effect of improving that accuracy.

[0118] It should be noted that other corresponding descriptions of the functional units involved in the cross-modal retrieval device for cultural heritage provided in this embodiment can be found in the corresponding descriptions of the method shown in Figure 1, and are not repeated here.

[0119] This application also provides a computer device, which may specifically be a personal computer, a server, a network device, etc. Figure 4 provides a schematic diagram of the device structure of a computer device according to an embodiment of the present application. As shown in Figure 4, the computer device includes a bus, a processor, memory, and a communication interface, and may also include an input/output interface and a display device. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage medium. The database stores location information. The communication interface allows communication with external terminals via a network connection. When the computer program is executed by the processor, it implements the steps in the various method embodiments.

[0120] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0121] In one embodiment, a computer-readable storage medium is provided, which may be non-volatile or volatile, having stored thereon a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0122] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0123] It should be noted that the user personal information involved in the embodiments of this application is all authorized (with the knowledge and consent) by the relevant parties or fully authorized by all parties, and the executing entity can obtain it through various legal and compliant means. The collection, storage, use, processing, transmission, provision, and disclosure of the information, data, and signals involved all comply with the relevant laws and regulations of the relevant countries and regions, and do not violate public order and good morals. It should be noted that if any software tools or components other than those of this company appear in the embodiments of this application, they are merely illustrative examples and do not represent actual use.

[0124] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0125] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0126] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A cross-modal retrieval method for cultural heritage, characterized in that, The method includes: Extracting image information, text information, and cultural symbol information from cultural heritage resources; Using a pre-trained multimodal large language model, initial image features, initial text features, and initial cultural symbol features are obtained from the image information, the text information, and the cultural symbol information, respectively. The initial image features, the initial text features, and the initial cultural symbol features are respectively input into the multi-head cross-attention of the cross-modal retrieval model to obtain target image features, target text features, image interaction features, and text interaction features; The target image features, target text features, image interaction features, and text interaction features are respectively input into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature; Based on the initial image features, the initial text features, the initial cultural symbol features, the target image features, the target text features, the image interaction features, the text interaction features, the first feature, the second feature, the third feature, and the fourth feature, the objective function of the cross-modal retrieval model is determined to construct the cross-modal retrieval model; The information to be retrieved is obtained and input into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

2. The method according to claim 1, characterized in that, The target image features, target text features, image interaction features, and text interaction features are respectively input into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, second feature, third feature, and fourth feature, including: Based on the target image features, the target text features, the image interaction features, and the text interaction features, a first correlation matrix, a second correlation matrix, a third correlation matrix, and a fourth correlation matrix are constructed respectively. The first correlation matrix, the second correlation matrix, the third correlation matrix, and the fourth correlation matrix are respectively input into the hypergraph convolutional neural network to obtain the first feature, the second feature, the third feature, and the fourth feature.

3. The method according to claim 1, characterized in that, Based on the initial image features, the initial text features, the initial cultural symbol features, the target image features, the target text features, the image interaction features, the text interaction features, the first feature, the second feature, the third feature, and the fourth feature, the objective function of the cross-modal retrieval model is determined to construct the cross-modal retrieval model, including: Based on the first feature and the third feature, image hash encoding features and binary image features are determined, and based on the second feature and the fourth feature, text hash encoding features and binary text features are determined. Based on the initial image features, the initial text features, the initial cultural symbol features, the target image features, the target text features, the image interaction features, the text interaction features, the image hash encoding features, the binary image features, the text hash encoding features, and the binary text features, the objective function is determined to construct the cross-modal retrieval model.

4. The method according to claim 3, characterized in that, Based on the first feature and the third feature, image hash encoding features and binary image features are determined, including: The first feature and the third feature are fused to obtain the first fused feature; The first fused feature is mapped to obtain the image hash encoding features; The image hash encoding features are converted using a sign function to obtain the binary image features; Based on the second feature and the fourth feature, text hash encoding features and binary text features are determined, including: The second feature and the fourth feature are fused to obtain the second fused feature; The second fused feature is mapped to obtain the text hash encoding features; The text hash encoding features are converted using the sign function to obtain the binary text features.

5. The method according to claim 3, characterized in that, Based on the initial image features, the initial text features, the initial cultural symbol features, the target image features, the target text features, the image interaction features, the text interaction features, the image hash encoding features, the binary image features, the text hash encoding features, and the binary text features, the objective function is determined to construct the cross-modal retrieval model, including: Based on the initial image features, the initial text features, and the initial cultural symbol features, a triplet loss function is constructed; Based on the target image features, the target text features, the image interaction features, and the text interaction features, a first quadruplet loss function is constructed; Based on the image hash encoding features, the binary image features, the text hash encoding features, and the binary text features, a second quadruplet loss function is constructed; Based on the triplet loss function, the first quadruplet loss function, and the second quadruplet loss function, the objective function is determined to construct the cross-modal retrieval model.

6. The method according to claim 5, characterized in that, Based on the initial image features, the initial text features, and the initial cultural symbol features, a triplet loss function is constructed, including: Based on the initial image features, the initial text features, and the initial cultural symbol features, a first cross-entropy loss function, a second cross-entropy loss function, and a third cross-entropy loss function are determined. The triplet loss function is constructed based on the first preset coefficient, the first cross-entropy loss function, the second cross-entropy loss function, and the third cross-entropy loss function.

7. The method according to claim 5, characterized in that, Based on the target image features, the target text features, the image interaction features, and the text interaction features, a first quadruplet loss function is constructed, including: Based on the target image features, the target text features, the image interaction features, and the text interaction features, a fourth cross-entropy loss function, a fifth cross-entropy loss function, a sixth cross-entropy loss function, and a seventh cross-entropy loss function are determined; Based on the second preset coefficient, the fourth cross-entropy loss function, the fifth cross-entropy loss function, the sixth cross-entropy loss function, and the seventh cross-entropy loss function, the first quadruplet loss function is constructed.

8. The method according to claim 5, characterized in that, Based on the image hash encoding features, the binary image features, the text hash encoding features, and the binary text features, a second quadruplet loss function is constructed, including: Based on the image hash encoding features, the binary image features, the text hash encoding features, and the binary text features, an eighth cross-entropy loss function, a ninth cross-entropy loss function, a tenth cross-entropy loss function, and an eleventh cross-entropy loss function are determined; Based on the second preset coefficient, the eighth cross-entropy loss function, the ninth cross-entropy loss function, the tenth cross-entropy loss function, and the eleventh cross-entropy loss function, the second quadruplet loss function is constructed.

9. A cross-modal retrieval device for cultural heritage, characterized in that, The device includes: The first acquisition unit is used to acquire image information, text information, and cultural symbol information from cultural asset resources; The second acquisition unit is used to acquire initial image features, initial text features, and initial cultural symbol features from the image information, the text information, and the cultural symbol information respectively using a pre-trained multimodal large language model; The third acquisition unit is used to input the initial image features, the initial text features, and the initial cultural symbol features into the multi-head cross-attention of the cross-modal retrieval model, respectively, to obtain target image features, target text features, image interaction features, and text interaction features; The fourth acquisition unit is used to input the target image features, the target text features, the image interaction features, and the text interaction features into the hypergraph convolutional neural network of the cross-modal retrieval model to obtain the first feature, the second feature, the third feature, and the fourth feature, respectively. The determining unit is configured to determine the objective function of the cross-modal retrieval model based on the initial image features, the initial text features, the initial cultural symbol features, the target image features, the target text features, the image interaction features, the text interaction features, the first feature, the second feature, the third feature, and the fourth feature, so as to construct the cross-modal retrieval model; The analysis unit is used to acquire the information to be retrieved and input the information to be retrieved into the pre-trained cross-modal retrieval model for analysis to obtain the retrieval results.

10. A computer device, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method of any one of claims 1 to 8.