A cross-modal cultural and tourism knowledge graph construction system based on a large language model

By using multimodal data preprocessing and visual semantic anchoring techniques, combined with a visual evidence verification mechanism, the problems of shallow cross-modal associations and low accuracy in cultural and tourism knowledge graphs have been solved, and a high-precision cross-modal knowledge graph has been constructed, supporting fine-grained retrieval and reasoning.

CN122242686A · Pending · Publication Date: 2026-06-19 · HANGZHOU TIANMAI NETWORK

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
HANGZHOU TIANMAI NETWORK
Filing Date
2026-03-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In the construction of existing cultural and tourism knowledge graphs, the deep semantic understanding of image data is insufficient, resulting in shallow cross-modal associations. Furthermore, the knowledge graphs generated by large language models have low accuracy and lack automated error correction mechanisms.

Method used

By preprocessing multimodal data, visual semantic anchoring, knowledge generation and verification, and dynamic graph updates, deep fusion of text and images is achieved. Visual evidence is used to verify and correct the consistency of knowledge generated by the large language model, thus constructing a high-precision cross-modal knowledge graph.

Benefits of technology

It achieves precise mapping between text attributes and image pixel regions, improving the accuracy and reliability of knowledge graphs and supporting fine-grained cross-modal retrieval and reasoning.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a cross-modal cultural tourism knowledge graph construction system based on a large language model, comprising the following steps: S1: Data collection, text data cleaning and sentence segmentation, and image data normalization; S2: Inputting the preprocessed text sentences into the large language model, identifying and extracting entities and their corresponding visual attribute keywords, extracting image features using a visual encoder, calculating the weight distribution of text semantic vectors and image feature maps through a cross-modal attention mechanism, anchoring and aligning specific pixel regions, and outputting structured data containing entities, attribute keywords, and image region coordinates; S3: Generating candidate knowledge triples using the large language model, searching for the existence of visual anchors corresponding to the candidate knowledge triples, and verifying the consistency of the candidate knowledge triples using visual features; S4: Constructing a cross-modal cultural tourism knowledge graph containing visual nodes.

Description

Technical Field

[0001] This invention relates to the field of knowledge graph technology for smart cultural tourism.

Background Technology

[0002] With the rapid development of smart tourism, building knowledge graphs containing rich semantic information has become key to improving service quality.

[0003] However, existing technologies have clear shortcomings. On the one hand, existing cultural and tourism knowledge graph construction focuses on text processing, and image data is usually just attached as an attribute of an entity without any deep semantic understanding of the image content. The resulting cross-modal associations are shallow and cannot meet the needs of fine-grained visual retrieval. On the other hand, when large language models are used for knowledge extraction, they are prone to hallucination, that is, generating knowledge that does not conform to objective facts. Existing technologies lack mechanisms for automated error correction using multimodal information, so the accuracy of the constructed knowledge graph is low and data quality is difficult to guarantee.

[0004] Therefore, there is an urgent need to develop a cross-modal cultural tourism knowledge graph construction system based on a large language model to solve the problems in the existing technology.

Summary of the Invention

[0005] The purpose of this invention is to provide a cross-modal cultural tourism knowledge graph construction system based on a large language model that achieves deep fusion of text and images, solves the problem of shallow cross-modal association, effectively suppresses hallucinations of the large language model, and is simple in structure and easy to use, thereby solving the problems mentioned in the background art.

[0006] To achieve the above objectives, the present invention provides the following technical solution: a cross-modal cultural tourism knowledge graph construction system based on a large language model, including the following steps:

S1: Multimodal data preprocessing. Text and image data in the cultural and tourism field are collected, the text data is cleaned and segmented, and the image data is normalized to obtain a standardized set of text sentences and a standardized image set.

S2: Visual semantic anchoring. The preprocessed text sentences are input into the large language model to identify and extract entities and their corresponding visual attribute keywords. Image features are extracted using a visual encoder. The weight distribution of the text semantic vector over the image feature map is calculated through a cross-modal attention mechanism. The text attribute keywords are anchored and aligned with specific pixel regions in the image. The output is structured data containing entities, attribute keywords, and image region coordinates.

S3: Knowledge generation and verification. The large language model is used to extract knowledge from the text data to generate candidate knowledge triples. The structured data output by S2 is searched for visual anchors corresponding to the candidate knowledge triples. The candidate knowledge triples are verified for consistency using visual features. When the verification result is inconsistent, the candidate knowledge triples are corrected based on visual evidence to obtain the verified knowledge triples.

S4: Dynamic graph update. The verified knowledge triples and the corresponding image region coordinates and visual feature vectors are stored in the graph database to construct a cross-modal cultural tourism knowledge graph containing visual nodes.

[0007] By adopting the above technical solution, a fine-grained association between text attributes and image regions is established through visual semantic anchoring technology. Visual evidence is used to verify and correct the knowledge generated by the large language model, thereby constructing a high-precision knowledge graph with cross-modal retrieval capabilities.

[0008] As a further aspect of the present invention: S1 specifically includes: The collected text data is processed to remove tags and garbled characters, and then segmented into sentences using a word segmentation tool to identify key sentences containing entity attribute descriptions. The acquired image data is deduplicated, and the image size is normalized and adjusted to meet the input standards of the visual model.

[0009] By adopting the above technical solutions, the text and image data are cleaned, deduplicated, and standardized, eliminating noise interference in the original data and unifying the data format, thus providing a standardized input data foundation for the efficient operation of subsequent large language and visual models.

[0010] As a further aspect of the present invention: In step S2, the specific process of extracting image features using a visual encoder and performing anchoring alignment through a cross-modal attention mechanism is as follows: A deep residual network is used as the visual encoder to extract image feature maps; a prompt is constructed as input to the large language model, driving the model to output entity names and attribute keywords describing the visual appearance features of the entities; with the text semantic vector as the query vector and the image feature map as the key vector and value vector, the bounding box coordinates of the text attribute keywords in the image are obtained by calculating the similarity between the query vector and the key vector and normalizing it.

[0011] By adopting the above technical solution, and through the combination of deep residual networks and cross-modal attention mechanisms, the association weights between text semantics and image features are accurately calculated, achieving pixel-level localization of text attribute keywords in images, and solving the problem that traditional methods can only perform image-level coarse-grained association.

[0012] As a further aspect of the present invention, the training process of the visual encoder is as follows: Construct a training sample set, which includes image data, descriptive text data, and manually annotated bounding box coordinates of visual feature regions; the training samples are input into the model, and the localization loss between the bounding boxes predicted by the model and the ground truth bounding boxes is calculated. The localization loss is a weighted combination of the L1 loss function and the generalized intersection-over-union (GIoU) loss function. The optimizer iteratively updates the model parameters based on the localization loss until the loss function converges.

[0013] By adopting the above technical solution, constructing a training sample set containing manually labeled information, and using a combined loss function to iteratively train the visual encoder, the model is able to accurately understand text instructions and locate the corresponding regions in the image, thus ensuring the robustness of visual semantic anchoring.

[0014] As a further aspect of the present invention: the balance coefficient in the localization loss is set to 1.5; the initial learning rate is set to 0.0001.

[0015] By adopting the above technical solution and specifically setting the balance coefficient and training learning rate in the localization loss function, the gradient descent path in the model training process is optimized, which improves the accuracy of visual region localization while ensuring the model convergence speed.

[0016] As a further aspect of the present invention: In step S3, the specific process of using visual features to verify the consistency of candidate knowledge triples is as follows: For the generated candidate knowledge triples, search whether there is corresponding visual anchor data; A contrastive language-image pre-trained model is used to encode the text attributes and image features of the anchored regions in the candidate knowledge triples, and the semantic similarity between the two is calculated. The calculated semantic similarity is compared with a preset confidence threshold. If the semantic similarity is higher than the first preset threshold, the verification is deemed successful; if the semantic similarity is lower than the second preset threshold, a conflict is deemed to exist and a correction mechanism is triggered.

[0017] By adopting the above technical solution and introducing a contrastive language-image pre-trained model to calculate semantic similarity, the abstract knowledge verification process is transformed into a quantifiable numerical comparison, thereby realizing the automated verification of subjective text generation results using objective visual features.

[0018] As a further aspect of the present invention: the first preset threshold is set to 0.85; the second preset threshold is set to 0.4.

[0019] By adopting the above technical solution and setting specific first and second preset thresholds, a clear judgment boundary is provided for knowledge verification, ensuring the rigor of the verification mechanism and effectively identifying and filtering out low-confidence hallucinated knowledge.

[0020] As a further aspect of the present invention: the correction mechanism specifically comprises: When a conflict is detected, a correction instruction containing the conflict information is generated; the correction instruction is fed back to the large language model, which then regenerates or deletes the attribute based on visual evidence and outputs the corrected knowledge triple.

[0021] By adopting the above technical solution, a closed loop for reverse correction of textual knowledge based on visual evidence is constructed. When knowledge conflict is detected, it can drive the large language model to self-correct based on objective facts, which significantly reduces the error rate in the knowledge graph.

[0022] As a further aspect of the present invention: S4 specifically includes: Align the entities in the validated knowledge triples with the existing nodes in the graph database; if a node does not exist, create a new node. The image region coordinates and visual feature vectors generated by S2 are stored as node attributes. Construct and update a multimodal index for entities, which supports retrieving image regions by text keywords and retrieving text knowledge by image regions.

[0023] By adopting the above technical solution, and storing image region coordinates and visual feature vectors in the graph database, the limitation of traditional graphs that only store text information is broken, and knowledge graphs are endowed with the ability to support fine-grained visual retrieval and reasoning.

[0024] As a further aspect of the present invention, it also includes: The multimodal data preprocessing module is used to collect and process text and image data in the cultural and tourism field, and output a standardized dataset. The visual semantic anchoring module is used to extract visual attribute keywords using a large language model and to achieve anchoring alignment between text attributes and image regions through a visual encoder and a cross-modal attention mechanism. The knowledge generation and verification module is used to generate candidate knowledge triples and to perform consistency verification and correction using the data output by the visual semantic anchoring module. The graph dynamic update module is used to store verified knowledge and visual features into the graph database, thereby completing the construction of a cross-modal cultural tourism knowledge graph.

[0025] By adopting the above technical solutions, the composition and logical relationship of each functional module are defined at the system architecture level, and the methods and steps are transformed into specific execution units, ensuring the collaborative operation and automated operation of each link of the cross-modal knowledge graph construction system.

[0026] Compared with the prior art, the beneficial effects of the present invention are: This invention uses visual semantic anchoring technology to accurately map text attributes to image pixel regions, solving the problem of shallow cross-modal association in existing technologies and realizing fine-grained cross-modal retrieval and knowledge association. This invention also uses a visual feature verification mechanism to verify and correct the consistency of textual knowledge generated by a large language model against objective image evidence. This effectively suppresses hallucination in large models and significantly improves the accuracy and credibility of knowledge graphs.

[0027] Other features and advantages of the present invention will be disclosed in detail in the following detailed description and accompanying drawings.

Attached Figure Description

[0028] Figure 1 is a schematic diagram of the steps in a cross-modal cultural tourism knowledge graph construction system based on a large language model, according to an embodiment of the present invention.

Detailed Implementation

[0029] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0030] As shown in Figure 1, the cross-modal cultural tourism knowledge graph construction system based on a large language model in this embodiment includes a multimodal data preprocessing module, a visual semantic anchoring module, a knowledge generation and verification module, and a dynamic graph update module, wherein:

S1: Multimodal data preprocessing. The system first collects raw data from the cultural and tourism sector through the multimodal data preprocessing module.

[0031] For text data, the system removes HTML tags and garbled characters, splits the text into sentences using a word segmentation tool, and selects the key sentences that contain entity attribute descriptions.

[0032] For image data, the system performs deduplication to eliminate redundancy and normalizes image dimensions, for example, by adjusting the shorter side of the image to a fixed pixel size to meet the input standards of subsequent visual models. After this step, the system outputs a standardized set of text sentences and an image set.
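The S1 operations can be illustrated with a minimal standard-library sketch. It is not the applicant's implementation: the cue-word list `ATTRIBUTE_CUES`, the regex-based sentence splitter (standing in for the unnamed word segmentation tool), the hash-based exact-duplicate check, and the 224-pixel target are all assumptions made for illustration.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Split after Latin end marks followed by whitespace, or directly after CJK end marks.
SENT_RE = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
# Hypothetical cue words suggesting that a sentence describes an entity's appearance.
ATTRIBUTE_CUES = ("roof", "eaves", "color", "shape", "carved", "pillar")


def clean_text(raw: str) -> str:
    """Remove HTML tags, decode entities, drop non-printable characters."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def key_sentences(text: str) -> list[str]:
    """Split into sentences and keep those containing an attribute cue word."""
    sentences = [s.strip() for s in SENT_RE.split(text) if s and s.strip()]
    return [s for s in sentences if any(cue in s.lower() for cue in ATTRIBUTE_CUES)]


def dedupe_images(blobs: list[bytes]) -> list[bytes]:
    """Drop byte-identical images (assumption: exact duplicates only)."""
    seen, unique = set(), []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique


def resize_shorter_side(width: int, height: int, target: int = 224) -> tuple[int, int]:
    """Return the new (width, height) after scaling the shorter side to `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


raw = "<p>Yellow Crane Tower has a roof with upturned eaves. Tickets are sold online.</p>"
print(key_sentences(clean_text(raw)))
print(resize_shorter_side(640, 480))
```

Near-duplicate images (resized or recompressed copies) would need a perceptual hash or embedding comparison, which the source does not specify.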

[0033] S2: Visual semantic anchoring. This step is the core of this embodiment and aims to establish a fine-grained association between text and images.

[0034] S21: The system inputs the preprocessed text sentences into the large language model.

[0035] Based on the constructed prompt, the model identifies and outputs entity names and attribute keywords describing the visual appearance features of the entities. For example, from the sentence "Yellow Crane Tower has a roof with upturned eaves" it extracts the entity "Yellow Crane Tower" and the attribute keyword "upturned eaves".
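One way the prompt construction and output parsing of S21 might look is sketched below. The prompt wording, the JSON reply format, and the function names `build_prompt` and `parse_extraction` are hypothetical; the source does not disclose its prompt or output schema, and no real model is called here (the reply is a canned string).

```python
import json

# Hypothetical prompt template; the patent does not disclose the actual wording.
PROMPT_TEMPLATE = (
    "Extract the entity and the keywords describing its visual appearance "
    "from the sentence below. Answer with JSON of the form "
    '{{"entity": "...", "visual_attributes": ["..."]}}.\n'
    "Sentence: {sentence}"
)


def build_prompt(sentence: str) -> str:
    """Fill the template with one preprocessed sentence."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def parse_extraction(reply: str) -> tuple[str, list[str]]:
    """Parse the model's JSON reply into (entity, attribute keywords)."""
    data = json.loads(reply)
    entity = data["entity"].strip()
    attributes = [a.strip() for a in data.get("visual_attributes", []) if a.strip()]
    return entity, attributes


prompt = build_prompt("Yellow Crane Tower has a roof with upturned eaves")
# Canned reply standing in for a large language model response.
reply = '{"entity": "Yellow Crane Tower", "visual_attributes": ["upturned eaves"]}'
print(parse_extraction(reply))
```

Asking for a fixed JSON shape keeps the downstream anchoring step independent of the model's free-text phrasing; a production system would also need to handle malformed replies.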

[0036] S22: The system uses a deep residual network as a visual encoder to extract features from the image set and generate image feature maps.

[0037] S23: The system utilizes a cross-modal attention mechanism for anchoring alignment.

[0038] In this process, the text semantic vector is used as the query vector, and the image feature map is used as the key vector and value vector. The system calculates the similarity between the query vector and the key vector and performs normalization processing to determine the bounding box coordinates of the text attribute keywords in the image.
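The query/key similarity and normalization described here can be sketched in plain Python. The scaled dot product and softmax are the usual attention formulation and are assumed rather than confirmed by the source. The source also does not say how attention weights become bounding box coordinates, so the thresholding heuristic in `weights_to_box` (keep cells with at least half the peak weight) is purely an illustrative assumption, as are the toy vectors.

```python
import math


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product similarity between one text query vector and the
    feature vector of every grid cell, normalized with softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weights_to_box(weights, grid_w, grid_h, cell_px, keep_ratio=0.5):
    """Pixel box (x1, y1, x2, y2) enclosing cells whose weight is at least
    keep_ratio times the peak weight (hypothetical heuristic)."""
    assert len(weights) == grid_w * grid_h
    cutoff = keep_ratio * max(weights)
    cells = [(i % grid_w, i // grid_w) for i, w in enumerate(weights) if w >= cutoff]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (min(xs) * cell_px, min(ys) * cell_px,
            (max(xs) + 1) * cell_px, (max(ys) + 1) * cell_px)


# Toy 2x2 feature map flattened row by row; cell 1 matches the query best.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [5.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights = attention_weights(query, keys)
print(weights_to_box(weights, grid_w=2, grid_h=2, cell_px=16))
```

A trained grounding model would more likely regress the box with a dedicated head; the threshold approach is used here only because it makes the link between attention weights and pixel regions easy to follow.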

[0039] In order to ensure the positioning accuracy of the visual encoder, the system needs to pre-train it.

[0040] The training process is as follows: 1. Construct a training sample set containing image data, descriptive text data, and manually annotated coordinates of visual feature region bounding boxes.

[0041] 2. Input the training samples into the model and calculate the localization loss between the bounding boxes predicted by the model and the ground truth bounding boxes.

[0042] 3. The localization loss is calculated as a weighted combination of the L1 loss function and the generalized intersection-over-union (GIoU) loss function.

[0043] 4. Use the optimizer to iteratively update the model parameters based on the localization loss until the loss function converges.

[0044] In one specific implementation, the balance coefficient in the localization loss is set to 1.5, and the initial learning rate is set to 0.0001 to optimize the training effect.
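The localization loss can be made concrete with a short sketch. The source gives the balance coefficient (1.5) but not which term it weights, so applying it to the L1 term is an assumption, as is averaging the L1 error over the four coordinates. Boxes are assumed to be `(x1, y1, x2, y2)` with positive area.

```python
BALANCE = 1.5  # balance coefficient from the source; which term it scales is assumed


def l1_loss(pred, gt):
    """Mean absolute error over the four box coordinates."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / 4.0


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def localization_loss(pred, gt, balance=BALANCE):
    """Weighted combination: balance * L1 + (1 - GIoU)."""
    return balance * l1_loss(pred, gt) + (1.0 - giou(pred, gt))


print(localization_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(localization_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike plain IoU, the GIoU term still changes when the boxes do not overlap, because the enclosing-box penalty shrinks as the prediction moves toward the target. In training, these quantities would be computed on tensors and optimized with the stated initial learning rate of 0.0001.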

[0045] S24: This module outputs structured data containing entities, attribute keywords, and image region coordinates.

[0046] S3: Knowledge generation and verification. This step utilizes large language models to generate knowledge and incorporates visual evidence for quality control.

[0047] S31: The system uses a large language model to extract knowledge from the text data and generate candidate knowledge triples.

[0048] S32: The system searches the structured data output by step S2 for visual anchor data corresponding to each candidate knowledge triple.

[0049] S33: The system uses a contrastive language-image pre-trained model to encode the text attributes in the candidate knowledge triples and the image features of the anchored regions, and calculates the semantic similarity between them.

[0050] The system compares the calculated semantic similarity with a preset confidence threshold for consistency verification.

[0051] In one specific implementation, the first preset threshold is set to 0.85. If the semantic similarity is higher than this value, the verification is deemed successful. The second preset threshold is set to 0.4. If the semantic similarity is lower than this value, a conflict is deemed to exist and a correction mechanism is triggered.
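The two-threshold decision can be sketched as follows. Cosine similarity is assumed as the similarity measure, since the source does not state how the score is computed or scaled. The source also does not say what happens to scores between 0.4 and 0.85, so the "undetermined" outcome is a placeholder, and the toy vectors stand in for embeddings from a contrastive language-image model.

```python
import math

PASS_THRESHOLD = 0.85      # first preset threshold from the source
CONFLICT_THRESHOLD = 0.4   # second preset threshold from the source


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def verify(text_embedding: list[float], region_embedding: list[float]) -> str:
    """Classify a candidate triple by comparing its text attribute embedding
    with the embedding of the anchored image region."""
    similarity = cosine_similarity(text_embedding, region_embedding)
    if similarity > PASS_THRESHOLD:
        return "verified"
    if similarity < CONFLICT_THRESHOLD:
        return "conflict"       # would trigger the correction mechanism
    return "undetermined"       # handling of this band is not specified in the source


print(verify([1.0, 0.0], [1.0, 0.1]))
print(verify([1.0, 0.0], [0.0, 1.0]))
print(verify([1.0, 0.0], [1.0, 1.0]))
```

Leaving a gap between the two thresholds avoids both accepting and rewriting triples on weak evidence; a real deployment would need a policy for that middle band, such as keeping the triple with a lower confidence flag.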

[0052] The correction mechanism works as follows: when a conflict is determined, the system generates a correction instruction containing the conflict information and feeds it back to the large language model, which regenerates or deletes the attribute based on the visual evidence and finally outputs the corrected knowledge triple.
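A minimal sketch of this feedback loop is shown below. The instruction wording, the `DELETE` convention, and the `llm` callable are hypothetical stand-ins; the source does not disclose the instruction format or how visual evidence is passed to the model.

```python
from typing import Callable, Optional

Triple = tuple[str, str, str]  # (subject, relation, object)


def build_correction_instruction(triple: Triple, similarity: float, box: tuple) -> str:
    """Describe the conflict so the language model can revise the attribute."""
    subject, relation, obj = triple
    return (
        f"The statement ({subject}, {relation}, {obj}) conflicts with the image "
        f"region {box}: text-image similarity is {similarity:.2f}. "
        "Reply with a corrected attribute value, or DELETE if the attribute "
        "is not supported by the image."
    )


def correct_triple(triple: Triple, similarity: float, box: tuple,
                   llm: Callable[[str], str]) -> Optional[Triple]:
    """Send the correction instruction to the model and apply its reply.
    Returns None when the model asks for the attribute to be deleted."""
    reply = llm(build_correction_instruction(triple, similarity, box)).strip()
    if reply.upper() == "DELETE":
        return None
    return (triple[0], triple[1], reply)


candidate = ("Yellow Crane Tower", "has_roof_style", "flat roof")
# Stub standing in for a real large language model call.
stub_llm = lambda prompt: "upturned eaves"
print(correct_triple(candidate, 0.21, (16, 0, 32, 16), stub_llm))
```

Passing the model in as a callable keeps the correction logic testable without network access. A corrected triple would normally be sent through verification again before it is stored.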

[0053] S4: Dynamic graph update. This step is responsible for storing the verified knowledge into the graph database, thus completing the graph construction.

[0054] The system aligns the entities in the verified knowledge triples with the existing nodes in the graph database; if a node does not exist, it creates a new node.

[0055] Unlike traditional storage methods, the system stores the image region coordinates and visual feature vectors generated in step S2 as attributes of the nodes.

[0056] Simultaneously, the system constructs and updates a multimodal index for entities, which supports retrieving image regions by text keywords and retrieving text knowledge by image regions, thereby realizing the construction and application of a cross-modal cultural tourism knowledge graph.
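The S4 storage and indexing logic can be illustrated with an in-memory stand-in for the graph database. The class and method names, the name-normalization rule used for entity alignment, and the sample identifiers are assumptions; the source names neither a specific graph database nor an alignment method.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    facts: list = field(default_factory=list)    # (relation, object) pairs
    regions: list = field(default_factory=list)  # visual anchors stored as node attributes


class CrossModalGraph:
    """Toy in-memory graph with a two-way text/image-region index."""

    def __init__(self):
        self.nodes = {}
        self.keyword_index = defaultdict(list)  # keyword -> [(image_id, box)]
        self.region_index = defaultdict(list)   # (image_id, box) -> [triple]

    def upsert(self, triple, keyword, image_id, box, feature):
        subject, relation, obj = triple
        key = subject.strip().lower()  # assumed alignment rule: normalized name match
        if key not in self.nodes:      # create the node only if it does not exist
            self.nodes[key] = Node(name=subject.strip())
        node = self.nodes[key]
        if (relation, obj) not in node.facts:
            node.facts.append((relation, obj))
        node.regions.append({"keyword": keyword, "image_id": image_id,
                             "box": box, "feature": feature})
        self.keyword_index[keyword.lower()].append((image_id, box))
        self.region_index[(image_id, box)].append(triple)

    def regions_for(self, keyword):
        """Text keyword -> image regions."""
        return list(self.keyword_index.get(keyword.lower(), []))

    def facts_for(self, image_id, box):
        """Image region -> text knowledge."""
        return list(self.region_index.get((image_id, box), []))


graph = CrossModalGraph()
graph.upsert(("Yellow Crane Tower", "has_roof_style", "upturned eaves"),
             "upturned eaves", "img_001", (16, 0, 32, 16), [0.1, 0.9])
print(graph.regions_for("upturned eaves"))
print(graph.facts_for("img_001", (16, 0, 32, 16)))
```

In a real graph database the region coordinates and feature vectors would be node properties, and region lookup would more plausibly use overlap or nearest-neighbor search on the feature vectors than the exact box match used here.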

[0057] This invention provides a cross-modal cultural tourism knowledge graph construction system based on a large language model, which can achieve deep fusion of text and images, solve the problem of shallow cross-modal associations, and effectively suppress hallucinations of the large language model.

[0058] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.

[0059] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

Claims

1. A cross-modal cultural tourism knowledge graph construction system based on a large language model, characterized in that, Includes the following steps: S1: Multimodal data preprocessing, collecting text and image data in the cultural and tourism field, cleaning and segmenting the text data, and normalizing the image data to obtain a standardized set of text sentences and image sets; S2: Visual semantic anchoring. The preprocessed text sentence is input into the large language model to identify and extract entities and their corresponding visual attribute keywords. Image features are extracted using a visual encoder. The weight distribution of the text semantic vector and the image feature map is calculated through a cross-modal attention mechanism. The text attribute keywords are anchored and aligned with specific pixel regions in the image. The output is structured data containing entities, attribute keywords and image region coordinates. S3: Knowledge generation and verification. The large language model is used to extract knowledge from the text data to generate candidate knowledge triples. The structured data output by S2 is searched to see if there are visual anchors corresponding to the candidate knowledge triples. The candidate knowledge triples are verified for consistency using visual features. When the verification result is inconsistent, the candidate knowledge triples are corrected based on visual evidence to obtain the verified knowledge triples. S4: Dynamic graph update, storing verified knowledge triples and corresponding image region coordinates and visual feature vectors into the graph database to construct a cross-modal cultural tourism knowledge graph containing visual nodes.

2. The cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 1, characterized in that, S1 specifically includes: The collected text data is processed to remove tags and garbled characters, and then segmented into sentences using a word segmentation tool to identify key sentences containing entity attribute descriptions. The acquired image data is deduplicated, and the image size is normalized and adjusted to meet the input standards of the visual model.

3. The cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 1, characterized in that, In step S2, the specific process of extracting image features using a visual encoder and performing anchoring alignment through a cross-modal attention mechanism is as follows: A deep residual network is used as the visual encoder to extract image feature maps; a prompt is constructed as input to the large language model, driving the model to output entity names and attribute keywords describing the visual appearance features of the entities; with the text semantic vector as the query vector and the image feature map as the key vector and value vector, the bounding box coordinates of the text attribute keywords in the image are obtained by calculating the similarity between the query vector and the key vector and normalizing it.

4. The cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 3, characterized in that, The training process of the visual encoder is as follows: Construct a training sample set, which includes image data, descriptive text data, and manually annotated bounding box coordinates of visual feature regions; the training samples are input into the model, and the localization loss between the bounding boxes predicted by the model and the ground truth bounding boxes is calculated. The localization loss is a weighted combination of the L1 loss function and the generalized intersection-over-union (GIoU) loss function. The optimizer iteratively updates the model parameters based on the localization loss until the loss function converges.

5. The cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 4, characterized in that, The balance coefficient in the localization loss is set to 1.5; the initial learning rate is set to 0.0001.

6. The cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 1, characterized in that, In step S3, the specific process of using visual features to verify the consistency of candidate knowledge triples is as follows: For the generated candidate knowledge triples, search whether there is corresponding visual anchor data; The contrastive language-image pre-trained model is used to encode the text attributes and image features of the anchored regions in the candidate knowledge triples, and the semantic similarity between the two is calculated. The calculated semantic similarity is compared with a preset confidence threshold. If the semantic similarity is higher than the first preset threshold, the verification is deemed successful. If the semantic similarity is lower than the second preset threshold, a conflict is determined and a correction mechanism is triggered.

7. A cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 6, characterized in that, The first preset threshold is set to 0.85; the second preset threshold is set to 0.4.

8. A cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 6, characterized in that, The correction mechanism is specifically as follows: When a conflict is detected, a correction instruction containing conflict information is generated; The correction instructions are fed back to the large language model, which then regenerates or deletes the attribute based on visual evidence, and outputs the corrected knowledge triplet.

9. A cross-modal cultural tourism knowledge graph construction system based on a large language model according to claim 1, characterized in that, S4 specifically includes: Align the entities in the validated knowledge triples with the existing nodes in the graph database; if a node does not exist, create a new node. The image region coordinates and visual feature vectors generated by S2 are stored as node attributes. Construct and update a multimodal index for entities, which supports retrieving image regions by text keywords and retrieving text knowledge by image regions.

10. A cross-modal cultural tourism knowledge graph construction system based on a large language model according to any one of claims 1-9, characterized in that, Also includes: The multimodal data preprocessing module is used to collect and process text and image data in the cultural and tourism field, and output a standardized dataset. The visual semantic anchoring module is used to extract visual attribute keywords using a large language model and to achieve anchoring alignment between text attributes and image regions through a visual encoder and a cross-modal attention mechanism. The knowledge generation and verification module is used to generate candidate knowledge triples and to perform consistency verification and correction using the data output by the visual semantic anchoring module. The graph dynamic update module is used to store verified knowledge and visual features into the graph database, thereby completing the construction of a cross-modal cultural tourism knowledge graph.