A Mongolian cross-modal content semantic retrieval and intent matching system

The Mongolian cross-modal content semantic retrieval and intent matching system solves the problems of multiple writing systems and non-standardization of ancient book images in Mongolian, and realizes efficient retrieval and accurate matching of Mongolian multimodal cultural resources, adapting to the intelligent service needs under low resource conditions.

CN122240812A · Pending · Publication Date: 2026-06-19 · INNER MONGOLIA MENKSOFT SOFTWARE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INNER MONGOLIA MENKSOFT SOFTWARE
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing cross-modal retrieval technologies suffer from problems such as semantic fragmentation across multiple writing systems, non-standardization of ancient book images, and insufficient representation of culturally loaded words when processing Mongolian texts, making it difficult to meet the needs for efficient retrieval of multimodal cultural resources in Mongolian.

Method used

A Mongolian cross-modal content semantic retrieval and intent matching system was designed, including a Mongolian multimodal preprocessing module, a Mongolian cross-modal semantic mapping module, a Mongolian multimodal semantic feature library, a Mongolian retrieval intent-specific parsing module, and a cross-modal semantic matching module. Through multi-writing system representation alignment, ancient book image processing, dialect speech standardization, culturally enhanced semantic mapping, and dynamic differential thresholding strategies, cross-modal semantic unification and accurate matching are achieved.

Benefits of technology

It solves the problems of semantic fragmentation in the multiple writing systems of Mongolian and non-standardization of ancient book images, improves the semantic representation ability of culturally loaded words, realizes efficient retrieval and accurate matching of multimodal cultural resources in Mongolian, and adapts to the intelligent service needs under low resource conditions.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of Mongolian information processing and cross-modal semantic retrieval technology, and discloses a Mongolian cross-modal content semantic retrieval and intent matching system. The system employs a Mongolian multimodal preprocessing module to normalize and structure texts from multiple writing systems, ancient book images, and dialect speech; a Mongolian cross-modal semantic mapping module, based on a pre-trained model enhanced with cultural knowledge, maps multimodal features to a unified semantic space; a Mongolian multimodal semantic feature library stores and indexes semantic vectors; a dedicated Mongolian retrieval intent parsing module identifies cultural intent and encodes query vectors; a cross-modal semantic matching module uses a dynamic differentiated threshold strategy for similarity calculation and matching; and a Mongolian scene feedback optimization module collects deep interaction data and updates model parameters and the rule base through incremental learning. The system achieves accurate and intelligent retrieval of Mongolian multimodal cultural resources, significantly improving the retrieval accuracy of Mongolian content.

Description

Technical Field

[0001] This invention relates to the field of cross-modal retrieval technology, and in particular to a Mongolian cross-modal content semantic retrieval and intent matching system.

Background Technology

[0002] Currently, with the rapid development of digital technology, multimodal data is experiencing explosive growth, giving rise to cross-modal retrieval technology. Existing cross-modal retrieval methods typically use deep learning models to map data from different modalities to the same semantic space, thereby enabling mutual retrieval between different modalities such as text, images, and speech. Simultaneously, information processing technologies for the Mongolian language have also developed, mainly including Mongolian speech recognition, text translation, and dialogue systems. These existing technologies have promoted the development of Mongolian information processing to some extent, but most are geared towards standardized modern Mongolian, primarily addressing the recognition and conversion of single modalities. Furthermore, they rely on large-scale labeled data and struggle to adapt to the challenges of multiple Mongolian writing systems, the complex forms of ancient texts, the profound semantics of culturally loaded words, and data scarcity. Therefore, they cannot meet the demand for efficient retrieval of multimodal Mongolian cultural resources.

[0003] The core problem with the aforementioned existing technologies is that they fail to provide a cross-modal retrieval scheme that fully adapts to the characteristics of Mongolian script. Specifically, they lack effective solutions to key issues such as semantic fragmentation caused by multiple writing systems of Mongolian script, processing failures caused by non-standardization of ancient book images, matching deviations caused by dialectal phonetic variations, insufficient semantic representation of culturally loaded words, and difficulties in parsing users' unique search intentions.

[0004] Publication No. CN113515952B discloses a joint modeling method for Mongolian dialogue models. This method classifies original Mongolian sentences and determines dialogue scenarios by establishing a dictionary database, a grammar rule database, a dialogue scenario classification model, and a target language model, thereby achieving translation between Mongolian and the target language. This technology provides a solution to the problems of dialogue scenario classification and fuzzy matching in Mongolian, but its core lies in dialogue translation rather than cross-modal retrieval. Furthermore, it primarily processes standard modern Mongolian text and speech, without addressing the semantic unification issues of multiple Mongolian writing systems, nor providing processing solutions for the special forms of ancient Mongolian text images.

[0005] Publication No. CN114780777B discloses a cross-modal retrieval method based on semantic enhancement. This method extracts multi-layer semantic information from images and text, including instance-level, relation-level, and attribute-level semantics, and uses this information to enhance the feature representation of multimodal data, achieving fine-grained cross-modal alignment based on multi-layer semantics. This technology effectively improves retrieval accuracy in general image-text retrieval, but it has three shortcomings for Mongolian. First, its model training relies on large-scale image-text alignment data and is not optimized for the low-resource characteristics of Mongolian. Second, it lacks the ability to process multiple Mongolian writing systems and cannot achieve semantic alignment of synonyms and variant texts. Third, it does not introduce an enhancement mechanism for culturally loaded words, so its semantic representation of cultural entities such as historical figures and ancient book terms in Mongolian is insufficient, making it difficult to meet the accuracy requirements of cultural retrieval.

Summary of the Invention

[0006] The technical problem this invention aims to solve is that existing cross-modal retrieval technologies suffer from core issues when processing Mongolian text, such as semantic fragmentation across multiple writing systems, non-standardization of ancient book images, and insufficient representation of culturally loaded words. To address these issues, we propose a Mongolian cross-modal content semantic retrieval and intent matching system.

[0007] To achieve the above objectives, this application adopts the following technical solution: a Mongolian cross-modal content semantic retrieval and intent matching system, comprising: The Mongolian multimodal preprocessing module is used to perform Mongolian-specific normalization and structuring processing on the input raw data. The raw data includes texts in multiple writing systems such as traditional Mongolian, Cyrillic Mongolian, and Tod, images of ancient Mongolian books, and Mongolian dialect speech. The normalization and structuring processing includes semantic alignment processing that is independent of the writing system of the texts, enhancement and character recognition processing of images of ancient Mongolian books, and transcription and semantic normalization processing of Mongolian dialect speech.

[0008] The Mongolian cross-modal semantic mapping module, connected to the preprocessing module, is used to encode and map multimodal features of text, images, and speech to a unified low-dimensional semantic space.

[0009] The Mongolian multimodal semantic feature library, connected to the semantic mapping module, is used to store and index the semantic feature vectors of all content generated by the module.

[0010] The Mongolian search intent parsing module is used to receive and parse the user's search request and encode it into a query vector.

[0011] The cross-modal semantic matching module is connected to the semantic feature library and the intent parsing module, respectively. It is used to calculate and match the similarity between the query vector and the multimodal semantic vectors in the feature library, and output a list of multimodal retrieval results.

[0012] Preferably, the Mongolian multimodal preprocessing module includes a multi-writing system representation alignment submodule. This submodule is specifically designed to address the problem that, in cross-modal retrieval scenarios, the same Mongolian semantic entity is scattered across the vector space because of differences in writing systems.

[0013] The submodule is trained through a contrastive learning mechanism, which employs a triplet loss function or a contrastive loss function. Its optimization objective is to bring the vector representations of text pairs with the same Mongolian semantic content but different writing systems closer together in the shared Mongolian semantic subspace, while simultaneously widening the vector distance between text pairs with different semantic content.

[0014] The output of the submodule is a Mongolian semantic vector decoupled from the original writing form, which serves as the input of the Mongolian cross-modal semantic mapping module, ensuring that the same Mongolian content in different writing systems has a consistent semantic encoding starting point.

[0015] Preferably, the Mongolian multimodal preprocessing module includes a full-link processing submodule for Mongolian ancient book images: The full-link processing submodule is a processing unit that is sequentially integrated to address the characteristics of Mongolian ancient books, such as vertical layout, blurriness, incompleteness, and lack of punctuation. The processing unit sequentially performs image degradation correction, geometric correction and line segmentation, dedicated OCR recognition, and ancient book text structuring.

[0016] The processing unit includes a special character recognition model for Mongolian ancient books. The special character recognition model for Mongolian ancient books adopts a sequence recognition architecture based on deep learning. Its vocabulary is specifically expanded based on prior knowledge of variant characters and phonetic loan characters in Mongolian ancient books, and is used to recognize Mongolian ancient book characters containing variant characters and phonetic loan characters.

[0017] Furthermore, the processing unit also includes a Mongolian ancient text structuring model based on sequence labeling. The structuring model adopts a bidirectional long short-term memory network or a Transformer architecture and is trained on a dataset in which the punctuation and word segmentation boundaries of ancient texts are labeled. It is used to punctuate and segment the recognized long, unpunctuated texts in accordance with the grammar of Mongolian ancient texts.

[0018] Furthermore, the Mongolian ancient book image enhancement unit and semantic enhancement method also include: based on the characteristics of the connected strokes and direction of Mongolian characters, the geometric direction correction not only performs global rotation but also local deformation correction. By detecting the skeleton lines of the connected components of the characters and fitting curves, the local distortion field is calculated to reverse-repair the local character deformation caused by paper wrinkles. In the noise reduction process, a threshold segmentation algorithm adapted to the contrast features of Mongolian ink marks and background is used to enhance specific color channels of yellowed and faded ancient book backgrounds and dark ink marks to more accurately separate the strokes of Mongolian characters.

[0019] Furthermore, the semantic enhancement also includes integrating a visual-text association enhancer for Mongolian ancient book terms into the full-link processing submodule. After OCR recognition, the enhancer performs secondary matching and verification between the image region features of the identified suspected proper nouns from ancient books and the text description vectors of the corresponding terms in the Mongolian semantic knowledge base. If the confidence level is lower than the threshold, a context-based glyph review is triggered to improve the recognition accuracy of key culturally loaded words. The proper nouns from ancient books include tribal names and ancient place names.

[0020] Preferably, the Mongolian multimodal preprocessing module includes a Mongolian dialect transcription text semantic normalization submodule. The normalization submodule is specifically designed to eliminate the interference that Mongolian dialect lexical variations cause in cross-modal semantic matching, so as to achieve semantic alignment between dialect speech content and the standard Mongolian text library. The Mongolian dialects include the Oirat, Bargu, and Khalkha dialects, among others.

[0021] The normalization submodule is implemented through a sequence conversion model, which adopts an encoder-decoder architecture. At the encoding end, contextual information is fused, and at the decoding end, standard Mongolian written language is generated. The model is trained on parallel corpora of Mongolian dialect spoken language and standard Mongolian written language. Using contextual information, the transcribed text containing dialect feature words is converted into text that conforms to the standard Mongolian written language norm.

[0022] Preferably, the Mongolian cross-modal semantic mapping module adopts a dual-tower contrastive learning architecture, including a content tower and a query tower: The content tower comprises three parallel modality-specific coding branches: a text coding branch, an image coding branch, and a speech coding branch. The text coding branch employs a pre-trained language model enhanced with Mongolian cultural knowledge, the image coding branch uses a visual Transformer architecture, and the speech coding branch uses an acoustic feature extraction model. The outputs of each branch are mapped to the same dimension through a linear projection layer and then fused through a learnable weighted fusion layer to obtain the content vector.
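The projection-and-fusion step of the content tower can be illustrated with a minimal, dependency-free sketch. It assumes that the learnable fusion weights are softmax-normalized scalars, one per modality; the patent does not state the exact form of the fusion layer, and the function names, toy dimensions, and values below are hypothetical.

```python
import math

def linear(x, weight, bias):
    """Linear projection y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(branch_outputs, projections, fusion_logits):
    """Project each modality to the shared dimension, then take a weighted sum.

    branch_outputs: {modality: feature vector from that branch}
    projections:    {modality: (weight matrix, bias)} for the linear projection layer
    fusion_logits:  {modality: scalar}; assumed to be learnable parameters
    """
    names = list(branch_outputs)
    projected = [linear(branch_outputs[n], *projections[n]) for n in names]
    weights = softmax([fusion_logits[n] for n in names])
    dim = len(projected[0])
    return [sum(w * p[i] for w, p in zip(weights, projected)) for i in range(dim)]

# Toy example: branches with different output sizes mapped to a shared 2-D space.
branch_outputs = {"text": [3.0, 0.0, 9.0], "image": [0.0, 3.0], "speech": [3.0, 3.0]}
projections = {
    "text": ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    "image": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "speech": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
}
fusion_logits = {"text": 0.0, "image": 0.0, "speech": 0.0}  # equal weights before training

content_vec = fuse(branch_outputs, projections, fusion_logits)
```

With equal logits, the fused vector is the plain average of the three projected vectors. Training would shift the logits so that more reliable modalities contribute more.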

[0023] The query tower contains a text encoder with the same text encoding branch structure as the content tower, which takes the user query text as input and outputs a query vector.

[0024] The content tower and query tower are optimized through contrastive learning during the training phase, so that vectors of different modalities that carry the same semantic content are close to each other in the semantic space.

[0025] Preferably, the Mongolian cross-modal semantic mapping module includes a pre-trained language model enhanced with Mongolian cultural knowledge: The pre-trained language model is trained using a strategy that includes a Mongolian cultural term masking prediction task. The task involves randomly masking terms in the input text that belong to a pre-built Mongolian cultural entity library and using contextual information to train the model to predict the masked entities. This injects structured knowledge of Mongolian historical and cultural entities into the model parameters, enabling the model to learn the semantic representation of cultural entities.
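The cultural term masking task can be sketched as a data-preparation step. The sketch below assumes that each cultural entity is a single token and that masking is controlled by a per-entity probability; the function name, the `[MASK]` token, the probability, and the English placeholder sentence are all hypothetical and not taken from the patent.

```python
import random

def mask_cultural_terms(tokens, entity_library, mask_prob=0.5,
                        mask_token="[MASK]", rng=None):
    """Randomly mask tokens that appear in the cultural entity library.

    Returns the masked token list and a {position: original token} dict
    that serves as the prediction target for the masked language model.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok in entity_library and rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

# Illustrative tokens only; a real pipeline would use Mongolian text
# and a tokenizer-aware entity matcher.
tokens = ["the", "epic", "Jangar", "is", "sung", "among", "the", "Oirat"]
entity_library = {"Jangar", "Oirat"}

all_masked, all_targets = mask_cultural_terms(tokens, entity_library, mask_prob=1.0)
none_masked, none_targets = mask_cultural_terms(tokens, entity_library, mask_prob=0.0)
```

Only tokens found in the entity library are ever masked, which is what distinguishes this task from ordinary random-token masking.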

[0026] The semantic mapping module uses a loss function during training that includes a Mongolian cultural semantic enhancement term. This term is used to identify shared Mongolian cultural entities between sample pairs during model training and thereby enhance their vector association strength in a unified semantic space.

[0027] Furthermore, the cross-modal self-supervised contrastive learning strategy specifically includes: Using Mongolian OCR text, standard transcribed text, and corresponding audio readings derived from the same Mongolian ancient text, image-text-speech triplets are constructed as positive sample pairs for self-supervised training.

[0028] The strategy employs a multi-stage training approach: In the first stage, a basic Mongolian pre-trained language model is trained using a masked language modeling task with a large-scale Mongolian single-modal corpus; in the second stage, the text encoder is fixed, and the image encoder and speech encoder are trained using the triplet data with a contrastive learning task, aligning their output vectors to the corresponding text vectors; in the third stage, all encoders are jointly fine-tuned, and a cross-modal alignment loss function for Mongolian cultural entities is introduced. When calculating negative samples within a batch, this loss function imposes a stricter penalty on negative sample pairs containing the same Mongolian cultural terms, forcing the model to learn deeper Mongolian multimodal semantic associations beyond shallow features.
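One possible reading of the third-stage loss is an InfoNCE-style objective in which in-batch negatives that share a cultural term with the anchor receive a larger weight in the denominator. The patent does not give the formula, so the sketch below is an assumption; `temperature`, `hard_weight`, and the function name are hypothetical.

```python
import math

def entity_weighted_info_nce(sim_row, positive_idx, shares_entity,
                             temperature=0.07, hard_weight=2.0):
    """InfoNCE loss for one anchor with up-weighted entity-sharing negatives.

    sim_row:       similarities between the anchor and every item in the batch
    positive_idx:  index of the true cross-modal match
    shares_entity: per-item flags, True if the item contains the same
                   cultural term as the anchor but is not its match
    """
    pos = math.exp(sim_row[positive_idx] / temperature)
    neg = 0.0
    for j, s in enumerate(sim_row):
        if j == positive_idx:
            continue
        weight = hard_weight if shares_entity[j] else 1.0
        neg += weight * math.exp(s / temperature)
    return -math.log(pos / (pos + neg))

sim_row = [0.9, 0.5, 0.1]  # toy similarities; index 0 is the positive
plain = entity_weighted_info_nce(sim_row, 0, [False, False, False])
stricter = entity_weighted_info_nce(sim_row, 0, [False, True, False])
```

Flagging a negative as entity-sharing raises the loss for the same similarities, so the model is pushed harder to separate items that merely mention the same cultural term.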

[0029] Preferably, the Mongolian retrieval intent-specific parsing module includes a Mongolian cultural intent classification submodule. The cultural intent classification submodule can identify cultural retrieval intents characteristic of Mongolian, which include Mongolian ancient book retrieval, Mongolian epic query, and Mongolian historical figure query. The cultural intent classification submodule is trained on a Mongolian query dataset labeled with cultural intents, using a text convolutional neural network or a lightweight Transformer.

[0030] The cross-modal semantic matching module is configured to execute a dynamic differential thresholding strategy, setting independent similarity filtering thresholds for each modality candidate result based on the Mongolian cultural intent classification results and the differences in feature quality and reliability of different modalities.
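The dynamic differential thresholding strategy can be sketched as a lookup of per-modality thresholds that are adjusted by the classified intent. All threshold values, adjustment values, and intent labels below are hypothetical placeholders; the patent states only that each modality gets an independent threshold that depends on the intent and on modality quality.

```python
# Hypothetical base thresholds reflecting the assumed reliability of each modality.
BASE_THRESHOLDS = {"text": 0.60, "image": 0.50, "speech": 0.45}

# Hypothetical per-intent adjustments, e.g. relax the image threshold
# when the user is looking for ancient books.
INTENT_ADJUSTMENTS = {
    "ancient_book_retrieval": {"image": -0.05},
    "epic_query": {"speech": -0.05},
    "historical_figure_query": {"text": 0.05},
}

def thresholds_for(intent):
    """Return the per-modality similarity thresholds for a given intent."""
    adjust = INTENT_ADJUSTMENTS.get(intent, {})
    return {m: round(t + adjust.get(m, 0.0), 4) for m, t in BASE_THRESHOLDS.items()}

def filter_candidates(candidates, intent):
    """Keep candidates that pass their own modality's threshold, best first."""
    thresholds = thresholds_for(intent)
    kept = [c for c in candidates if c["score"] >= thresholds[c["modality"]]]
    return sorted(kept, key=lambda c: c["score"], reverse=True)

candidates = [
    {"id": "t1", "modality": "text", "score": 0.62},
    {"id": "i1", "modality": "image", "score": 0.47},
    {"id": "s1", "modality": "speech", "score": 0.44},
]
book_results = filter_candidates(candidates, "ancient_book_retrieval")
figure_results = filter_candidates(candidates, "historical_figure_query")
```

The same candidate list yields different results for different intents: the image hit survives only when the intent relaxes the image threshold.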

[0031] Preferably, the Mongolian scene feedback optimization module is used to collect Mongolian scene deep interaction data, which includes the deep browsing time of Mongolian ancient book images and the annotation behavior of specific Mongolian cultural content.

[0032] The feedback optimization module is configured to use the deep interaction data as high-quality positive samples to perform small-sample, low-perturbation incremental parameter updates on the Mongolian cross-modal semantic mapping module at a learning rate far lower than that of standard training. The incremental parameter updates employ elastic weight consolidation or knowledge distillation techniques to avoid catastrophic forgetting.
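For reference, elastic weight consolidation is commonly written as a quadratic penalty that anchors each parameter to its previously trained value. The symbols below follow that standard formulation and do not appear in the patent: $\mathcal{L}_{\text{new}}$ is the loss on the new interaction samples, $\theta^{*}$ the parameters before the incremental update, $F_i$ the diagonal Fisher information for parameter $i$, and $\lambda$ a weighting coefficient.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda}{2} \sum_{i} F_i \left(\theta_i - \theta_i^{*}\right)^2
```

Parameters with large $F_i$ were important for the earlier training data, so they are held close to $\theta_i^{*}$, while less important parameters remain free to adapt to the feedback samples.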

[0033] The feedback optimization module is also configured to trigger an update of the rule base used by the Mongolian multimodal preprocessing module based on the user's direct error correction behavior on the results of variant character recognition and dialect transcription.

[0034] Furthermore, the pre-trained models used in the Mongolian cross-modal semantic mapping module and the Mongolian retrieval intent-specific parsing module are lightweight Mongolian-specific models obtained through model compression and knowledge transfer. The lightweight model is obtained from a large pre-trained model containing Mongolian cultural knowledge through a knowledge transfer method, which includes knowledge distillation, model pruning, or parameter quantization. The knowledge transfer process focuses on enabling the lightweight model to inherit and maintain the large model's ability to understand and represent Mongolian culturally loaded words and unique semantic structures.

[0035] The lightweight model is specifically designed based on the characteristics of the Mongolian language and the requirements of the target deployment environment. It features reduced model complexity and number of parameters, and is optimized through model compression technology to adapt to efficient deployment and real-time inference in computing-constrained environments.

[0036] The technical effects and advantages of this invention are as follows. The Mongolian multimodal preprocessing module normalizes and structures texts from multiple writing systems, images of ancient books, and dialectal speech, solving the problems of unreadable data and semantic fragmentation caused by inconsistent writing systems, non-standard ancient book forms, and dialectal differences in the original data. The Mongolian cross-modal semantic mapping module performs deep semantic encoding of multimodal content by constructing a culturally enhanced unified semantic space, solving the problems of weak semantic representation of Mongolian culturally loaded words and missing cross-modal associations under low-resource conditions. The Mongolian retrieval intent-specific parsing module accurately interprets users' colloquial and mixed queries and classifies their cultural intent, solving the problem that general semantic matching cannot adapt to the search habits and cultural needs of Mongolian users. The cross-modal semantic matching module screens and ranks multimodal results through a dynamic differentiated threshold strategy, solving the bottleneck that a single threshold cannot account for the quality differences between modalities. The Mongolian scene feedback optimization module performs small-sample incremental learning and rule updating based on deep interaction behavior, solving the problem that the system cannot continuously improve itself in low-resource scenarios. The system as a whole provides intelligent services for Mongolian multimodal cultural resources, from perception and understanding to retrieval, and establishes a complete technical paradigm for adapting and applying cross-modal semantic retrieval technology in low-resource ethnic language and cultural scenarios.

Attached Figure Description

[0037] The disclosure of this invention is illustrated with reference to the accompanying drawings. It should be understood that the drawings are for illustrative purposes only and are not intended to limit the scope of protection of this invention. In the drawings, the same reference numerals are used to refer to the same parts:
Figure 1 is a system block diagram of the present invention;
Figure 2 is a schematic diagram of the architecture of the Mongolian multimodal preprocessing module of the present invention;
Figure 3 is a schematic diagram of the process of Embodiment 1 of the present invention;
Figure 4 is a schematic diagram of the process of Embodiment 2 of the present invention.

Detailed Implementation

[0038] It is readily understood that, based on the technical solution of this invention, those skilled in the art can propose various interchangeable structural methods and implementations without altering the essential spirit of the invention. Therefore, the following detailed embodiments and accompanying drawings are merely illustrative examples of the technical solution of this invention and should not be considered as the entirety of the invention or as limitations or restrictions on the technical solution of this invention.

[0039] This invention provides a Mongolian cross-modal content semantic retrieval and intent matching system. Addressing inherent technical challenges in Mongolian, such as the coexistence of multiple writing systems, the scarcity of high-quality annotated data, the non-standardization of ancient texts, and the deep semantics of culturally loaded words, this system constructs an end-to-end specialized technical framework. The system is not a simple parameter adaptation of a general multimodal retrieval framework, but a deep transformation of the entire chain from data input and feature representation through semantic understanding to interactive feedback. As shown in Figure 1, the core components of the system include a Mongolian multimodal preprocessing module, a Mongolian cross-modal semantic mapping module, a Mongolian multimodal semantic feature library, a dedicated Mongolian retrieval intent parsing module, a cross-modal semantic matching module, and an optional Mongolian scene feedback optimization module. The specific implementation methods of each module are described in detail below.

[0040] 1. Mongolian multimodal preprocessing module

This module serves as the system's data entry point and the foundation for quality assurance. Its design stems from the unique characteristics of Mongolian data. General preprocessing tools face three major limitations when processing Mongolian text: first, they cannot handle the semantic fragmentation caused by the coexistence of multiple writing systems such as traditional Uyghur Mongolian, Cyrillic Mongolian, and Tod script; second, they cannot process vertically formatted, blurry, or incomplete images of ancient books; and third, they make serious errors when transcribing the pronunciation of dialects such as Oirat and Bargu. Therefore, this module must normalize and structure the raw data at the source.

[0041] This module integrates three dedicated sub-processing units, whose organizational structure is shown in Figure 2. Figure 2 shows the parallel processing relationships between the multi-writing system representation alignment submodule, the ancient book image full-link processing submodule, and the dialect transcription text semantic normalization submodule, as well as the detailed data flow within each submodule.

[0042] The multi-writing system representation alignment submodule takes as input text in any of the following writing systems: traditional Uyghur Mongolian, Cyrillic Mongolian, or Tod script. This text may originate from digitized documents, web scraping, or user input. The processing is based on contrastive learning and includes the following steps. First, a dual-tower neural network is constructed, with two towers that share the same structure but have independent parameters, each processing text from a different writing system. Training is performed using a small aligned parallel corpus, such as traditional Mongolian and Cyrillic Mongolian versions of the same news content. During training, semantically identical sentence pairs are used as positive samples, and semantically different sentence pairs as negative samples. By optimizing the triplet loss function or contrastive loss function, the distance between positive sample pairs is reduced and the distance between negative sample pairs is increased. The margin of the triplet loss function is set to 0.2. After training, any input text in a single writing system is encoded by the corresponding tower, and the output is a unified Mongolian semantic vector decoupled from the specific characters. Thus, a word in the traditional Mongolian writing system ... and its synonymous variant in the Cyrillic Mongolian script, Хан, obtain a consistent semantic starting point in subsequent system processing.

[0043] The input to the ancient book image full-link processing submodule is scanned images of ancient books that are vertically laid out, blurred, or incomplete, in lossless formats such as TIFF or PNG and with a resolution of at least 300 DPI to preserve details. The processing follows a dedicated, cascaded pipeline.
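The triplet objective with the stated margin of 0.2 can be illustrated with a minimal sketch. The patent does not name the distance metric, so Euclidean distance is assumed here; the function names and the toy 2-D embeddings are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings (made-up values, not from the patent).
trad_vec = [0.9, 0.1]   # sentence encoded by the traditional-script tower
cyr_same = [0.8, 0.2]   # same meaning, encoded by the Cyrillic tower
cyr_other = [0.1, 0.9]  # different meaning, encoded by the Cyrillic tower

well_separated = triplet_loss(trad_vec, cyr_same, cyr_other)
violating = triplet_loss(trad_vec, cyr_other, cyr_same)
```

In the first call the cross-script pair with the same meaning is already much closer than the unrelated pair, so the loss is zero. In the second call the roles are swapped and the loss is positive, which is the signal that would pull matching pairs together during training.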

[0044] First, image enhancement and geometric correction are performed. Specifically, an ink-mark segmentation model based on the U-Net architecture is used to separate the ink marks from the background. This model consists of an encoder and a decoder. The encoder contains four downsampling stages, each using two 3×3 convolutions with ReLU activation, with channel numbers of 64, 128, 256, and 512 respectively. Each stage is followed by a 2×2 max pooling layer, and the bottleneck layer has 1024 channels. The decoder contains four upsampling stages, each using a 2×2 transposed convolution whose output is concatenated with the corresponding encoder features and then followed by two 3×3 convolutions. The model is trained on a dataset of 5000 manually annotated Mongolian ancient book ink-mark images, using a hybrid loss function of Dice loss and binary cross-entropy loss, with Adam as the optimizer, an initial learning rate of 1e-4, and 100 epochs of training. The trained model mitigates problems such as paper yellowing and ink smudging. Subsequently, using projection transformation and an algorithm based on the prior of Mongolian vertical layout, the Radon transform is used to detect the main tilt angle of the text lines and perform rotation correction. Then, the projection profile method, combined with the line spacing features of Mongolian vertical layout, is used to complete accurate text line segmentation.
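The projection profile step for vertical layout can be sketched on a binarized page. Because Mongolian text lines run top to bottom, the sketch sums ink per pixel column and cuts at empty gaps. The `min_gap` parameter stands in for the line spacing prior and is a hypothetical simplification, as are the function name and the toy page.

```python
def segment_vertical_lines(binary, min_gap=1):
    """Split a binarized page into vertical text lines.

    binary:  list of pixel rows, where 1 marks ink and 0 marks background
    min_gap: number of consecutive empty columns required to end a line
    Returns a list of (start_col, end_col_exclusive) spans.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    lines, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lines.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # The page ended before the gap was long enough to close the line.
        lines.append((start, width - gap))
    return lines

# Toy 3x7 page with two vertical lines: columns 1-2 and column 5.
page = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
]
spans = segment_vertical_lines(page)
```

A production version would smooth the profile and use a gap threshold estimated from the page, since real scans rarely have perfectly empty columns between lines.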

[0045] In geometric correction, considering the characteristics of connected strokes and directionality of Mongolian characters, not only global rotation but also local deformation correction is performed. Specifically, by detecting the skeleton lines of the connected components of the characters and fitting curves, the local distortion field is calculated to reverse-correct local character deformations caused by paper wrinkles. In noise reduction, a threshold segmentation algorithm adapted to the contrast features of Mongolian ink marks and the background is used to enhance specific color channels of yellowed and faded ancient book backgrounds and dark ink marks, in order to more accurately separate the strokes of Mongolian characters.

[0046] Next, dedicated OCR recognition is performed. The corrected text line images are input into a dedicated OCR model for Mongolian ancient books. This model uses TrOCR as the baseline architecture: its visual encoder employs a Vision Transformer that splits the image into a sequence of 16×16 patches and outputs image features after 12 layers of Transformer encoding, and its text decoder is a 6-layer Transformer decoder that generates character sequences autoregressively. The model's output vocabulary is specially expanded by adding approximately 2000 variant characters and phonetic loan characters from ancient books to the general Mongolian vocabulary, bringing the total vocabulary size to 5000 characters. The model is trained on a dataset of 100,000 ancient book line images paired with standard text. This dataset is constructed through a combination of manual annotation and data augmentation, including operations such as blur simulation, rotation, and incompleteness simulation. Training uses the cross-entropy loss function, the AdamW optimizer, a learning rate of 5e-5, a batch size of 32, and a total of 50 epochs.

[0047] Finally, text structuring is performed. For the long, punctuation-free strings output by OCR, a Bi-LSTM-based sequence labeling model is used for automatic sentence segmentation and preliminary word segmentation conforming to the grammar of ancient Mongolian texts. The model takes a sequence of character embeddings as input and outputs a label for each character, including categories such as word beginning, word middle, word end, sentence beginning, sentence middle, and sentence end. The model consists of two bidirectional LSTM layers with a hidden layer dimension of 256, followed by a fully connected layer and a Softmax output. Training data is generated through rule simulation based on common sentence-final function words and phrase structure patterns found in the grammar of ancient Mongolian texts.

[0048] In terms of semantic enhancement, this module also integrates a visual-text association enhancer for Mongolian ancient text terms. After OCR recognition, this enhancer performs a secondary matching and verification between the image region features of the identified suspected proper nouns from ancient texts and the text description vectors of the corresponding terms in the Mongolian semantic knowledge base. Proper nouns from ancient texts include tribal names, ancient place names, etc. If the cosine similarity is lower than a preset threshold of 0.7, a context-based glyph review is triggered, and alternative recognition results are called for re-evaluation, thereby improving the accuracy of key culturally loaded words recognition. As a result, this submodule ultimately outputs clean, structured Mongolian text and a deep visual feature vector extracted from the intermediate layer of the U-Net encoder, wherein the visual feature vector has a dimension of 512.
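The secondary verification above can be illustrated with a short sketch. It assumes the image region feature and the term description vector have already been projected to the same dimension; the function names are hypothetical, and only the 0.7 threshold comes from the description.

```python
import math
from typing import Sequence

REVIEW_THRESHOLD = 0.7  # preset threshold stated in the description


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_glyph_review(region_vec: Sequence[float],
                       term_vec: Sequence[float],
                       threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the recognized proper noun should go to context-based glyph review."""
    return cosine_similarity(region_vec, term_vec) < threshold
```

For example, `needs_glyph_review([1.0, 0.0], [0.0, 1.0])` flags the term for review, because orthogonal vectors have a similarity of 0.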

[0049] The dialect transcription text semantic normalization submodule takes as input the initial transcribed text of Oirat, Barhu, and other dialects produced by a general speech recognition model. This speech recognition model employs an end-to-end Transformer-based architecture, is trained on a dataset containing 300 hours of multi-dialect speech, and outputs the original transcribed text. This initial transcribed text often contains dialect vocabulary, such as the Oirat dialect word… and its corresponding standard-language form. The normalization is accomplished by a sequence-to-sequence model in which both the encoder and the decoder are 3-layer Transformers with a hidden size of 256 and 4 attention heads. The model is trained on a dataset of 50,000 parallel sentences pairing spoken dialect text with standard written Mongolian, constructed through manual collection and alignment. Training uses label-smoothed cross-entropy loss, the Adam optimizer, a learning rate of 3e-4, a batch size of 64, and a total of 30 epochs. Decoding employs a beam search strategy with a beam width of 4. Thus, this submodule outputs text conforming to standard written language specifications, eliminating the lexical gap caused by dialect variations in subsequent semantic matching.
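The label-smoothed cross-entropy used to train the normalization model can be sketched for a single decoding position as follows. The smoothing factor is not given in the description; `epsilon=0.1` is an assumed, commonly used value, and the uniform-over-all-classes smoothing shown here is one standard formulation.

```python
import math
from typing import Sequence


def label_smoothed_ce(logits: Sequence[float], target: int, epsilon: float = 0.1) -> float:
    """Cross-entropy against a smoothed target distribution for one position.

    The target class receives (1 - epsilon) plus its share of epsilon, and
    epsilon is spread uniformly over all classes. epsilon is an assumption.
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]

    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * lp
    return loss
```

With `epsilon=0.0` the function reduces to ordinary cross-entropy; a positive `epsilon` penalizes over-confident predictions, which tends to help when parallel data is limited.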

[0050] 2. Mongolian cross-modal semantic mapping module. This module, acting as the system's understanding hub, is responsible for transforming preprocessed multimodal data into machine-understandable representations rich in cultural semantics. General multimodal models suffer from severely insufficient representational capabilities in this scenario due to a lack of Mongolian language knowledge. The root cause lies in the scarcity of high-quality labeled data for Mongolian, a low-resource language, and the difficulty in obtaining sufficient training on culturally loaded words within general corpora. Therefore, this module must design a specialized knowledge injection mechanism and semantic alignment strategy to construct a unified vector space capable of deeply understanding Mongolian cultural semantics.

[0051] This module employs a dual-tower contrastive learning architecture, consisting of a content tower and a query tower. The content tower processes the multimodal content to be retrieved, generating content vectors and storing them in a feature library; the query tower processes the user's input retrieval request, generating query vectors for matching. Its detailed architecture is shown in Figure 3.

[0052] The content tower contains three parallel modality-specific coding branches: text coding branch, image coding branch, and speech coding branch.

[0053] The text encoding branch is centered on a pre-trained Mongolian language model enhanced with cultural knowledge. This model employs a Transformer architecture configured with 12 Transformer encoding layers, a hidden size of 768, 12 attention heads, a 3072-dimensional feedforward network, and a maximum sequence length of 512 tokens. The model's vocabulary has been specially expanded by adding 15,000 Mongolian-specific subwords to the general multilingual model vocabulary, bringing the total vocabulary size to 265,002 lexical units.

[0054] The image encoding branch adopts the Vision Transformer architecture, which segments the input ancient book image into a 16×16 patch sequence, obtains a 768-dimensional vector through linear projection, and then outputs the image features after 12 layers of Transformer encoding.

[0055] The speech coding branch uses the HuBERT acoustic model to extract speech features and outputs a 1024-dimensional feature vector. Then, a fully connected layer is used to reduce the dimension to align with the text features.

[0056] The outputs of the three branches are each passed through an independent linear projection layer that maps the features to a unified 256-dimensional space. Finally, a learnable weighted fusion layer fuses the three features to obtain the final content vector representation. The fusion weights are three learnable scalar parameters, automatically updated through backpropagation, with initial values all set to 1.0. After Softmax normalization, they are used for weighted summation.
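The weighted fusion described above can be sketched in a few lines. This sketch covers only the forward pass on already projected vectors; the function names are illustrative, and the learning of the three scalars by backpropagation is not shown.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(text_vec: Sequence[float],
         image_vec: Sequence[float],
         speech_vec: Sequence[float],
         raw_weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Weighted sum of three same-dimension vectors.

    raw_weights stands in for the three learnable scalars; the default
    (1.0, 1.0, 1.0) is the initial value stated in the description.
    """
    w_text, w_image, w_speech = softmax(raw_weights)
    return [w_text * t + w_image * i + w_speech * s
            for t, i, s in zip(text_vec, image_vec, speech_vec)]
```

At initialization the Softmax of three equal scalars yields weights of 1/3 each, so the fused vector starts as the plain average of the three modality vectors.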

[0057] The query tower contains only a text encoding branch, whose structure is identical to that of the content tower's text encoding branch but whose parameters are independent. The input to the query tower is the user's query text, and the output is a 256-dimensional query vector.

[0058] The training of this module is divided into two stages. The first stage is the pre-training stage, which enables the text encoder to acquire Mongolian cultural semantic knowledge. Pre-training starts from a multilingual model such as XLM-RoBERTa, which is further trained on 20GB of cleaned mixed text consisting of modern Mongolian and ancient text transcriptions. In addition to standard masked language modeling, the pre-training task also introduces a Mongolian cultural term mask prediction task. Specifically, tokens in the training corpus that belong to the vocabulary of a pre-constructed Mongolian cultural entity library are randomly masked; this library contains approximately 12,000 cultural entities such as historical figures, tribes, place names, and ancient text terms. The model is forced to predict the masked entities from context. This process internalizes structured cultural knowledge into the model parameters. Pre-training uses the AdamW optimizer with a learning rate of 5e-5, a batch size of 256, a sequence length of 512, and 100,000 training steps.
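A minimal sketch of the cultural term masking step follows. It assumes whitespace-level tokens and single-token entities; the masking probability `mask_prob` is not stated in the description and is an assumed parameter, as are the function name and the `[MASK]` symbol.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

MASK = "[MASK]"  # assumed mask symbol


def mask_cultural_entities(tokens: Sequence[str],
                           entity_vocab: Set[str],
                           mask_prob: float = 0.5,
                           rng: Optional[random.Random] = None
                           ) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly mask tokens that appear in the cultural entity vocabulary.

    Returns the masked sequence and, per position, the original token to
    predict (None where nothing was masked).
    """
    rng = rng or random.Random()
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok in entity_vocab and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

A real implementation would operate on subword ids and mask whole multi-token entities together; the sketch only shows that masking is restricted to entity vocabulary rather than applied to arbitrary tokens.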

[0059] The second stage is the contrastive learning training stage, used to construct a unified semantic space across modalities. Training employs a cross-modal self-supervised contrastive learning strategy, eliminating the need for expensive manual alignment and annotation. It uses only image-text-speech triplets naturally derived from the same ancient text unit as positive samples, with approximately 500,000 triplets constructed in total. This strategy is carried out in multiple stages. The first stage, which is the pre-training stage described above, uses a large-scale Mongolian single-modal corpus and a masked language modeling task to train a basic Mongolian pre-trained language model.

[0060] The second stage involves fixing the text encoder and training the image encoder and speech encoder using the triplet data, aligning their output vectors to the corresponding text vectors. During training, a contrastive loss is used to narrow the distance between source image features and text features, and between speech features and text features.

[0061] The third stage involves jointly fine-tuning all encoders and introducing a cross-modal alignment loss function for Mongolian cultural entities. This loss function imposes a stricter penalty on negative sample pairs containing the same Mongolian cultural terms when computing negative samples within a batch, forcing the model to learn deeper Mongolian multimodal semantic associations beyond shallow features.

[0062] The specific loss function is an improved InfoNCE loss, which consists of the basic contrastive loss and a cultural semantic enhancement term. Its mathematical expression is $\mathcal{L} = \mathcal{L}_{\mathrm{con}} + \lambda \mathcal{L}_{\mathrm{cul}}$, where $\mathcal{L}_{\mathrm{con}}$ is the standard contrastive loss and $\mathcal{L}_{\mathrm{cul}}$ is the cultural enhancement term. The cultural enhancement term is calculated as follows: for each positive sample pair in the training batch, the number $n$ of entities that the pair shares in the cultural entity database is first identified. Then, when the similarity is calculated, a bonus is added to the cosine similarity $s$ of the positive sample pair; that is, the adjusted similarity is the original similarity plus the product of the weight $\alpha$ and $n$, i.e. $s' = s + \alpha n$. The weight $\alpha$ is set to 0.1, and the balancing weight $\lambda$ is set to 0.3. Contrastive learning training uses the AdamW optimizer, with a learning rate of 3e-5 for the text branch and 1e-4 for the image and speech branches. The batch size is 128, the temperature coefficient is 0.05, and training lasts for 50 epochs. Gradient clipping is applied with a maximum norm of 1.0.
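One possible reading of this loss can be sketched on a precomputed similarity matrix. The sketch assumes that the total loss is the standard InfoNCE loss plus a balancing weight (0.3) times an InfoNCE term computed with boosted positive-pair similarities, where the boost is 0.1 times the number of shared cultural entities. The names `alpha`, `lam`, and `shared_entities` and the exact way the two terms are combined are assumptions of this sketch, not confirmed by the description.

```python
import math
from typing import List, Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float) -> float:
    """Mean InfoNCE over rows; sim[i][i] is the positive pair for row i."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_z - scaled[i]          # -log softmax of the positive
    return total / len(sim)


def culture_enhanced_loss(sim: Sequence[Sequence[float]],
                          shared_entities: Sequence[int],
                          alpha: float = 0.1,
                          lam: float = 0.3,
                          temperature: float = 0.05) -> float:
    """Base contrastive loss plus a weighted, entity-boosted contrastive term."""
    base = info_nce(sim, temperature)
    # Boost only the positive (diagonal) similarities: s' = s + alpha * n.
    boosted: List[List[float]] = [
        [s + alpha * shared_entities[i] if i == j else s
         for j, s in enumerate(row)]
        for i, row in enumerate(sim)
    ]
    return base + lam * info_nce(boosted, temperature)
```

When no pair shares any cultural entity, the boosted term equals the base term and the loss is simply 1.3 times the standard InfoNCE value.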

[0063] Through the above design, this module achieves deep semantic understanding of Mongolian multimodal content. On the input side, the text encoding branch receives the Mongolian text sequence standardized by the preprocessing module, the image encoding branch receives enhanced and corrected images of ancient books, and the speech encoding branch receives speech features normalized to the local dialect. On the output side, the content tower outputs a 256-dimensional floating-point vector representing the position of the multimodal content in a unified semantic space; the query tower outputs a query vector of the same dimension, used for similarity calculation with the content vector. The cultural terminology mask prediction task injects Mongolian cultural entity knowledge into the model, and the cultural semantic enhancement term explicitly brings multimodal content involving the same cultural entities closer together during contrastive learning. The design of the image and speech encoding branches fully considers the characteristics of Mongolian ancient books and dialects, ensuring effective alignment of features across modalities in a unified space.

[0064] 3. Mongolian multimodal semantic feature library. This module serves as the system's memory repository, responsible for efficiently storing and organizing all feature vectors generated by the semantic mapping module. General systems using content IDs or simple keyword indexes cannot meet the demands of cultural queries for rapid location and semantic relevance. Therefore, this module must be designed with an index structure that conforms to the characteristics of Mongolian culture.

[0065] This feature library is implemented using the high-performance vector search library FAISS. Before vectors are added to the library, the system automatically clusters them by Mongolian cultural theme using the K-means clustering algorithm, based on the distribution of the vectors in the unified semantic space. The number of clusters is set to 100, and the cultural themes include categories such as epic literature, historical classics, religious philosophy, and folk art. Each cluster is assigned a cultural semantic label derived from the cultural entities that appear frequently in that cluster. This indexing method based on cultural themes differs fundamentally from general systems; it can significantly accelerate retrieval for cultural queries and improve the cultural relevance of the results. When new Mongolian content is processed by the aforementioned modules, its feature vectors are stored in the library in real time, and the index is updated. The index update adopts an incremental approach to avoid the computational overhead of full reconstruction.
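How a cluster might receive its cultural semantic label can be sketched as follows, assuming each item in a cluster carries a list of recognized cultural entities. The function name, the `top_k` parameter, and the sample entity names are illustrative; the K-means clustering and FAISS indexing themselves are not shown.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def cluster_label(entity_lists: Iterable[Sequence[str]], top_k: int = 2) -> List[str]:
    """Label a cluster with its most frequent cultural entities."""
    counts = Counter(entity for entities in entity_lists for entity in entities)
    return [entity for entity, _ in counts.most_common(top_k)]


# Entities attached to four items that K-means placed in the same cluster.
items_in_cluster = [
    ["Jangar", "epic"],
    ["Jangar"],
    ["epic", "Jangar"],
    ["Geser"],
]
print(cluster_label(items_in_cluster))
```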

[0066] 4. Dedicated parsing module for Mongolian search intent. This module serves as the intelligent interface for system-user interaction, specifically designed to parse the complex yet concise search habits of Mongolian users. General intent recognition models are ineffective against short queries, colloquial expressions, mixed Mongolian and Chinese input, and cultural metaphors, because they lack an understanding of the unique search behavior of Mongolian users. Therefore, this module must be designed with a parsing mechanism tailored to the characteristics of Mongolian. As shown in Figure 4, the module includes a mixed language parsing submodule, a semantic completion submodule, and a cultural intent classification submodule.

[0067] The module workflow is as follows. First, the mixed language parsing submodule processes the user's original query, for example a short mixed Mongolian-Chinese query for "Genghis Khan's laws." This submodule separates the Mongolian and Chinese text using a dictionary matching method; the dictionary contains a mapping between commonly used Mongolian and Chinese vocabulary. The identified Chinese text is translated into Mongolian using a bilingual terminology database of approximately 50,000 entries, while the identified Mongolian text is retained.
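The dictionary matching step can be sketched with a greedy longest-match scan. The single dictionary entry and the Cyrillic Mongolian sample query below are illustrative placeholders chosen for this sketch; the real bilingual terminology database and its lookup strategy are not specified in the description.

```python
from typing import Dict


def normalize_mixed_query(query: str, zh_to_mn: Dict[str, str]) -> str:
    """Replace Chinese terms found in the dictionary with their Mongolian
    translation (longest match first) and keep all other text unchanged."""
    keys = sorted(zh_to_mn, key=len, reverse=True)   # prefer longer terms
    pieces = []
    i = 0
    while i < len(query):
        for key in keys:
            if query.startswith(key, i):
                pieces.append(" " + zh_to_mn[key] + " ")
                i += len(key)
                break
        else:
            pieces.append(query[i])                  # retained as-is
            i += 1
    # Collapse the extra spaces introduced around translated terms.
    return " ".join("".join(pieces).split())


demo_dictionary = {"法律": "хууль"}   # illustrative entry only
print(normalize_mixed_query("Чингис хааны法律", demo_dictionary))
```

The output is a purely Mongolian query string that can then be passed to the semantic completion submodule.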

[0068] Next, the semantic completion submodule expands the separated Mongolian short query. This submodule utilizes historical query logs and a semantic knowledge base for completion. The historical query logs record the user's past search behavior, from which high-frequency co-occurring word pairs are extracted. The semantic knowledge base contains semantic relationships between Mongolian words, such as hyponyms, synonyms, and relatedness. The completion process employs a statistical collaborative filtering algorithm to calculate the relevance between the current query term and candidate expanded terms, selecting the top three terms with the highest relevance for expansion.
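As a simplified stand-in for the statistical relevance calculation, the sketch below scores candidate expansion terms by how often they co-occur with the query term in historical query logs and keeps the top three. Raw co-occurrence counting is an assumption of this sketch; the description does not give the exact relevance formula, and the semantic knowledge base is not modeled here.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def build_cooccurrence(query_log: Iterable[Sequence[str]]) -> Counter:
    """Count how often each unordered pair of terms appears in the same query."""
    pairs: Counter = Counter()
    for terms in query_log:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


def expand(term: str, pairs: Counter, top_k: int = 3) -> List[str]:
    """Return the top_k terms that co-occur most often with `term`."""
    scores: Counter = Counter()
    for (a, b), n in pairs.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [t for t, _ in scores.most_common(top_k)]
```

Given a log in which "khan" appears most often with "law", then "epic", then "map", `expand("khan", pairs)` returns those three terms in that order.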

[0069] Finally, the cultural intent classification submodule classifies the completed pure Mongolian queries. This submodule uses a TextCNN model based on multi-scale convolution. The model structure consists of three convolutional layers with kernel sizes of 3, 4, and 5, and 128 kernels per layer, followed by a global max pooling layer and a fully connected layer. The output layer uses the Softmax activation function to output the class probability. The model is trained on a labeled Mongolian query dataset containing 10,000 Mongolian queries. Each query is labeled with its cultural intent category by linguistic experts, including five categories: ancient book retrieval, epic queries, historical figures, folklore, and modern information. During training, the cross-entropy loss function is used, the optimizer is Adam, the learning rate is 1e-3, the batch size is 32, and the training is conducted for 20 epochs. An early stopping mechanism is used to prevent overfitting, with a patience value of 3 epochs. The accuracy of the trained model reaches over 92%.
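The early stopping rule with a patience of 3 epochs can be sketched independently of the TextCNN itself. The sketch assumes that validation loss is the monitored quantity, which the description does not state; the function name is illustrative.

```python
from typing import Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 3) -> int:
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs through
    all provided epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)
```

In practice the weights from the best epoch would be restored after stopping, so that the deployed classifier corresponds to the lowest validation loss.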

[0070] Finally, the classified query text is transformed into a query semantic vector by the text encoder of the Mongolian cross-modal semantic mapping module, which is the aforementioned culture-enhanced pre-trained model. Thus, when a user query uses a phrase such as "...", the system can accurately capture the deep cultural intent it refers to, such as "epic hero".

[0071] 5. Cross-modal semantic matching module. This module serves as the system's decision engine, responsible for accurately locating relevant information within the feature library. General systems employ a single, fixed threshold, which cannot simultaneously account for the differences in feature quality across different modalities and the requirements of cultural relevance. Therefore, this module must design a dynamically differentiated matching strategy.

[0072] The retrieval process employs a two-stage strategy, including coarse screening and fine ranking. The coarse screening stage utilizes the clustering index of the feature library to quickly locate the cultural theme clusters most relevant to the query intent. For example, if the query is categorized as "ancient book retrieval," the search is prioritized within the "historical classics" cluster. The fine ranking stage calculates the cosine similarity between the query vector and all candidate multimodal feature vectors within this cluster.
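The two-stage strategy can be illustrated with an in-memory sketch. The mapping from intent category to cluster names contains only the example given above; the data layout (a dict from cluster label to `(content_id, vector)` pairs) and the function names are assumptions of this sketch and stand in for the FAISS-backed index.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Intent category -> cultural theme clusters to search first (example from the text).
INTENT_TO_CLUSTERS = {"ancient book retrieval": ["historical classics"]}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_search(query_vec: Sequence[float],
                     intent: str,
                     clusters: Dict[str, List[Tuple[str, Sequence[float]]]],
                     top_k: int = 5) -> List[Tuple[str, float]]:
    """Coarse screening by intent-selected clusters, then cosine fine ranking."""
    scored = []
    for name in INTENT_TO_CLUSTERS.get(intent, []):          # coarse screening
        for content_id, vec in clusters.get(name, []):
            scored.append((cosine(query_vec, vec), content_id))
    scored.sort(reverse=True)                                 # fine ranking
    return [(content_id, score) for score, content_id in scored[:top_k]]
```

Only vectors inside the selected clusters are compared with the query, which is what keeps the fine ranking stage cheap relative to a scan of the full library.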

[0073] This section introduces a dynamic, differentiated threshold strategy. Instead of using a single, fixed threshold, the system dynamically sets adaptive similarity filtering thresholds for candidate results across different modalities, based on the cultural intent category of the query and the inherent feature quality and noise levels of each modality. For example, considering potential recognition errors in OCR of ancient book images, the similarity threshold for the image modality is set slightly lower than that for the text modality; considering that even after dialect speech normalization, a small amount of error may remain, the threshold for the speech modality is set slightly lower than that for the text modality but higher than that for the image modality. Threshold adjustments are made based on historical retrieval performance statistics and are updated every two weeks. Simultaneously, in the final ranking score, results sharing cultural entities with the query are given additional weight. This additional weight is calculated based on the number of shared entities, with each shared entity adding a 0.05 similarity bonus. Through this process, the final output of the system is a list aggregated from multiple modalities (image, text, and audio) and sorted by relevance, with highly culturally relevant content prioritized.
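The thresholding and bonus rule can be sketched as follows. The numeric thresholds are hypothetical; only their ordering (text above speech above image) and the 0.05 bonus per shared entity come from the description. Applying the threshold to the raw similarity before the bonus is added is also an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical values; the description fixes only the ordering text > speech > image.
THRESHOLDS: Dict[str, float] = {"text": 0.60, "speech": 0.55, "image": 0.50}
ENTITY_BONUS = 0.05  # similarity bonus per shared cultural entity


def rank(candidates: Iterable[dict], query_entities: Set[str]) -> List[Tuple[str, float]]:
    """Filter candidates by per-modality threshold, then sort by boosted score.

    Each candidate is a dict with keys: id, modality, similarity, entities.
    """
    results = []
    for c in candidates:
        if c["similarity"] < THRESHOLDS[c["modality"]]:
            continue                                   # below its modality threshold
        shared = len(set(c["entities"]) & query_entities)
        results.append((c["similarity"] + ENTITY_BONUS * shared, c["id"]))
    results.sort(reverse=True)
    return [(cid, score) for score, cid in results]
```

With these values, an image result with similarity 0.58 that shares one entity with the query passes its threshold and outranks a speech result with similarity 0.62 and no shared entity, while a text result with similarity 0.58 is filtered out.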

[0074] 6. Mongolian language scene feedback optimization module. This module endows the system with the ability to continuously evolve, and specifically designs an efficient closed-loop optimization mechanism to address the scarcity of Mongolian data. General-purpose systems rely on large amounts of labeled data for periodic retraining, which makes effective iteration difficult in low-resource scenarios. Therefore, this module must design a few-shot learning and real-time rule update mechanism based on deep behavioral data.

[0075] The module continuously collects deep interaction data from users on the results page. This data is specific to the Mongolian language context, including zooming in on an ancient book image for more than ten seconds, repeatedly playing a segment of dialect audio more than twice, and actively labeling search results as culturally relevant or irrelevant. These deep signals reflect users' true satisfaction more accurately than simple clicks. The system compiles high-quality positive feedback samples into small-batch training data weekly. Positive feedback samples are defined as results of deep user interaction and their corresponding queries. The model parameters in the Mongolian cross-modal semantic mapping module are incrementally fine-tuned with an extremely low learning rate of 1e-6. During fine-tuning, only the parameters of the projection and fusion layers are updated, while the underlying Transformer encoder remains fixed to avoid catastrophic forgetting. To enhance the stability of incremental learning, elastic weight consolidation techniques can be used to constrain important parameters and prevent them from deviating from their original values; or knowledge distillation techniques can be used, with the original model acting as the teacher to guide student models in learning new samples while retaining old knowledge.
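The selection of positive feedback samples can be sketched as a simple rule over logged interactions. The ten-second and more-than-twice criteria come from the description; the record layout, the names, and the choice to let an explicit user label override the implicit signals are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    zoom_seconds: float = 0.0               # time spent zoomed in on an image
    audio_replays: int = 0                  # times a dialect audio segment was played again
    marked_relevant: Optional[bool] = None  # explicit user label, if any


def is_positive_feedback(event: Interaction) -> bool:
    """True when the interaction counts as a high-quality positive sample."""
    if event.marked_relevant is not None:
        return event.marked_relevant        # explicit label takes precedence (assumption)
    return event.zoom_seconds > 10.0 or event.audio_replays > 2
```

Samples that pass this rule would be paired with their originating query and collected into the weekly small-batch fine-tuning set.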

[0076] Meanwhile, users' direct corrections to OCR variant character recognition or dialect transcription results immediately trigger online updates to the corresponding rule base or dictionary in the Mongolian multimodal preprocessing module. Specifically, when a user corrects a variant character recognition error, the system adds the mapping relationship between the variant character and the correct standard character to the variant character mapping table; when a user corrects a dialect transcription error, the system adds the correspondence between the dialect word and the standard word to the dialect vocabulary mapping table. These mapping relationships are officially adopted only after accumulating confidence and reaching a certain threshold, in order to avoid interference from single errors.
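The confidence accumulation rule can be sketched with a simple counter. Modeling confidence as the number of identical corrections, and the default adoption threshold of 3, are assumptions of this sketch; the description only states that a mapping is adopted after its confidence reaches a certain threshold.

```python
from collections import defaultdict
from typing import Dict, Tuple


class CorrectionTable:
    """Mapping table that adopts a correction only after repeated confirmation."""

    def __init__(self, adopt_threshold: int = 3) -> None:
        self.adopt_threshold = adopt_threshold          # hypothetical value
        self._votes: Dict[Tuple[str, str], int] = defaultdict(int)
        self.adopted: Dict[str, str] = {}

    def report(self, observed: str, corrected: str) -> None:
        """Record one user correction; adopt it once enough have accumulated."""
        self._votes[(observed, corrected)] += 1
        if self._votes[(observed, corrected)] >= self.adopt_threshold:
            self.adopted[observed] = corrected

    def apply(self, token: str) -> str:
        """Return the adopted standard form, or the token unchanged."""
        return self.adopted.get(token, token)
```

The same structure could serve both the variant character mapping table and the dialect vocabulary mapping table, with a single erroneous correction having no effect on output.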

[0077] Furthermore, to adapt to the low-computing-power environment of grassroots venues in ethnic minority areas, this module also supports converting the core semantic model into a lightweight version through model compression and knowledge transfer techniques. The model compression techniques include knowledge distillation, model pruning, and parameter quantization.

[0078] Knowledge distillation: A large pre-trained model enhanced with cultural knowledge is used as the teacher model, and a miniaturized Transformer is used as the student model. The student model is configured with 6 encoding layers, a hidden size of 384, and 8 attention heads. During distillation, both soft-label loss and hard-label loss are used. The soft-label loss allows the student model to mimic the output distribution of the teacher model, while the hard-label loss ensures that the student model remains accurate on the true labels.
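The combined distillation objective can be sketched for a single example. The temperature `T`, the mixing weight `alpha`, and the conventional `T * T` scaling of the soft term are assumptions of this sketch; the description states only that a soft-label loss and a hard-label loss are used together.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], T: float = 1.0) -> List[float]:
    """Temperature-scaled, numerically stable softmax."""
    scaled = [v / T for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      target: int,
                      T: float = 2.0,
                      alpha: float = 0.5) -> float:
    """alpha * soft-label KL(teacher || student) + (1 - alpha) * hard-label CE."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student))
    hard = -math.log(softmax(student_logits)[target])
    return alpha * soft * T * T + (1.0 - alpha) * hard
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the hard-label cross-entropy remains.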

[0079] Model pruning: Perform structured pruning on the trained model to remove redundant attention heads and neurons, thereby reducing the number of parameters.

[0080] Parameter quantization: The model weights are quantized from 32-bit floating-point to 8-bit integers, further compressing the model size.
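A minimal sketch of 8-bit weight quantization follows. It assumes symmetric per-tensor quantization with a single scale factor, which is one common scheme; the description does not specify which quantization scheme is used.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the source of the small accuracy loss traded for a model roughly a quarter the size of its 32-bit version.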

[0081] After the above compression, the final model size is reduced by 70%, the inference speed is increased by 200%, and real-time retrieval response in the hundreds of milliseconds can be achieved on edge devices such as Jetson Nano, while ensuring that the core capabilities of cultural semantic understanding are not distorted.

[0082] Through the collaborative work of the above modules, the present invention systematically constructs a complete technical closed loop from data understanding, semantic representation, intent parsing to result matching and continuous optimization, providing a specialized solution for the digital utilization of Mongolian multimodal cultural resources.

[0083] Embodiment 1: As shown in Figure 3, this embodiment demonstrates the system's core search capabilities when the feedback optimization module is not enabled. For example, a user wants to find information about "Genghis Khan's ascension to the throne" in the ancient book "The Secret History of the Mongols" but remembers it only vaguely, and therefore enters a mixed Mongolian-Chinese query: … "Ascension pictures and recordings". This query has the typical characteristics of mixed Mongolian-Chinese input and short queries, and involves multimodal requirements such as ancient book images and speech.

[0084] Step 1: Processing by the Mongolian multimodal dedicated preprocessing module. The query entered by the user first enters the Mongolian multimodal dedicated preprocessing module. This module does not simply perform character conversion; instead, through its built-in multi-writing system representation alignment submodule, it performs semantic unification processing on the Mongolian part of the query. Regardless of whether this vocabulary is written in traditional Mongolian or Cyrillic Mongolian, the submodule maps it to a shared semantic vector decoupled from the writing form, ensuring that subsequent processing is not affected by the appearance of the text. At the same time, the Chinese part "Ascension" in the query is translated into the corresponding Mongolian vocabulary through the Chinese-Mongolian bilingual term library and merged with the original Mongolian part to form a complete standard Mongolian query phrase. This process overcomes the fundamental defect of the prior art, in which synonymous texts in different scripts cannot be associated because the multiple writing systems are ignored.

[0085] Step 2: Processing by the Mongolian retrieval intent dedicated parsing module. The merged standard Mongolian query is sent to the Mongolian retrieval intent dedicated parsing module. This module first uses the cultural intent classification model to identify the intent of the query. The model is trained on a large pre-annotated Mongolian query corpus and can determine that this query belongs to the composite cultural intent of "ancient book retrieval" combined with "historical figures". Subsequently, the module performs semantic completion on the short query. For example, it associates "Genghis Khan" and "Ascension" with specific ancient book names such as "The Secret History of the Mongols" and further expands relevant semantic clues such as "the ceremony of the Great Khan's accession to the throne". General intent matching systems often simply split such queries into keywords, resulting in semantic loss, while this module captures the user's deep needs through cultural knowledge injection and context understanding.

[0086] Step 3: Collaboration of the Mongolian cross-modal semantic mapping module. The parsed query vectors are fed into a unified semantic space constructed by the Mongolian cross-modal semantic mapping module. This space was formed through self-supervised learning from a large amount of source Mongolian ancient texts during the offline phase, where features from text, image, and speech modalities are mapped to the same low-dimensional vector space. Notably, a cultural semantic enhancement loss function was employed during training, resulting in close clustering of different modalities involving the same cultural entities, such as "Genghis Khan" and "coronation ceremony," within the space. For example, illustrations depicting the coronation scene in ancient texts, related Mongolian textual records, and recordings of experts recounting the event, despite their different forms, are semantically adjacent to each other.

[0087] Step 4: Retrieval by the cross-modal semantic matching module. After receiving the query vector, the cross-modal semantic matching module first performs a rapid coarse screening in the Mongolian multimodal semantic feature library. This feature library has been clustered by cultural theme, so the system can quickly locate the clusters related to "historical classics" and "imperial biographies". In the fine ranking stage, the module applies a dynamic differentiated threshold strategy: considering that there may be OCR recognition errors in ancient book images, the similarity threshold for image modality is set slightly lower than that for text modality. At the same time, according to the cultural intent category of the query, candidate results containing the core cultural entity "Genghis Khan" are given additional weight. Finally, the system outputs a multimodal result list sorted by relevance, including ancient book images of relevant pages of "The Secret History of the Mongols", Mongolian original text passages recording the coronation ceremony, and audio narration of the event generated by speech synthesis or real recording. Users can obtain complete information with images, text, and audio in one stop without switching between multiple platforms.

[0088] Embodiment 2: As shown in Figure 4, this embodiment, based on Embodiment 1, introduces the Mongolian scene feedback optimization module to form a self-evolving closed-loop system. When, after a single search, a user engages deeply with an ancient book image in the returned results by zooming in to view image details and lingering for more than fifteen seconds, the system treats this behavior as a strong positive feedback signal.

[0089] Step 5: Intervention of the Mongolian scene feedback optimization module. The Mongolian scene feedback optimization module continuously collects in-depth user interaction data, including but not limited to image viewing time, number of times audio is repeated, and active annotation of search results. This data differs from the simple click logs relied upon by general systems; it more realistically reflects users' level of interest in specific Mongolian cultural content. Once the module accumulates a sufficient number of high-quality positive feedback samples, the system triggers an incremental update process: fine-tuning the top-level parameters of the Mongolian cross-modal semantic mapping module with an extremely low learning rate, further optimizing the feature representations related to the query in the semantic space. For example, if multiple users show deep interest in images of ancient books related to Genghis Khan, the system will automatically increase the ranking weight of the image modality in similar queries.

[0090] Step 6: Real-time updates to the rule base. Furthermore, the user's error correction behavior regarding the recognition results is also captured by the feedback module. Suppose that in this retrieval, the OCR module misidentifies a variant character in an ancient book image as a standard character, and the user corrects it using the system's error correction function. This correction instruction will immediately trigger an update to the variant character mapping rule library in the Mongolian multimodal preprocessing module, ensuring improved recognition accuracy for subsequent similar characters. General systems typically rely on periodic retraining to correct errors, while this invention achieves rapid rule iteration through a real-time feedback mechanism, making it particularly suitable for the special scenario of Mongolian ancient books, where data is sparse and variant characters are abundant.

[0091] Step 7: Adapting the lightweight model. To meet the deployment needs of grassroots venues in ethnic minority areas, this optimized embodiment also supports a lightweight transformation of the core semantic model. Through knowledge transfer technology, the large pre-trained model enhanced with cultural knowledge is compressed into a lightweight version suitable for edge computing devices. This version retains the original model's core understanding of Mongolian culturally loaded words, but significantly reduces the number of parameters and improves inference speed, enabling real-time retrieval responses even in computing-constrained environments. This expands the application scope of this invention from the cloud to a wider range of grassroots cultural institutions.

[0092] The technical scope of this invention is not limited to the content described above. Those skilled in the art can make various modifications and variations to the above embodiments without departing from the technical concept of this invention, and all such modifications and variations should fall within the protection scope of this invention.

Claims

1. A Mongolian cross-modal content semantic retrieval and intent matching system, characterized in that it comprises: a Mongolian multimodal preprocessing module, used to perform Mongolian-specific normalization and structuring processing on the input raw data, wherein the raw data includes texts in multiple writing systems such as traditional Mongolian, Cyrillic Mongolian, and Tod, images of ancient Mongolian books, and Mongolian dialect speech, and the normalization and structuring processing includes semantic alignment processing that is independent of the writing system of the texts, enhancement and character recognition processing of images of ancient Mongolian books, and transcription and semantic normalization processing of Mongolian dialect speech; a Mongolian cross-modal semantic mapping module, connected to the preprocessing module and used to encode and map multimodal features of text, images, and speech to a unified low-dimensional semantic space; a Mongolian multimodal semantic feature library, connected to the semantic mapping module and used to store and index the semantic feature vectors of all content generated by that module; a dedicated parsing module for Mongolian search intent, used to receive and parse the user's search request and encode it into a query vector; and a cross-modal semantic matching module, connected to the semantic feature library and the intent parsing module respectively and used to calculate and match the similarity between the query vector and the multimodal semantic vectors in the feature library, and to output a list of multimodal retrieval results.

2. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian multimodal preprocessing module includes a multi-writing system representation alignment submodule: the multi-writing system representation alignment submodule is specifically designed to solve the problem that, in cross-modal retrieval scenarios, the same Mongolian semantic entity is scattered in the vector space due to differences in writing systems. The submodule is trained through a contrastive learning mechanism that uses a triplet loss function or a contrastive loss function. Its optimization objective is to reduce the distance, in the shared Mongolian semantic subspace, between the vector representations of text pairs that express the same Mongolian semantic content in different writing forms, while increasing the vector distance between text pairs with different semantic content. The output of the submodule is a Mongolian semantic vector decoupled from the original writing form, which serves as the input of the Mongolian cross-modal semantic mapping module, ensuring that the same Mongolian content in different writing systems has a consistent semantic encoding starting point.
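The triplet loss named in claim 2 can be written out as a small sketch. Here the anchor and positive would be encodings of the same content in two writing systems (for example traditional and Cyrillic Mongolian), and the negative an encoding of different content. The Euclidean distance and the margin value are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the positive is closer than the
    negative by at least `margin`."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Well-aligned case: same content in two scripts lies close together.
aligned = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# Misaligned case: the script variant is farther away than unrelated text.
misaligned = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
```

Minimizing this loss over many triplets pulls script variants of the same content together and pushes unrelated content apart, which is the optimization objective the claim describes.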

3. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian multimodal preprocessing module includes a full-link processing submodule for Mongolian ancient book images: the full-link processing submodule is a sequentially integrated processing unit that addresses the vertical layout, blurring, incompleteness, and lack of punctuation typical of Mongolian ancient books. The processing unit sequentially performs image degradation correction, geometric correction and line segmentation, dedicated OCR recognition, and ancient book text structuring. The processing unit includes a dedicated character recognition model for Mongolian ancient books. The dedicated character recognition model adopts a deep-learning-based sequence recognition architecture; its vocabulary is expanded in a targeted manner based on prior knowledge of the variant characters and phonetic loan characters found in Mongolian ancient books, and it is used to recognize Mongolian ancient book characters containing such variant characters and phonetic loan characters. The processing unit also includes a Mongolian ancient text structuring model based on sequence labeling. The structuring model adopts a bidirectional long short-term memory network or a Transformer architecture and is trained on a dataset of ancient texts annotated with sentence boundaries and word segmentation boundaries. It is used to perform sentence segmentation and word segmentation, in accordance with the grammar of Mongolian ancient texts, on the recognized long unpunctuated text. The image enhancement and semantic enhancement performed by the full-link processing submodule further include the following. Based on the connected strokes and writing direction of Mongolian characters, the geometric correction performs not only global rotation but also local deformation correction: by detecting the skeleton lines of the connected components of the characters and fitting curves to them, a local distortion field is calculated and used to reverse the local character deformation caused by paper wrinkles. In the noise reduction process, a threshold segmentation algorithm adapted to the contrast between Mongolian ink marks and the background is used; specific color channels are enhanced for yellowed and faded ancient book backgrounds and dark ink marks, so that the strokes of Mongolian characters are separated more accurately. The semantic enhancement further includes integrating a visual-text association enhancer for Mongolian ancient book terms into the full-link processing submodule. After OCR recognition, the enhancer performs secondary matching and verification between the image region features of suspected ancient book proper nouns and the text description vectors of the corresponding terms in the Mongolian semantic knowledge base. If the confidence level is lower than a threshold, a context-based glyph review is triggered to improve the recognition accuracy of key culturally loaded words. The ancient book proper nouns include tribal names and ancient place names.
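The channel-based threshold segmentation in claim 3 can be sketched as follows. Otsu's method stands in here for the adapted threshold segmentation algorithm, which the claim does not specify, and the choice of the red channel (index 0) as the one with the strongest ink-to-paper contrast on yellowed pages is likewise an assumption. Pixels are plain RGB tuples with integer values from 0 to 255.

```python
def otsu_threshold(values):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue  # no pixels at or below t yet
        n_light = total - n_dark
        if n_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between_var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def ink_mask(rgb_pixels, channel=0):
    """Mark pixels as ink (True) using one color channel only."""
    values = [px[channel] for px in rgb_pixels]
    t = otsu_threshold(values)
    return [v <= t for v in values]

# Synthetic page: 30 dark ink pixels on 70 yellowed paper pixels.
page = [(30, 25, 20)] * 30 + [(220, 200, 140)] * 70
mask = ink_mask(page)
```

A real pipeline would apply this per region rather than per page, since fading and staining vary across a sheet.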

4. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian multimodal preprocessing module includes a semantic normalization submodule for transcribed Mongolian dialect text: the normalization submodule is specifically designed to eliminate the interference of lexical variation among Mongolian dialects with cross-modal semantic matching, so as to achieve semantic alignment between dialect speech content and the standard Mongolian text library. The Mongolian dialects include the Oirat, Bargut, and Khalkha dialects. The normalization submodule is implemented through a sequence conversion model that adopts an encoder-decoder architecture: the encoder fuses contextual information, and the decoder generates standard written Mongolian. The model is trained on parallel corpora of spoken Mongolian dialect and standard written Mongolian. Using contextual information, it converts transcribed text containing dialect-specific words into text that conforms to the norms of standard written Mongolian.

5. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian cross-modal semantic mapping module adopts a dual-tower contrastive learning architecture, including a content tower and a query tower: the content tower contains three parallel modality-specific encoding branches: a text encoding branch, an image encoding branch, and a speech encoding branch. The text encoding branch uses a pre-trained language model enhanced with Mongolian cultural knowledge, the image encoding branch uses a Vision Transformer architecture, and the speech encoding branch uses an acoustic feature extraction model. The output of each branch is mapped to the same dimension through a linear projection layer, and the projected outputs are then fused through a learnable weighted fusion layer to obtain a content vector; the query tower contains a text encoder with the same structure as the text encoding branch of the content tower, whose input is the user query text and whose output is a query vector; the content tower and the query tower are optimized through contrastive learning during the training phase, so that vectors of different modalities representing the same semantic content are close in the semantic space.
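The projection and fusion steps of the content tower in claim 5 can be sketched as below. Each branch output is projected to a common dimension and the projected vectors are combined with learnable weights. Normalizing the fusion weights with a softmax, and all matrices and dimensions shown, are assumptions for illustration; in a trained system these values would be learned parameters.

```python
import math

def project(vec, weight_rows):
    """Linear projection: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight_rows]

def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(branch_vectors, fusion_logits):
    """Weighted sum of same-dimension branch vectors.

    `fusion_logits` plays the role of the learnable fusion weights.
    """
    weights = softmax(fusion_logits)
    dim = len(branch_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, branch_vectors))
        for i in range(dim)
    ]

# Hypothetical branch outputs with different native dimensions.
text_feat, image_feat, speech_feat = [1.0, 2.0, 0.0], [0.5, 0.5, 0.5, 0.5], [2.0, 1.0]
text_vec = project(text_feat, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
image_vec = project(image_feat, [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
speech_vec = project(speech_feat, [[0.5, 0.0], [0.0, 0.5]])
content_vec = fuse([text_vec, image_vec, speech_vec], [0.0, 0.0, 0.0])
```

With equal logits the fusion reduces to a plain average; training would shift the weights toward the more reliable modalities.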

6. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian cross-modal semantic mapping module includes a pre-trained language model enhanced with Mongolian cultural knowledge: the pre-trained language model is trained using a strategy that includes a Mongolian cultural term masking prediction task. The task randomly masks terms in the input text that belong to a pre-built Mongolian cultural entity library and trains the model to predict the masked entities from contextual information. This injects structured knowledge of Mongolian historical and cultural entities into the model parameters, enabling the model to learn the semantic representation of cultural entities. During training, the semantic mapping module uses a loss function that includes a Mongolian cultural semantic enhancement term. This term identifies Mongolian cultural entities shared between the samples of a pair and strengthens the association of their vectors in the unified semantic space.
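The cultural term masking task in claim 6 can be sketched as a data preparation step: tokens found in the cultural entity library are randomly replaced with a mask token, and the original token becomes the prediction target. Treating each entity as a single token, the `[MASK]` symbol, and the romanized example tokens are all assumptions for illustration.

```python
import random

def mask_cultural_terms(tokens, entity_library, mask_prob=0.5,
                        mask_token="[MASK]", rng=None):
    """Mask tokens that belong to the entity library.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if token in entity_library and rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

# Hypothetical entity library with one epic title.
entities = {"Jangar"}
sentence = ["Jangar", "is", "an", "Oirat", "epic"]
masked, labels = mask_cultural_terms(sentence, entities, mask_prob=1.0)
```

Unlike ordinary masked language modeling, only library entries are candidates for masking, which concentrates the training signal on cultural entities.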

7. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 6, characterized in that the Mongolian cross-modal semantic mapping module is trained with a cross-modal self-supervised contrastive learning strategy, which specifically includes: using the page image of a Mongolian ancient text, the standard transcribed text obtained from it through Mongolian OCR, and the corresponding audio reading, image-text-speech triples derived from the same ancient text are constructed as positive samples for self-supervised training. The strategy employs a multi-stage training approach: in the first stage, a basic Mongolian pre-trained language model is trained with a masked language modeling task on a large-scale Mongolian single-modal corpus; in the second stage, the text encoder is frozen, and the image encoder and speech encoder are trained on the triple data with a contrastive learning task, aligning their output vectors to the corresponding text vectors; in the third stage, all encoders are jointly fine-tuned, and a cross-modal alignment loss function for Mongolian cultural entities is introduced. When computing in-batch negative samples, this loss function imposes a stricter penalty on negative sample pairs that contain the same Mongolian cultural terms, forcing the model to learn deeper Mongolian multimodal semantic associations beyond shallow features.
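The third-stage loss in claim 7 can be sketched as an in-batch contrastive (InfoNCE-style) loss in which negatives that share a cultural term with the anchor count more heavily in the denominator. The claim states only that such negatives are penalized more strictly; the multiplicative `hard_weight`, the temperature, and the set-based entity lookup are assumptions for illustration.

```python
import math

def entity_weighted_infonce(sim, entities, temperature=0.1, hard_weight=2.0):
    """Mean contrastive loss over a batch.

    sim[i][j]: similarity between sample i in one modality and sample j
    in the other; the diagonal holds the positive pairs.
    entities[i]: set of cultural terms found in sample i.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = 0.0
        for j in range(n):
            term = math.exp(sim[i][j] / temperature)
            # Negatives sharing a cultural term are up-weighted.
            if j != i and entities[i] & entities[j]:
                term *= hard_weight
            denom += term
        total += -math.log(math.exp(sim[i][i] / temperature) / denom)
    return total / n

sim = [[0.9, 0.5], [0.4, 0.8]]
shared = entity_weighted_infonce(sim, [{"Jangar"}, {"Jangar"}])
disjoint = entity_weighted_infonce(sim, [{"Jangar"}, {"Geser"}])
```

Setting `hard_weight` to 1.0 recovers the ordinary in-batch contrastive loss, which makes the effect of the entity-aware term easy to isolate.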

8. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that the Mongolian retrieval intent-specific parsing module includes a Mongolian cultural intent classification submodule. The cultural intent classification submodule identifies retrieval intents characteristic of Mongolian culture, which include Mongolian ancient book retrieval, Mongolian epic query, and Mongolian historical figure query. The cultural intent classification submodule uses a text convolutional neural network or a lightweight Transformer and is trained on a Mongolian query dataset labeled with cultural intents. The cross-modal semantic matching module is configured to execute a dynamic differential thresholding strategy, setting an independent similarity filtering threshold for the candidate results of each modality based on the Mongolian cultural intent classification result and on the differences in feature quality and reliability among modalities.
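The dynamic differential thresholding strategy in claim 8 can be sketched as a per-modality filter whose thresholds shift with the classified intent. All threshold values, adjustment amounts, and intent labels below are hypothetical; the claim specifies only that each modality receives its own threshold, derived from the intent and from modality reliability.

```python
# Hypothetical base thresholds reflecting per-modality feature reliability.
BASE_THRESHOLDS = {"text": 0.60, "image": 0.50, "speech": 0.55}

# Hypothetical intent-specific adjustments (negative values relax a threshold).
INTENT_ADJUSTMENTS = {
    "ancient_book_retrieval": {"image": -0.10},
    "epic_query": {"speech": -0.10},
    "historical_figure_query": {"text": -0.05},
}

def filter_candidates(candidates, intent):
    """Keep candidates whose score meets their modality's threshold.

    candidates: iterable of (item_id, modality, similarity score).
    """
    adjust = INTENT_ADJUSTMENTS.get(intent, {})
    kept = []
    for item_id, modality, score in candidates:
        threshold = BASE_THRESHOLDS[modality] + adjust.get(modality, 0.0)
        if score >= threshold:
            kept.append((item_id, modality, score))
    return sorted(kept, key=lambda c: c[2], reverse=True)

candidates = [("img1", "image", 0.45), ("txt1", "text", 0.58), ("aud1", "speech", 0.70)]
book_results = filter_candidates(candidates, "ancient_book_retrieval")
epic_results = filter_candidates(candidates, "epic_query")
```

The same image candidate passes for an ancient book query but is filtered out for an epic query, which is the intent-dependent behavior the claim describes.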

9. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 1, characterized in that it further comprises a Mongolian scene feedback optimization module: the Mongolian scene feedback optimization module is used to collect deep interaction data from Mongolian usage scenarios, including the duration of in-depth browsing of Mongolian ancient book images and annotation behavior on specific Mongolian cultural content. The feedback optimization module is configured to use the deep interaction data as high-quality positive samples to perform small-sample, low-perturbation incremental parameter updates on the Mongolian cross-modal semantic mapping module at a learning rate far lower than that of standard training. The incremental parameter updates employ elastic weight consolidation or knowledge distillation to avoid catastrophic forgetting. The feedback optimization module is also configured to trigger an update of the rule base used by the Mongolian multimodal preprocessing module based on users' direct corrections of variant character recognition and dialect transcription results.
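Elastic weight consolidation, one of the two techniques named in claim 9, can be sketched on a flat list of parameters. The quadratic penalty anchors each parameter to its previous value in proportion to an importance estimate (the diagonal Fisher information). The parameter values, learning rate, and strength below are assumptions for illustration.

```python
def ewc_penalty(params, old_params, fisher, strength=100.0):
    """Quadratic penalty for drifting away from previously learned values."""
    return 0.5 * strength * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )

def incremental_step(params, task_grads, old_params, fisher,
                     lr=1e-5, strength=100.0):
    """One gradient step on (feedback loss + EWC penalty).

    task_grads: gradient of the loss on the new feedback samples.
    The small default lr mirrors the low-perturbation update in the claim.
    """
    return [
        p - lr * (g + strength * f * (p - o))
        for p, g, o, f in zip(params, task_grads, old_params, fisher)
    ]

old = [0.0, 0.0]
fisher = [2.0, 0.1]  # first parameter matters far more to prior knowledge
drifted = [1.0, 0.0]
penalty = ewc_penalty(drifted, old, fisher)
pulled_back = incremental_step(drifted, [0.0, 0.0], old, fisher, lr=0.001)
```

Parameters with high Fisher values are held close to their old values, while unimportant ones remain free to adapt to the feedback samples.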

10. The Mongolian cross-modal content semantic retrieval and intent matching system according to claim 9, characterized in that the pre-trained models used in the Mongolian cross-modal semantic mapping module and the Mongolian retrieval intent-specific parsing module are lightweight Mongolian-specific models obtained through model compression and knowledge transfer. The lightweight model is obtained from a large pre-trained model containing Mongolian cultural knowledge through a knowledge transfer method that includes knowledge distillation, model pruning, or parameter quantization. The knowledge transfer process focuses on enabling the lightweight model to inherit and retain the large model's ability to understand and represent Mongolian culturally loaded words and unique semantic structures. The lightweight model is designed specifically for the characteristics of the Mongolian language and the requirements of the target deployment environment. It has reduced model complexity and fewer parameters, and is optimized through model compression for efficient deployment and real-time inference in computing-constrained environments.
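Parameter quantization, one of the compression options listed in claim 10, can be sketched as symmetric 8-bit quantization of a weight list. The per-tensor scale and the int8 range are assumptions for illustration; the claim does not specify a bit width or scheme.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight stays within about half of one quantization step.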