Method for query enhancement based on hierarchical knowledge base

CN122240868A · Pending · Publication Date: 2026-06-19 · WUHAN TEXTILE UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN TEXTILE UNIV
Filing Date
2026-01-29
Publication Date
2026-06-19


Abstract

This invention discloses a query enhancement method based on a hierarchical knowledge base, relating to the field of image and text retrieval technology. The method mainly includes: encoding a combined query to obtain initial text and image vectors; encoding and normalizing a hierarchical knowledge base containing category, attribute, and style layers to obtain a vector database; calculating similarity to obtain candidate semantic tags; generating semantically enhanced query text using a large language model; and combining the initial text and image vectors through encoding and fusion to obtain an enhanced query representation for combined image retrieval tasks. Implementing the query enhancement method based on a hierarchical knowledge base provided by this invention can enhance the system's semantic understanding, robustness, and generalization ability while maintaining retrieval efficiency.

Description

Technical Field

[0001] This invention relates to the field of image and text retrieval technology, and more specifically, to a query enhancement method based on a hierarchical knowledge base.

Background Technology

[0002] With the rapid development of internet technology and e-commerce, image retrieval technology has become a research hotspot in the field of information retrieval. Traditional image retrieval methods mainly rely on content-based image retrieval (CBIR), which extracts low-level visual features such as color, texture, and shape from images for similarity matching. However, these methods suffer from the "semantic gap" problem: it is difficult to establish an effective mapping between low-level visual features and high-level semantic concepts, so retrieval results often do not match the user's intent. In recent years, with breakthroughs in deep learning, image representation learning methods based on convolutional neural networks (CNNs) and vision transformers have significantly improved the performance of image retrieval, but these methods still mainly focus on single-modal image queries.

[0003] In real-world applications, users often need to express their search intent in more flexible and precise ways. For example, on e-commerce platforms, users might want to find "similar styles to this dress but in brighter colors," or in fashion recommendation systems, they might search for "shoes similar to these but more suitable for formal occasions." This demand has given rise to composed image retrieval technology, which allows users to provide both reference images and text descriptions as search criteria, achieving more accurate retrieval through multimodal information fusion. Composed image retrieval not only requires understanding the visual content of the image but also accurately parsing the modification intent expressed in the text and finding the target image in the search space that satisfies both constraints, thus presenting a greater technical challenge.

[0004] While deep learning-based retrieval methods have achieved significant results on standard benchmarks, they are often highly vulnerable to real-world query variations. User-input queries in practice may contain spelling errors, colloquial expressions, synonyms, word order changes, and other variations. Although these query variants are semantically similar, their representations in the feature space can shift significantly, leading to a sharp decline in retrieval performance. More seriously, current mainstream dense retrieval methods employ a dual-encoder architecture, which lacks cross-modal interaction during the encoding phase and cannot leverage image information to aid text understanding. As a result, query perturbations directly affect the retrieval results.

[0005] This vulnerability stems from two main reasons. First, while pre-trained models learn rich linguistic knowledge, their training corpora are primarily standardized texts, limiting their robustness to non-standard forms. Second, retrieval models often lack sufficient domain knowledge, making it difficult to make reasonable inferences when queries are incomplete or ambiguous. For example, when a user queries "find a more elegant dress," the model may fail to accurately understand the multi-dimensional attribute requirements implied by "elegant," such as material, cut, and color, leading to a discrepancy between the search results and the user's expectations.

[0006] How to overcome the shortcomings of data-driven methods and enhance the semantic understanding and robustness of the system while maintaining retrieval efficiency is an urgent problem to be solved.

Summary of the Invention

[0007] The purpose of this invention is to provide a query enhancement method based on a hierarchical knowledge base, which can enhance the semantic understanding, robustness, and generalization ability of the system while maintaining retrieval efficiency.

[0008] This invention provides a query enhancement method based on a hierarchical knowledge base, comprising the following steps:

[0009] S1: Based on a combined query consisting of a reference image and modified text, obtain an initial text vector and an image vector using a sentence encoder and a pre-trained CLIP model;
S2: Construct a hierarchical knowledge base containing a category layer, an attribute layer, and a style layer; encode and normalize the hierarchical knowledge base to obtain a vector database;
S3: Calculate the similarity between the image vector and the knowledge vectors in the vector database, and return the Top-K semantic units with the highest similarity to obtain candidate semantic labels;
S4: Input the modified text, the candidate semantic labels, and the corresponding standard descriptions from the hierarchical knowledge base into a large language model for semantic enhancement, and generate semantically enhanced query text;
S5: Encode the semantically enhanced query text using the sentence encoder to obtain an enhanced query text vector; encode and fuse the enhanced query text vector, the initial text vector, and the image vector to obtain an enhanced query representation;
S6: Use the enhanced query representation for the combined image retrieval task to improve retrieval accuracy and robustness.

[0010] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the query enhancement method based on a hierarchical knowledge base described above.

[0011] Implementing the query enhancement method based on a hierarchical knowledge base provided by this invention has the following beneficial effects: This invention utilizes a sentence encoder and a pre-trained CLIP model to obtain initial text and image vectors based on a combined query consisting of a reference image and modified text. A hierarchical knowledge base comprising category, attribute, and style layers is constructed, and this knowledge base is encoded and normalized to obtain a vector database. The image vectors are compared with the knowledge vectors in the vector database to calculate similarity, and the Top-K semantic units with the highest similarity are returned to obtain candidate semantic labels. The modified text, candidate semantic labels, and corresponding standard descriptions from the hierarchical knowledge base are input into a large language model for semantic enhancement, generating semantically enhanced query text. The sentence encoder is used to encode the semantically enhanced query text to obtain enhanced query text vectors. These enhanced query text vectors, initial text vectors, and image vectors are encoded and fused to obtain an enhanced query representation. This enhanced query representation is then used for combined image retrieval tasks. This invention leverages the complementarity of the knowledge base and query enhancement, enabling them to work synergistically for optimal performance. By introducing the adversarial training and few-shot learning strategies proposed in this embodiment, the invention overcomes the severe impact of query variants on retrieval performance. The model's performance on query variants is significantly improved, with a much smaller performance drop, effectively helping the model learn more fundamental semantic features rather than superficial word matching. 
In summary, this invention can enhance the semantic understanding, robustness, and generalization ability of the system while maintaining retrieval efficiency, thereby improving the effectiveness of image query retrieval.

Attached Figure Description

[0012] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:
Figure 1 is a flowchart of the query enhancement method based on a hierarchical knowledge base provided by the present invention;
Figure 2 is a performance comparison diagram on three datasets provided by the present invention;
Figure 3 is a schematic diagram of the ablation experiment results on the FashionIQ dataset provided by the present invention.

Detailed Implementation

[0013] To provide a clearer understanding of the technical features, objectives, and effects of the present invention, specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0014] Figure 1 shows a flowchart of the query enhancement method based on a hierarchical knowledge base according to this embodiment. In this embodiment, the query enhancement method based on a hierarchical knowledge base includes the following steps:
S1: Based on a combined query consisting of a reference image and modified text, obtain an initial text vector and an image vector using a sentence encoder and a pre-trained CLIP model;
In an exemplary embodiment, step S1 specifically includes: encoding the modified text in the combined query using the sentence encoder to obtain the initial text vector; and extracting features from the reference image in the combined query using the image encoder of the pre-trained CLIP model to obtain the image vector. In one exemplary embodiment, the sentence encoder is a Sentence-BERT model, which adds a Siamese network structure and a contrastive learning objective on top of the BERT model. By minimizing the distance between similar sentence pairs and maximizing the distance between dissimilar sentence pairs, it learns high-quality sentence embeddings and encodes the user-input query statement into a vector.

[0015] In one exemplary embodiment, the training method of the pre-trained CLIP model includes: training the CLIP model using a loss function based on query variants to obtain the pre-trained CLIP model. In one exemplary embodiment, the loss function is calculated as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{ret} + \lambda \, \mathcal{L}_{cons}$$

[0016] where $\mathcal{L}_{total}$ is the total loss function; $\mathcal{L}_{ret}$ is the standard retrieval loss; $\mathcal{L}_{cons}$ is the consistency loss between the features of the original query and the variant query; and $\lambda$ is the balance coefficient.

[0017] As an exemplary embodiment, the consistency loss between the original query and variant query features is typically expressed as a mean squared error or a cosine distance; adjusting the value of $\lambda$ controls the intensity of adversarial training.

[0018] As an example, the CLIP (Contrastive Language-Image Pre-training) model is a large-scale vision-language pre-training model proposed by OpenAI. It learns the aligned representation of vision and language by training on 400 million image-text pairs through contrastive learning.

[0019] S2: Construct a hierarchical knowledge base containing a category layer, an attribute layer, and a style layer; encode and normalize the hierarchical knowledge base to obtain a vector database. In one exemplary embodiment, the category layer is used to describe the clothing type, the attribute layer is used to describe at least one of color, material, length, sleeve type, and collar type, and the style layer is used to describe the overall aesthetic style. In one exemplary embodiment, the calculation formula for encoding and normalizing the hierarchical knowledge base is as follows:

$$k_i = \mathrm{Norm}\big(\mathrm{SBERT}(t_i)\big), \quad t_i \in \mathcal{K}$$

where $k_i$ is the encoded knowledge vector; $\mathrm{Norm}(\cdot)$ denotes L2 normalization; $\mathrm{SBERT}(\cdot)$ is the Sentence-BERT encoder; $t_i$ is the $i$-th knowledge text; and $\mathcal{K}$ is the knowledge base set.

[0020] S3: Calculate the similarity between the image vector and the knowledge vectors in the vector database, and return the Top-K semantic units with the highest similarity to obtain candidate semantic labels. In one exemplary embodiment, the formulas for calculating the similarity and selecting the Top-K units are:

$$\mathrm{sim}(q, k_i) = \frac{q \cdot k_i}{\lVert q \rVert \, \lVert k_i \rVert}, \qquad \mathrm{TopK}(q) = \operatorname*{arg\,max}^{(K)}_{k_i \in \mathcal{K}} \ \mathrm{sim}(q, k_i)$$

where $\mathrm{sim}(q, k_i)$ is the similarity between the query vector $q$ (here, the image vector) and the knowledge vector $k_i$; $\lVert \cdot \rVert$ is the L2 norm; $\mathrm{TopK}(q)$ is the set of $K$ semantic units with the highest similarity; and $\operatorname{arg\,max}$ over $\mathcal{K}$ denotes finding, in the knowledge base set, the arguments that maximize the function.
S4: Input the modified text, the candidate semantic labels, and the corresponding standard descriptions from the hierarchical knowledge base into the large language model for semantic enhancement, and generate semantically enhanced query text. In one exemplary embodiment, step S4 further includes: when generation by the large language model fails or the model is unavailable, generating the semantically enhanced query text using a traditional query reconstruction method based on keyword concatenation;
S5: Encode the semantically enhanced query text using the sentence encoder to obtain an enhanced query text vector, and then encode and fuse the enhanced query text vector, the initial text vector, and the image vector to obtain an enhanced query representation;
S6: Use the enhanced query representation for the combined image retrieval task to improve retrieval accuracy and robustness.

[0021] In some embodiments, the query enhancement method based on hierarchical knowledge base described above can also be implemented in the following ways.

[0022] In this embodiment, the query enhancement method based on a hierarchical knowledge base mainly consists of three stages: query understanding, feature encoding, and retrieval ranking, as detailed below:

Step 1. User input (simple query + image). For example, a user submits a picture together with the text "Does this dress suit me?". Text encoding: the user text "Does this dress suit me?" is encoded using Sentence-BERT to obtain the initial text vector q_text. Image encoding: the CLIP image encoder extracts features from the input dress image to obtain the image vector q_image.

Step 2. First-stage query. Leveraging CLIP's cross-modal capabilities, the image features are matched against the text vectors of the three-layer knowledge base to obtain the Top-K semantic units, including: Category layer: "dress / skirt / A-line skirt"; Attribute layer: "length: mid-length", "material: chiffon", "color: light"; Style layer: "gentle style / commuter style / French style".

Step 3. Query enhancement. The following information is assembled into a structured input for the LLM (Qwen2.5-1.5B): 1) the user's original text; 2) the hierarchical candidate semantic labels identified from the image; 3) the corresponding standard descriptions in the knowledge base. For example: User input: "Does this dress suit me?"; Image semantics: Category: Dress; Attributes: Mid-length, Light color, Lightweight fabric; Style: Gentle, Everyday commuting; Output query text: "Gentle mid-length light-colored dress, suitable for everyday commuting and casual occasions, with a simple and elegant style".

Step 4. Re-encoding and fusion. The enhanced text generated by the LLM is re-encoded using Sentence-BERT. Encoding fusion: original text vector + enhanced text vector + image vector.

Step 5. Second-stage retrieval. The enhanced query representation is fed to CLIP to obtain q_text_enhance, and the image is fed to CLIP to obtain q_img; a projection layer aligns the dimensions to a multimodal space consistent with the offline database; similarity is then calculated to return the text and image information relevant to the user.
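The structured input of Step 3 and the fusion of Step 4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the prompt layout, the function names `build_enhancement_prompt` and `fuse`, and in particular the fusion rule (a weighted element-wise sum followed by L2 normalization, with all vectors assumed to share one dimension) are assumptions, since the text only states that the three vectors are fused.

```python
import math


def build_enhancement_prompt(user_text, labels, descriptions):
    """Assemble the three inputs of Step 3 into one structured prompt string.

    labels: dict with keys "category", "attributes", "style" (hypothetical layout).
    descriptions: standard descriptions taken from the knowledge base.
    """
    lines = [
        "User input: " + user_text,
        "Image semantics:",
        "  Category: " + ", ".join(labels["category"]),
        "  Attributes: " + ", ".join(labels["attributes"]),
        "  Style: " + ", ".join(labels["style"]),
        "Standard descriptions:",
    ]
    lines += ["  - " + d for d in descriptions]
    lines.append("Output: one enhanced query text that keeps the user's intent.")
    return "\n".join(lines)


def fuse(vectors, weights=None):
    """Assumed fusion rule: weighted element-wise sum, then L2 normalization."""
    weights = weights or [1.0] * len(vectors)
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    fused = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0
    return [x / norm for x in fused]


prompt = build_enhancement_prompt(
    "Does this dress suit me?",
    {"category": ["Dress"],
     "attributes": ["Mid-length", "Light color", "Lightweight fabric"],
     "style": ["Gentle", "Everyday commuting"]},
    ["a one-piece women's garment, typically with a defined waistline and skirt"],
)
```

In a real system the vectors passed to `fuse` would first have to be projected to a common dimension, which is the role the text assigns to the projection layer.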

[0023] It should be noted that the knowledge in all three layers of the knowledge base (category layer, attribute layer, and style layer) is converted from text into dense vectors using a pre-trained Sentence-BERT model. Sentence-BERT, as a sentence encoder, adds a Siamese network structure and a contrastive learning objective on top of BERT. It learns high-quality sentence embeddings by minimizing the distance between similar sentence pairs and maximizing the distance between dissimilar sentence pairs, and is also used to encode user-input queries into vectors. This embodiment uses Qwen2.5-1.5B as the large language model for intelligent query expansion.

[0024] In some embodiments, the query enhancement method based on hierarchical knowledge base described above can also be implemented in the following ways.

[0025] The combined image retrieval task aims to retrieve the target image that best matches the user's intent from a set of candidate images based on a combined query of a reference image and modified text. Formally, given a reference image $I_r$ and modified text $T_m$, a combined query can be represented as:

$$Q = f(I_r, T_m)$$

[0026] where $f$ is a combination function responsible for fusing visual and linguistic information. The retrieval goal is to find the image $I$ in the candidate image set $C$ that is most similar to the combined query, that is, the image that maximizes the similarity function $\mathrm{sim}(Q, I)$. Unlike unimodal retrieval, combined image retrieval requires simultaneously understanding visual content and textual modification intent, and accurately locating the target that satisfies both constraints in the semantic space. In the fashion domain, modified text typically expresses attribute changes relative to a reference image, such as "more vibrant colors," "more formal styles," or "shorter sleeves." This relative modification requires the model not only to understand absolute attributes but also to model the relative relationships between attributes.

[0027] To address the aforementioned issues, this embodiment proposes a query enhancement method based on a hierarchical knowledge base, which compensates for the shortcomings of data-driven methods by introducing domain knowledge. The core idea includes three aspects: First, constructing a domain knowledge base with a three-layer structure of category, attribute, and style to systematically organize professional knowledge in the fashion field; second, designing an intelligent query enhancement strategy, prioritizing the use of a large language model for semantic expansion while retaining traditional keyword extraction and knowledge fusion as backup solutions; third, introducing a robustness enhancement mechanism to enable the model to adapt to various query variations. This method significantly enhances the system's semantic understanding ability and robustness while maintaining retrieval efficiency.

1. Construction of a hierarchical knowledge base

(1) Category-level knowledge

[0028] Category-level knowledge defines the basic types of clothing and their feature descriptions, providing top-level semantic constraints for retrieval. For the three datasets involved in this embodiment, category-level knowledge covers major categories such as dresses, shirts, tops and tees (toptee), jackets, pants, skirts, and shoes. Each category is associated with a set of descriptive texts explaining its definition, common features, and typical uses. For example, the knowledge description for dresses includes "a one-piece women's garment, typically with a defined waistline and skirt" and "suitable occasions include formal occasions, casual parties, and everyday wear." The construction of category-level knowledge follows the principles of coverage, conciseness, and distinguishability, with each category description limited to 3-5 core statements that emphasize the differences between categories.

(2) Attribute layer knowledge

[0029] The attribute layer describes the specific features of clothing and is the core layer of the knowledge base. This layer includes fine-grained attributes such as color, material, fit, length, sleeve type, and collar type. Taking color as an example, the knowledge base includes 15 common colors: black, white, red, blue, pink, gray, brown, green, yellow, purple, orange, dark blue, beige, cream, and silver, along with their characteristic descriptions. Material attributes cover cotton, silk, chiffon, satin, denim, leather, wool, etc., with each material associated with its texture, seasonal suitability, and care requirements. Fit attributes include tight, loose, fitted, casual, slim, and oversized, describing the degree of fit of the clothing. Length attributes vary depending on the category; for example, dress lengths include short, mid-length, long, mini, knee-length, and ankle-length. Sleeve type attributes include sleeveless, short-sleeved, long-sleeved, three-quarter sleeve, and puff sleeve. This fine-grained attribute knowledge provides rich semantic resources for understanding and generating attribute-related query descriptions.

(3) Style layer knowledge

[0030] Style knowledge describes the overall aesthetic style of clothing and its suitability for various occasions, representing a higher level of semantic abstraction. Major style types include formal, casual, elegant, fashionable, retro, modern, and classic. Each style is associated with multiple descriptive knowledge points. For example, formal style knowledge includes "suitable for business and special occasions," "typically uses simple lines and minimal embellishments," and "colors tend towards neutral tones such as black, dark blue, and gray." Casual style knowledge includes "suitable for everyday activities," "emphasizes comfort and a relaxed fit," and "styles include jeans, T-shirts, and sneakers." Style knowledge plays an integrative role in query enhancement, associating multiple attribute features with a unified style theme and achieving multi-level semantic modeling from concrete to abstract and from partial to holistic.

(4) Knowledge encoding and indexing

[0031] Knowledge encoding transforms textual knowledge into dense vector representations, supporting efficient semantic similarity retrieval. This embodiment employs the Sentence-BERT model to encode all knowledge entries. This model, optimized through a Siamese network structure and contrastive learning, generates high-quality sentence embeddings. The encoding process is as follows: each knowledge text in the knowledge base is input into Sentence-BERT, and the output representation at the [CLS] token is used as the sentence vector. Then, L2 normalization is applied to all knowledge vectors. The encoding formula can be expressed as:

$$k_i = \mathrm{Norm}\big(\mathrm{SBERT}(t_i)\big), \quad t_i \in \mathcal{K}$$

[0032] where $t_i$ is the $i$-th knowledge text; $\mathcal{K}$ is the knowledge base set; $\mathrm{SBERT}(\cdot)$ is the Sentence-BERT encoder; $\mathrm{Norm}(\cdot)$ denotes L2 normalization; and $k_i$ is the encoded knowledge vector. After encoding, all knowledge vectors are stored in a vector database, supporting fast nearest neighbor search. The knowledge base used in this embodiment contains approximately 200 knowledge entries, covering the main concepts and attribute descriptions in the fashion field, with an encoding dimension of 768.
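The encode-then-normalize step described above can be sketched in a few lines. The sketch assumes a pluggable `encoder` callable; `toy_encoder` is a hypothetical, non-semantic stand-in (a hash-based pseudo-embedding) used only so the example runs without a model download, and would be replaced by a real Sentence-BERT encoder producing 768-dimensional vectors. The entry layout (`layer`, `text`, `vector`) is likewise an assumption.

```python
import hashlib
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encoder(text, dim=8):
    """Stand-in for Sentence-BERT: deterministic pseudo-embedding, NOT semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(dim)]


def build_vector_db(knowledge, encoder=toy_encoder):
    """Encode and normalize every (layer, text) knowledge entry."""
    return [
        {"layer": layer, "text": text, "vector": l2_normalize(encoder(text))}
        for layer, text in knowledge
    ]


knowledge = [
    ("category", "dress: a one-piece women's garment"),
    ("attribute", "chiffon: a light, sheer fabric"),
    ("style", "formal: suitable for business and special occasions"),
]
vector_db = build_vector_db(knowledge)
```

Because every stored vector has unit norm, the cosine similarity used at retrieval time reduces to a plain dot product against the stored vectors.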

2. Intelligent Query Enhancement Strategy

(1) Keyword extraction

[0033] Keyword extraction is the first step in query understanding, aiming to identify key attribute words from the original query. This embodiment constructs a keyword dictionary containing seven categories: color, sleeve type, length, style, material, pattern, and fit, with 10-20 common words in each category. Keyword extraction employs an exact matching strategy. After converting the query text to lowercase, all keywords in the dictionary are traversed, and successfully matched words and their categories are recorded. In addition, comparative words, including patterns such as "more," "less," and the "-er" suffix, are extracted. These comparative words indicate that the query expresses relative modifications rather than absolute attributes, which is particularly significant for combined image retrieval tasks. The results of keyword extraction are used to guide the direction of knowledge retrieval and assist in query reconstruction.
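A minimal sketch of the exact-match keyword extraction described above. The dictionary contents here are a small illustrative subset (the text specifies 10-20 words per category), and the whole-word matching rule, the "-er" heuristic, and the `NOT_COMPARATIVE` exclusion list are assumptions added to make the example behave sensibly.

```python
import re

# Illustrative subset of the seven-category keyword dictionary.
KEYWORDS = {
    "color": ["black", "white", "red", "blue", "pink"],
    "sleeve": ["sleeveless", "short-sleeved", "long-sleeved"],
    "length": ["short", "long", "mini", "knee-length"],
    "style": ["formal", "casual", "elegant"],
    "material": ["cotton", "silk", "chiffon", "denim"],
    "pattern": ["striped", "floral", "plaid"],
    "fit": ["tight", "loose", "slim", "oversized"],
}

# Hypothetical exclusion list: "-er" words that are not comparatives.
NOT_COMPARATIVE = {"leather", "shoulder", "sweater", "flower", "zipper",
                   "blazer", "other", "under"}


def extract_keywords(query):
    """Return matched (category, keyword) pairs and comparative words."""
    text = query.lower()
    matches = []
    for category, words in KEYWORDS.items():
        for word in words:
            # Whole-word match; a hyphen counts as part of a word, so
            # "short" does not fire inside "short-sleeved" or "shorter".
            pattern = r"(?<![\w-])" + re.escape(word) + r"(?![\w-])"
            if re.search(pattern, text):
                matches.append((category, word))
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text)
    comparatives = [
        t for t in tokens
        if t in ("more", "less")
        or (len(t) > 4 and t.endswith("er") and t not in NOT_COMPARATIVE)
    ]
    return {"keywords": matches, "comparatives": comparatives}


result = extract_keywords("Shorter sleeves and a more Elegant red dress")
```

A suffix test is only a rough signal for comparatives; a part-of-speech tagger would be more reliable but is outside what the text describes.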

(2) Knowledge retrieval methods

[0034] Knowledge retrieval retrieves relevant knowledge entries from the knowledge base based on the query text, providing semantic supplementation for query enhancement. The retrieval process is as follows: first, the query text is encoded using Sentence-BERT to obtain the query vector $q$; then, the cosine similarity between the query vector and all knowledge vectors is calculated; finally, the Top-K knowledge items with the highest similarity (K is set to 5) are returned. The similarity calculation and retrieval formulas are:

$$\mathrm{sim}(q, k_i) = \frac{q \cdot k_i}{\lVert q \rVert \, \lVert k_i \rVert}, \qquad \mathrm{TopK}(q) = \operatorname*{arg\,max}^{(K)}_{k_i \in \mathcal{K}} \ \mathrm{sim}(q, k_i)$$

[0035] The retrieved knowledge is sorted by similarity, and this similarity score is used as a weight for subsequent fusion. To improve retrieval accuracy, this embodiment also introduces a hierarchical retrieval strategy: first, the knowledge level (category level, attribute level, or style level) is determined based on the extracted keywords, and then retrieval is performed within the corresponding level. Compared to global retrieval, this strategy can more accurately locate relevant knowledge while reducing the retrieval space and improving efficiency.
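The Top-K retrieval and the layer-restricted (hierarchical) variant can be sketched as below. The entry layout (`layer`, `text`, `vector`), the function names, and the toy two-dimensional vectors are assumptions for illustration; real entries would carry Sentence-BERT embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query_vec, vector_db, k=5, layer=None):
    """Return the k most similar entries as (similarity, text) pairs.

    If `layer` is given, only entries of that knowledge layer are searched,
    which mirrors the hierarchical retrieval strategy.
    """
    candidates = [e for e in vector_db if layer is None or e["layer"] == layer]
    scored = [(cosine(query_vec, e["vector"]), e["text"]) for e in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D vectors stand in for real embeddings.
toy_db = [
    {"layer": "attribute", "text": "red", "vector": [1.0, 0.0]},
    {"layer": "attribute", "text": "blue", "vector": [0.0, 1.0]},
    {"layer": "style", "text": "formal", "vector": [1.0, 1.0]},
]
global_hits = retrieve_top_k([1.0, 0.1], toy_db, k=2)
attribute_hits = retrieve_top_k([1.0, 0.1], toy_db, k=2, layer="attribute")
```

The returned similarity scores can be kept alongside the texts, since the description uses them as weights in the subsequent fusion. With only about 200 entries, a linear scan like this is adequate; an approximate nearest-neighbor index matters only for much larger knowledge bases.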

(3) LLM-based query enhancement

[0036] Large language models, with their powerful language understanding and generation capabilities, offer more intelligent solutions for query enhancement. This embodiment uses a lightweight LLM as the query enhancement module. The core idea is to use the original query, category information, and retrieved knowledge as context to prompt the LLM to generate enhanced queries. The prompt template design follows the principle of simplicity, requiring the LLM to add 1-3 relevant keywords while maintaining the original semantics. The LLM generation parameters are set as follows: maximum number of newly generated tokens 50, temperature coefficient 0.3, and a greedy decoding strategy. To ensure the quality of enhancement, length constraints and validity checks are set: the length of the enhanced query cannot exceed twice that of the original query, and the enhanced query must contain the core keywords of the original query. If the generated result does not meet the constraints, the system falls back to the traditional method.
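The validity checks and the fallback path can be sketched as a thin wrapper around any generator. Here `llm_generate` and `fallback` are hypothetical callables supplied by the caller, and measuring length in whitespace-separated words is an assumption, because the text does not say whether length is counted in words, characters, or tokens.

```python
def is_valid_enhancement(original, enhanced, core_keywords):
    """Apply the two constraints: length at most 2x, core keywords preserved."""
    if not enhanced or not enhanced.strip():
        return False
    # Assumption: length is measured in whitespace-separated words.
    if len(enhanced.split()) > 2 * len(original.split()):
        return False
    lowered = enhanced.lower()
    return all(k.lower() in lowered for k in core_keywords)


def enhance_query(original, core_keywords, llm_generate, fallback):
    """Try the LLM first; fall back when it fails or violates a constraint."""
    try:
        candidate = llm_generate(original)
    except Exception:
        return fallback(original)  # LLM unavailable or generation failed
    if is_valid_enhancement(original, candidate, core_keywords):
        return candidate.strip()
    return fallback(original)


def keyword_fallback(query):
    """Placeholder for the traditional keyword-concatenation method."""
    return query + " elegant"


accepted = enhance_query(
    "red dress longer sleeves",
    ["red", "dress"],
    lambda q: "red dress longer sleeves elegant formal",
    keyword_fallback,
)
```

Keeping the checks separate from the generator makes the same guard reusable regardless of which model produces the candidate text.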

(4) Query Reconstruction Algorithm

[0037] The traditional query reconstruction method serves as a backup for LLM enhancement and is activated when the LLM is unavailable or generation fails. The reconstruction algorithm adheres to the principle of concise enhancement, avoiding the introduction of redundant information. The specific steps are as follows: first, extract keywords from the top-5 retrieved knowledge items, with a maximum of two keywords extracted from each item; second, filter out keywords already present in the original query, retaining only new information; third, sort by knowledge similarity and select the top three keywords as enhancement terms; fourth, append the enhancement terms to the end of the original query to form the enhanced query. The reconstruction process can be represented as:

$$q' = q \oplus \{w_1, w_2, \ldots, w_m\}$$

[0038] where $q$ is the original query, $\oplus$ denotes the concatenation operation, $w_1, \ldots, w_m$ are the extracted keywords, and $q'$ is the enhanced query. To keep the enhancement reasonable, a length protection mechanism is also set: if the length of the enhanced query exceeds twice that of the original query, only the most important keyword is added. Although the traditional method is not as flexible as the LLM, it has the advantages of high stability and low computational cost.
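The four reconstruction steps and the length protection rule can be sketched as follows. The input format is an assumption: each retrieved knowledge item is given as a `(similarity, keywords)` pair whose keywords have already been extracted, since the text does not specify how keywords are pulled out of a knowledge description. Length is again counted in words.

```python
def reconstruct_query(query, knowledge_items, per_item=2, max_terms=3):
    """Keyword-concatenation fallback for query enhancement.

    knowledge_items: list of (similarity, [keywords]) pairs (assumed format).
    """
    present = set(query.lower().split())
    # Rank by similarity and keep the top-5 knowledge items.
    ranked = sorted(knowledge_items, key=lambda item: item[0], reverse=True)[:5]
    candidates = []
    for _, keywords in ranked:
        for word in keywords[:per_item]:  # at most two keywords per item
            w = word.lower()
            if w not in present and w not in candidates:  # keep only new terms
                candidates.append(w)
    terms = candidates[:max_terms]  # top three, already in similarity order
    enhanced = " ".join([query] + terms)
    # Length protection: fall back to the single most important keyword.
    if terms and len(enhanced.split()) > 2 * len(query.split()):
        enhanced = query + " " + terms[0]
    return enhanced


items = [
    (0.9, ["elegant", "red"]),
    (0.8, ["chiffon", "formal"]),
    (0.5, ["silk"]),
]
short_result = reconstruct_query("red dress", items)
long_result = reconstruct_query("a red dress with longer sleeves", items)
```

For the two-word query the three candidate terms would more than double the length, so only the top-ranked term is appended; the six-word query has room for all three.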

3. Enhanced query robustness

(1) Analysis of query variant problems

[0039] In real-world applications, user-input queries are often not standardized text but contain various variations. These query variations can be categorized as follows: first, spelling errors, such as misspelling "stop" as "stpp"; second, non-standard punctuation, such as randomly inserted commas or periods; third, grammatical simplification, such as omitting stop words; fourth, word order changes, such as writing "drinking stop" instead of "stop drinking"; fifth, tense changes, such as changing "drink" to "drinking"; and sixth, synonym substitution, such as replacing "consume" with "drink". While these variations may be semantically similar to the original query, they can introduce significant shifts in the feature space, leading to a decrease in the quality of search results.

[0040] To systematically evaluate the model's adaptability to query variants, this embodiment designs seven query variant generation methods: MISSPELL (spelling errors), EXTRAPUNC (extra punctuation), NOSTOPWORD (stop word removal), SWAPWORDS (word order transformation), TRANSTENSE (tense transformation), SWAPSYN-GLOVE (GloVe-based synonym replacement), and SWAPSYN-WNET (WordNet-based synonym replacement). These variant generation methods cover the most common query quality issues in real-world scenarios, providing a foundation for robust training and evaluation of the model.
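Four of the seven variant generators can be sketched with the standard library alone. TRANSTENSE, SWAPSYN-GLOVE, and SWAPSYN-WNET are omitted because they need external linguistic resources. The specific perturbation rules (dropping one interior character, appending one punctuation mark, swapping one adjacent word pair) and the stop word list are assumptions; the text names the methods but does not define them at this level of detail.

```python
import random

# Small illustrative stop word list (assumption).
STOPWORDS = {"a", "an", "the", "is", "are", "with", "and", "of", "to", "in"}


def misspell(words, rng):
    """MISSPELL: delete one interior character from one word longer than 3."""
    out = list(words)
    long_idx = [i for i, w in enumerate(out) if len(w) > 3]
    if long_idx:
        i = rng.choice(long_idx)
        p = rng.randrange(1, len(out[i]) - 1)
        out[i] = out[i][:p] + out[i][p + 1:]
    return out


def extra_punc(words, rng):
    """EXTRAPUNC: append a comma or period to one randomly chosen word."""
    out = list(words)
    if out:
        i = rng.randrange(len(out))
        out[i] += rng.choice(",.")
    return out


def no_stopword(words, rng):
    """NOSTOPWORD: remove stop words (keep the query if nothing would remain)."""
    kept = [w for w in words if w.lower() not in STOPWORDS]
    return kept or list(words)


def swap_words(words, rng):
    """SWAPWORDS: swap one pair of adjacent words."""
    out = list(words)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


VARIANTS = {
    "MISSPELL": misspell,
    "EXTRAPUNC": extra_punc,
    "NOSTOPWORD": no_stopword,
    "SWAPWORDS": swap_words,
}


def make_variant(query, method, seed=0):
    """Generate one reproducible variant of `query` with the named method."""
    rng = random.Random(seed)
    return " ".join(VARIANTS[method](query.split(), rng))


query = "find the dress with longer sleeves"
variants = {name: make_variant(query, name) for name in VARIANTS}
```

Seeding the generator per call keeps the variants reproducible, which helps when the same perturbed set is reused for both training and evaluation.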

(2) Robust training based on adversarial thinking

[0041] To enhance the model's adaptability to query variants, this embodiment introduces the concept of adversarial learning during training. The core idea of adversarial learning is to force the model to learn more robust feature representations by adding perturbations to the input data. Specifically, during the training phase, slight perturbations are added to the query text of a subset of samples to generate query variants. The model is then required to produce consistent outputs on both the original and the perturbed queries. This consistency constraint prevents the model from overfitting to standardized query representations and instead leads it to learn more fundamental semantic features.

[0042] At the implementation level, adversarial training employs the following strategies: first, a subset of samples (approximately 30%) is randomly selected from the training set, and one of the seven variant generation methods mentioned above is applied to their query text; second, the original query and the variant query are encoded separately, resulting in two feature vectors; third, a consistency loss term is added to the standard retrieval loss to encourage the two vectors to remain close in semantic space. This adversarial approach exposes the model to diverse query formats during training, thereby improving its generalization ability. The loss function can be expressed as a weighted sum of the standard retrieval loss and the adversarial consistency loss:

$$\mathcal{L}_{total} = \mathcal{L}_{ret} + \lambda \, \mathcal{L}_{cons}$$

[0043] where $\mathcal{L}_{ret}$ is the standard retrieval loss, $\mathcal{L}_{cons}$ is the consistency loss between the features of the original query and the variant query, typically expressed as a mean squared error or a cosine distance, and $\lambda$ is the balance coefficient. Adjusting the value of $\lambda$ controls the intensity of adversarial training. In the experiments, setting $\lambda$ between 0.3 and 0.5 preserves standard retrieval performance while enhancing robustness.
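A framework-free sketch of the weighted loss and of the roughly 30% sample selection. It works on plain Python lists so the arithmetic is visible; an actual training loop would compute the same quantities on tensors with automatic differentiation. The function names, the default balance coefficient of 0.4 (chosen from the stated 0.3 to 0.5 range), and the seeded sampling are assumptions.

```python
import math
import random


def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def total_loss(retrieval_loss, orig_feat, variant_feat, lam=0.4, kind="mse"):
    """Standard retrieval loss plus lam times the consistency loss."""
    if kind == "mse":
        consistency = mse(orig_feat, variant_feat)
    else:
        consistency = cosine_distance(orig_feat, variant_feat)
    return retrieval_loss + lam * consistency


def pick_perturbed_indices(n_samples, ratio=0.3, seed=0):
    """Choose which training samples receive a query variant."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_samples), round(n_samples * ratio)))


same = total_loss(1.0, [1.0, 0.0], [1.0, 0.0])
shifted = total_loss(1.0, [1.0, 0.0], [0.0, 1.0], lam=0.5, kind="cosine")
chosen = pick_perturbed_indices(10)
```

When the variant features equal the original features the consistency term vanishes and the total reduces to the retrieval loss, which is the behavior the constraint is meant to reward.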

(3) Few-shot learning strategy

[0044] In practical applications, acquiring a large amount of high-quality annotated query variant data is often costly. To address this data scarcity problem, this embodiment incorporates the idea of few-shot learning into its training strategy. The core concept of few-shot learning is to enable the model to quickly adapt to new tasks or data distributions from a small number of samples through meta-learning or transfer learning. Specifically, this embodiment adopts the following strategies: first, in the pre-training phase, the model learns basic retrieval capabilities on a large-scale standard query dataset; second, in the fine-tuning phase, a small number of manually annotated query variant samples (approximately 400 samples per dataset) are introduced so that the model quickly adapts to variant patterns through adversarial training; third, in the inference phase, query enhancement strategies are used to correct potentially problematic queries into more standard forms.

[0045] The advantage of this few-shot learning strategy is that it eliminates the need to collect large amounts of training data for each variant type; a small number of representative samples are sufficient to effectively improve the model's generalization ability. In practice, this embodiment extracts approximately 400 samples from each of the FashionIQ, Fashion200K, and Shoes datasets for variant annotation. Compared to completely re-annotating the datasets, the annotation cost is reduced by an order of magnitude. This efficient training strategy makes the robustness enhancement method highly practical.

[0046] 4. Experimental verification. (1) Experimental setup. To verify the effectiveness of the hierarchical knowledge base and query augmentation strategy, systematic experiments were conducted on three datasets (FashionIQ, Fashion200K, and Shoes). The experiments used CLIP as the base encoder, and the evaluation metrics were Recall@K and nDCG@K. Recall@K measures whether the target image appears in the top K retrieval results, while nDCG@K also accounts for ranking position, giving higher weight to correct results that appear earlier. The experiments were run on a single RTX 3090 GPU, with each setting repeated three times and the results averaged to ensure stability.
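The two metrics can be sketched as follows. This is an illustrative version that assumes each query has exactly one target image (binary relevance), in which case the ideal DCG is 1 and nDCG@K reduces to 1 / log2(rank + 1); the function names are hypothetical and the source does not specify its exact evaluation code.

```python
import math


def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def ndcg_at_k(ranked_ids, target_id, k):
    """nDCG@k for a single relevant item (assumed binary relevance).

    With one relevant item the ideal DCG is 1 (target at rank 1), so the
    score is 1 / log2(rank + 1) when the target is within the top k.
    """
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == target_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_metric(metric, runs, k):
    """Average a per-query metric over (ranked_ids, target_id) pairs."""
    return sum(metric(ranked, target, k) for ranked, target in runs) / len(runs)
```

A target ranked third scores 1.0 under Recall@10 but only 0.5 under nDCG@10, which is the position sensitivity the paragraph above describes.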

[0047] (2) Analysis of main results. The hierarchical knowledge augmentation method proposed in this embodiment improves performance on all three datasets. On the FashionIQ dataset, the CLIP baseline achieves a Recall@10 of 16.93% and the method of this embodiment achieves 21.57%, an absolute improvement of 4.64 percentage points and a relative improvement of approximately 27.4%. On the Fashion200K dataset, Recall@10 increases from the baseline of 11.15% to 14.67%, an improvement of 3.52 percentage points. On the Shoes dataset, Recall@10 increases from 14.13% to 18.54%, an improvement of 4.41 percentage points. These results validate the effectiveness and generalization ability of the hierarchical knowledge augmentation method across different datasets and categories.

[0048] Figure 2 shows the performance comparison across the three datasets.

[0049] Further ablation experiments analyzed the individual contributions of knowledge base augmentation and query augmentation. Three configurations were designed: knowledge base augmentation only, query augmentation only, and the complete method (combining both). On the FashionIQ dataset, the knowledge base-only configuration improved performance by approximately 2.67 percentage points over the baseline (a relative improvement of 14.12%), the query augmentation-only configuration improved performance by approximately 2.33 percentage points (a relative improvement of 12.13%), and the complete method improved performance by 4.64 percentage points. The combined gain exceeds that of either component alone, which indicates that knowledge base augmentation and query augmentation are complementary and that their combination achieves the best performance.

[0050] Figure 3 shows the results of the ablation experiment on the FashionIQ dataset.

[0051] (3) Robustness assessment on query variants. To evaluate the model's robustness to query variants, the model was tested on seven types of query variants. The experiments compared the performance of the standard BERT-Base and RoBERTa-Base models on original and variant queries. Without robustness enhancement, the BERT-Base model showed an average decrease of 27.69% in nDCG@10 across the seven variants, while RoBERTa-Base showed a decrease of 38.41%, confirming the significant impact of query variants on retrieval performance. Introducing the adversarial training and few-shot learning strategies proposed in this embodiment significantly improved performance on query variants and greatly reduced the performance degradation, demonstrating the effectiveness of the robustness enhancement strategy.

[0052] Different types of variants have varying degrees of impact on the model. Spelling errors and synonym substitutions have the greatest impact on performance because these variants directly alter the vocabulary itself, leading to significant changes in word vector representations. Word order changes and stop word removal have relatively smaller impacts because the model primarily relies on key content words for semantic understanding. Through adversarial training, the model's robustness to all types of variants improves, particularly in terms of synonym substitution and spelling errors. This indicates that adversarial learning effectively helps the model learn more fundamental semantic features, rather than superficial word matching.

[0053] Experiments on the FashionIQ, Fashion200K, and Shoes datasets validated the effectiveness and robustness of the proposed method. The method in this embodiment achieves an average improvement of approximately 4 percentage points in Recall@10 over the baseline, while also demonstrating good robustness across the seven query variants. These technical innovations provide a solid foundation for subsequent system implementation and performance optimization.

[0054] In some embodiments, the query enhancement method based on hierarchical knowledge base described above can also be implemented in the following ways.

[0055] In this example, a user wants to buy a summer dress that fits their body type and has attached a selfie to their search, writing, "I'm looking for a light summer dress that fits me." The user's search is vague because they haven't specified the color, length, or style, but hope to find clothing that suits their body type and style.

[0056] The query input includes: Image: a selfie uploaded by the user; Text: "I'm looking for a light summer dress that fits me." Keyword extraction (step one): the keyword extraction module identifies the following keywords: Category: dress; Material: lightweight; Season: summer; Style: simple and fresh. Knowledge retrieval (step two): Sentence-BERT encoding: the query text "I'm looking for a light summer dress that fits me" is encoded to obtain the query vector $\mathbf{q}$.

[0057] Cosine similarity calculation: the query vector is compared with the features of clothing samples pre-stored in the knowledge base, and the top-K knowledge items most similar to the query vector are returned; here K is assumed to be 5.
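The retrieval step above can be sketched with NumPy as follows. This is an illustration under stated assumptions: the query and knowledge vectors are taken as already encoded (for example by Sentence-BERT), and the function names `l2_normalize` and `top_k_knowledge` are hypothetical. After L2 normalization, cosine similarity reduces to a dot product.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit L2 norm along the last axis."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k_knowledge(query_vec, knowledge_vecs, k=5):
    """Return (index, cosine similarity) for the k most similar knowledge items.

    query_vec: shape (dim,); knowledge_vecs: shape (num_items, dim).
    """
    q = l2_normalize(query_vec)
    kb = l2_normalize(knowledge_vecs)
    sims = kb @ q                      # cosine similarity per knowledge item
    order = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]
```

In a deployed system the knowledge vectors would be normalized once when the vector database is built, so only the query needs normalizing at search time.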

[0058] Based on the matching degree, the following relevant information is obtained from the knowledge base: Style category: simple, casual, A-line skirt; Material: chiffon, mint green fabric; Suitable occasions: everyday wear, travel and vacation. LLM-based query enhancement (step three): a lightweight LLM is used to generate supplementary query information. Enhanced query: "I need a light summer dress that suits me, preferably an A-line style, made of chiffon, suitable for everyday wear and vacations." The LLM generates a more specific and accurate description based on the original query and the retrieved knowledge, preserving semantic consistency while adding key clothing attributes.

[0059] Query reconstruction (fallback when the LLM is unavailable): if the LLM fails to generate an enhanced query, keywords are extracted from the retrieved Top-5 knowledge items and appended to the existing query. Extracted keywords: A-line skirt, chiffon, everyday wear. Reconstructed query: "A-line skirt that suits me, chiffon fabric, suitable for everyday wear." Robustness enhancement: in clothing queries, non-standard descriptions or variations in user input (such as spelling errors or word order changes) may cause a query to fail. For example, when a user enters "suitable lightweight dress," the query may contain spelling errors or lack key attributes. To handle this, the model's robustness to query variations is strengthened through the adversarial learning and few-shot learning strategies, so that the system can correctly understand the query and match relevant clothing regardless of how the user phrases it.
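The LLM enhancement step and its keyword-based fallback can be sketched together as below. Everything here is an illustrative assumption: the prompt wording, the function names, and the convention that the LLM is passed in as a plain callable taking a prompt string and returning text. No particular LLM API is implied by the source.

```python
from typing import Callable, Optional, Sequence


def build_prompt(query: str, knowledge: Sequence[str]) -> str:
    """Assemble a prompt from the query and retrieved knowledge (hypothetical wording)."""
    facts = "; ".join(knowledge)
    return (
        "Rewrite the shopping query so that it is more specific, "
        "keeping its original meaning.\n"
        f"Query: {query}\n"
        f"Related knowledge: {facts}\n"
        "Enhanced query:"
    )


def reconstruct_query(query: str, keywords: Sequence[str]) -> str:
    """Fallback: append retrieved keywords that the query does not already contain."""
    lowered = query.lower()
    extra = [kw for kw in keywords if kw.lower() not in lowered]
    if not extra:
        return query
    return f"{query.rstrip('. ')}, {', '.join(extra)}"


def enhance_query(
    query: str,
    knowledge: Sequence[str],
    llm: Optional[Callable[[str], str]] = None,
) -> str:
    """Use the LLM when it is available and returns text; otherwise concatenate keywords."""
    if llm is not None:
        try:
            answer = llm(build_prompt(query, knowledge)).strip()
            if answer:
                return answer
        except Exception:
            pass  # any LLM failure falls through to the keyword fallback
    return reconstruct_query(query, knowledge)
```

Treating an empty LLM response the same way as an exception keeps the retrieval pipeline running with a usable, if less fluent, query.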

[0060] Final retrieval: Coarse ranking: the most relevant clothing samples are selected by computing CLIP-based vector similarity.

[0061] Fine-grained ranking: a cross-modal attention mechanism fuses image and text information, combining the user's selfie with the enhanced query, to accurately rank the clothing that best meets the user's needs.
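The coarse-then-fine flow can be sketched as a two-stage ranking. This is a hedged illustration only: the source does not specify the cross-modal attention model, so the fine-grained scorer is represented by a hypothetical callable `fine_score` supplied by the caller, and `shortlist` and `top_k` are assumed parameter names.

```python
import numpy as np


def coarse_then_fine(query_vec, gallery_vecs, fine_score, shortlist=100, top_k=10):
    """Two-stage ranking: cosine-similarity shortlist, then fine-grained re-ranking.

    fine_score: callable mapping a gallery index to a relevance score
    (a stand-in for the cross-modal attention model; hypothetical).
    """
    q = np.asarray(query_vec, dtype=float)
    g = np.asarray(gallery_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = g @ q
    # Stage 1: keep only the most similar candidates (cheap vector search).
    candidates = np.argsort(-sims)[:shortlist]
    # Stage 2: re-rank the shortlist with the more expensive scorer.
    reranked = sorted(candidates, key=lambda i: fine_score(int(i)), reverse=True)
    return [int(i) for i in reranked[:top_k]]
```

Only the shortlist is passed to the expensive scorer, so an item excluded by the coarse stage cannot be recovered by the fine stage; the shortlist size therefore trades recall against cost.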

[0062] Result: suppose the final Top-K results are as follows: a white chiffon A-line skirt, suitable for everyday wear and vacations, light and comfortable; a mint green chiffon dress, suitable for spring and summer, fresh and simple; a minimalist striped A-line skirt, suitable for everyday and casual occasions.

[0063] These results not only meet the user's basic query requirements, but also take into account the comprehensive analysis of images and query text, thus providing highly accurate recommendations.

[0064] This embodiment provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the query enhancement method based on a hierarchical knowledge base described above.

[0065] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of the present invention without departing from the spirit and scope of the claims. All of these forms are within the protection scope of the present invention.

Claims

1. A method for query enhancement based on a hierarchical knowledge base, characterized in that it includes the following steps: S1: based on a combined query consisting of a reference image and a modified text, obtain an initial text vector and an image vector using a sentence encoder and a pre-trained CLIP model; S2: construct a hierarchical knowledge base containing a category layer, an attribute layer, and a style layer, and encode and normalize the hierarchical knowledge base to obtain a vector database; S3: calculate the similarity between the image vector and the knowledge vectors in the vector database, and return the Top-K semantic units with the highest similarity to obtain candidate semantic tags; S4: input the modified text, the candidate semantic tags, and the corresponding standard descriptions from the hierarchical knowledge base into a large language model for semantic enhancement, and generate a semantically enhanced query text; S5: encode the semantically enhanced query text using the sentence encoder to obtain an enhanced query text vector, and then encode and fuse the enhanced query text vector, the initial text vector, and the image vector to obtain an enhanced query representation; S6: use the enhanced query representation to perform the combined image retrieval task, improving retrieval accuracy and robustness.

2. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that the sentence encoder is the Sentence-BERT model, which introduces a Siamese network structure and a contrastive learning objective on the basis of the BERT model; by minimizing the distance between similar sentence pairs and maximizing the distance between dissimilar sentence pairs, it learns high-quality sentence embeddings for encoding the user's input query statement into a vector.

3. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that the training method for the pre-trained CLIP model includes: training the CLIP model using a loss function based on query variants to obtain the pre-trained CLIP model.

4. The query enhancement method based on a hierarchical knowledge base according to claim 3, characterized in that the loss function is calculated as follows: $$\mathcal{L}_{total} = \mathcal{L}_{ret} + \lambda \mathcal{L}_{cons}$$ where $\mathcal{L}_{total}$ represents the total loss function; $\mathcal{L}_{ret}$ is the standard retrieval loss; $\mathcal{L}_{cons}$ is the consistency loss between the features of the original query and the variant query; and $\lambda$ is the balance coefficient.

5. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that the category layer is used to describe the clothing type, the attribute layer is used to describe at least one of color, material, length, sleeve type, and collar type, and the style layer is used to describe the overall aesthetic style.

6. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that the hierarchical knowledge base is encoded and normalized as follows: $$\mathbf{k}_i = \mathrm{Norm}\big(\mathrm{SBERT}(t_i)\big), \quad t_i \in \mathcal{K}$$ where $\mathbf{k}_i$ is the encoded knowledge vector; $\mathrm{Norm}(\cdot)$ indicates L2 normalization; $\mathrm{SBERT}(\cdot)$ is the Sentence-BERT encoder; $t_i$ indicates the $i$-th knowledge text; and $\mathcal{K}$ represents the knowledge base set.

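7. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that the similarity is calculated as follows: $$\mathrm{sim}(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q} \cdot \mathbf{k}_i}{\|\mathbf{q}\| \, \|\mathbf{k}_i\|}, \quad \mathrm{TopK}(\mathbf{q}) = \underset{t_i \in \mathcal{K}}{\arg\max}^{(k)} \ \mathrm{sim}(\mathbf{q}, \mathbf{k}_i)$$ where $\mathrm{sim}(\mathbf{q}, \mathbf{k}_i)$ is the similarity between the query vector $\mathbf{q}$, represented by the image vector, and the knowledge vector $\mathbf{k}_i$; $\|\cdot\|$ represents the L2 norm; $\mathrm{TopK}(\mathbf{q})$ represents the $k$ semantic units with the highest similarity that are returned; and $\arg\max$ indicates finding, within the knowledge base set $\mathcal{K}$, the argument that makes the function reach its maximum value.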

8. The query enhancement method based on a hierarchical knowledge base according to claim 1, characterized in that step S4 further includes: when generation by the large language model fails or the model is unavailable, generating the semantically enhanced query text using a traditional query reconstruction method based on keyword concatenation.

9. An application of the query enhancement method based on a hierarchical knowledge base as described in any one of claims 1-8, characterized in that it is applied to clothing search queries.

10. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the query enhancement method based on a hierarchical knowledge base as described in any one of claims 1-8.