Semantic expansion matching method and system based on domain synonym library

By constructing a domain-specific thesaurus and generating an extended query index, the system addresses the inaccuracy of e-commerce search systems when faced with diverse queries. This enables precise understanding and dynamic adaptation of user intent, thereby enhancing the intelligence level of the search system.

CN120910232B · Active · Publication Date: 2026-06-16 · SUZHOU BIG DATA GRP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SUZHOU BIG DATA GRP CO LTD
Filing Date
2025-07-24
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing e-commerce search systems struggle to keep up with emerging vocabulary and domain-specific expressions when faced with diverse and colloquial user queries, resulting in insufficient comprehensiveness and accuracy of search results.

Method used

A domain-specific thesaurus is constructed. By obtaining user query keywords and context information, a query context vector is generated. The relevance score between the extended thesaurus and the query vector is calculated. An extended query index is generated through weighted processing. Semantic extension matching is performed in conjunction with the platform's service category system.

Benefits of technology

It improves the system's ability to accurately match product and user queries, dynamically adapts to changes in user input, optimizes the accuracy and precision of query results, and enhances the intelligence level of the search system.

Generated by Eureka AI based on patent content.

Abstract

The application discloses a semantic expansion matching method and system based on a domain synonym library, and relates to the technical field of data processing. The method comprises the following steps: acquiring a query keyword input by a user, and generating a query context vector according to the query keyword and user context information; if the query keyword does not belong to a category word in a platform service category system, determining a set of expansion synonyms matched with the query keyword from a pre-constructed domain synonym library; calculating the correlation score between each expansion synonym in the set of expansion synonyms and the query context vector, and obtaining a corresponding weighted expansion synonym through weighting processing; and generating an expansion query index used for search matching according to each weighted expansion synonym and the query keyword. Thus, the dynamic expansion of the domain scene and the synonym library is combined, multi-dimensional semantic expansion can be performed when a complex and non-standardized user query is processed, and the intelligent level of the system is improved.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a semantic expansion matching method and system based on a domain thesaurus.

Background Technology

[0002] With the rapid development of e-commerce, the quantity and types of goods on online platforms are becoming increasingly diverse, leading users to place higher demands on the accuracy and intelligence of e-commerce search systems. In practical applications, users often use diverse and colloquial expressions to search for goods, resulting in multiple different descriptions of the same product. This diversification of user expressions presents greater challenges to the understanding and matching capabilities of existing search technologies.

[0003] Currently, mainstream e-commerce search systems generally rely on keyword matching, static dictionary expansion, or semantic models built on general corpora to associate queries with product information. While these methods can effectively handle standardized expressions or known vocabulary, they often struggle to promptly cover and respond to emerging terms, popular internet slang, or domain-specific expressions, affecting the comprehensiveness and accuracy of search results. Furthermore, as product information and user needs continue to evolve, traditional semantic matching methods still exhibit significant shortcomings in terms of dynamic adaptability and refined understanding.

Summary of the Invention

[0004] This application provides a semantic expansion matching method, system, storage medium, computer program product, and electronic device based on a domain thesaurus, which at least solves the problem that e-commerce search systems in current related technologies cannot adequately understand the true query intent of users under diversified expressions.

[0005] In a first aspect, embodiments of this application provide a semantic expansion matching method based on a domain thesaurus. The method includes: obtaining query keywords input by a user, and generating a query context vector based on the query keywords and user context information; if the query keywords do not belong to category words in the platform service category system, determining an extended synonym set that matches the query keywords from a pre-built domain thesaurus; the domain thesaurus contains multiple domain scenarios and corresponding domain thesaurus tables, each of which records multiple sets of mapping relationships between non-category words and extended synonyms, wherein the extended synonyms are synonyms generated by semantically expanding category words that match non-category words in conjunction with context information of the corresponding domain scenario; calculating the relevance score between the extended synonym and the query context vector for each extended synonym in the extended synonym set, and obtaining corresponding weighted extended synonyms through weighted processing; and generating an extended query index for search matching based on each weighted extended synonym and the query keywords.

[0006] In a second aspect, embodiments of this application provide a semantic expansion matching system based on a domain thesaurus. The system includes: an acquisition unit, configured to acquire query keywords input by a user and generate a query context vector based on the query keywords and user context information; a synonym set determination unit, configured to determine an extended synonym set matching the query keywords from a pre-built domain thesaurus if the query keywords do not belong to category words in the platform service category system, wherein the domain thesaurus contains multiple domain scenarios and corresponding domain thesaurus tables, each of which records multiple sets of mapping relationships between non-category words and extended synonyms, the extended synonyms are synonyms generated by semantically expanding category words that match non-category words in conjunction with context information of the corresponding domain scenario, and the category words are standard vocabulary in the platform service category system; a context analysis unit, configured to calculate, for each extended synonym in the extended synonym set, the relevance score between the extended synonym and the query context vector, and obtain corresponding weighted extended synonyms through weighted processing; and an extended query generation unit, configured to generate an extended query index for search matching based on each weighted extended synonym and the query keywords.

[0007] Thirdly, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the semantic expansion matching method based on a domain thesaurus according to any embodiment of the present application.

[0008] Fourthly, embodiments of this application provide a storage medium storing a computer program thereon, characterized in that, when the program is executed by a processor, it implements the steps of the semantic expansion matching method based on a domain thesaurus of any embodiment of this application.

[0009] Fifthly, embodiments of this application provide a computer program product, including a computer program / instructions, which, when executed by a processor, implement the steps of the semantic expansion matching method based on a domain thesaurus according to any embodiment of this application.

[0010] The semantic expansion matching method and system based on a domain thesaurus provided in this application can produce at least the following technical effects:

[0011] (1) By constructing a domain thesaurus and combining it with the platform service category system, corresponding extended synonyms are provided for non-category words. This not only effectively handles standardized query expressions, but also adapts to colloquial, diverse queries and emerging vocabulary, thereby improving the system's ability to accurately match product and user queries.

[0012] (2) By generating a query context vector based on query keywords and user context information, the system can dynamically adapt to changes in user input. It also introduces weighted processing of the relevance scores between extended synonyms and query context vectors, ensuring that the system can generate accurate extended query indexes based on the semantic relevance of different extended synonyms, thereby optimizing the accuracy and precision of query results.

[0013] This technical solution, by combining domain-specific scenarios with dynamic expansion of the thesaurus, enables multi-dimensional semantic expansion when handling complex and non-standardized user queries, thereby enhancing the system's intelligence level. When faced with new types of queries, such as colloquial expressions, ambiguous phrasing, and internet slang, the system can respond quickly and provide highly relevant product information, significantly enhancing the semantic understanding and processing capabilities of the search system.

Description of the Drawings

[0014] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0015] Figure 1 is a flowchart of an example of a semantic expansion matching method based on a domain thesaurus according to an embodiment of this application.

[0016] Figure 2 is a flowchart of an example of adaptively updating a domain thesaurus according to an embodiment of this application.

[0017] Figure 3 is a flowchart of an example of updating context information of a domain scenario according to an embodiment of this application.

[0018] Figure 4 is a schematic diagram of an example interface interaction for managing a dataset for fine-tuning a large language model according to an embodiment of this application.

[0019] Figure 5 is a schematic diagram of an example interface interaction for fine-tuning a large language model according to an embodiment of this application.

[0020] Figure 6 is a schematic diagram comparing simulated experimental results for an example of the quality of synonym expansion under different methods.

[0021] Figure 7 is a structural block diagram of an example of a semantic expansion matching system based on a domain thesaurus according to an embodiment of this application.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0023] It should be noted that the commonly used semantic matching methods for e-commerce search mainly include the following three categories: keyword matching-based search technology, static thesaurus-based expansion technology, and word vector models trained on general corpora.

[0024] In keyword-based search technologies, the terms are typically segmented from the user query and product text, and then directly matched against a product database. While this method is simple to implement and fast, it inherently relies on literal consistency and cannot effectively identify synonyms, abbreviations, or emerging trending internet terms. This can lead to retrieval omissions or matching errors when there are discrepancies between the user's description and the product information.

[0025] In the extension technology based on static thesaurus, a simple semantic expansion between query terms and product information is achieved by pre-building a thesaurus. Although this technology can improve the semantic coverage to some extent, the thesaurus is not updated in a timely manner, making it difficult to cover rapidly changing hot words, new product category terms, and industry-specific terminology in the e-commerce field. This results in limited responsiveness of the system when facing emerging expressions, and it cannot meet users' dynamic search needs.

[0026] In word vector models trained on general corpora, such as GloVe (Global Vectors for Word Representation), semantic distance relationships between words are obtained through large-scale unsupervised learning, thereby achieving a certain degree of fuzzy matching and semantic expansion. However, such models are often difficult to deeply customize for e-commerce business scenarios, and have limited ability to handle domain-specific expressions, fine-grained product differentiation, and contextual semantic ambiguity. They are prone to semantic generalization or erroneous expansion, affecting actual retrieval performance.

[0027] It should be understood that the above description of the relevant technologies is intended only to help the public better understand the inventive spirit and motivation of this application, and is not intended to limit this application. Furthermore, the technical solutions described in the above-mentioned relevant technologies are not prior art, and may also be undisclosed technical solutions, such as those under research or in the laboratory stage.

[0028] The technical solutions in this application, including the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information, comply with relevant laws and regulations and do not violate public order and good morals.

[0029] Figure 1 shows a flowchart of an example of a semantic expansion matching method based on a domain thesaurus according to an embodiment of this application.

[0030] Regarding the execution subject of the method in the embodiments of this application, it can be any controller or processor with computing or processing capabilities, so as to effectively cope with the diverse expressions, rapid changes and fine-grained matching requirements in the e-commerce environment, and realize accurate understanding and dynamic adaptation of user intent and product information.

[0031] In some examples, it can be integrated into an electronic device or terminal through software, hardware, or a combination of both, and the type of terminal or electronic device can be diverse, such as mobile phones, tablets, or desktop computers, etc.

[0032] As shown in Figure 1, in step S110, the query keywords input by the user are obtained, and a query context vector is generated based on the query keywords and user context information.

[0033] It should be noted that the search keywords entered by users may be standardized product names, non-standardized colloquial expressions, or search terms with specific contexts. However, in the modern e-commerce environment, users often use vague, ambiguous, or polysemous expressions to search, so the system needs to understand and analyze the search terms from multiple perspectives.

[0034] Here, keyword extraction can be achieved through natural language processing techniques such as word segmentation and named entity recognition (NER) to segment the user's input query and obtain core query terms. For example, for the query "Apple phone price," the keywords "Apple phone" and "price" can be extracted. Contextual information may include the user's search history, the product categories on the current page, the user's purchasing preferences, and other personalized information. This contextual information helps the system better understand the user's intent and enhances the accuracy of semantic matching. For example, if the user has recently searched for "running shoes," then entering "shoes" indicates that the user is more likely looking for "running shoes" than other types of footwear.

[0035] Furthermore, by using word embedding technology (such as Word2Vec, GloVe, etc.), the user-input query keywords and their contextual information are transformed into a query context vector, which represents the semantic features of the query keywords and the related user intent, thus achieving a deep understanding of the user query.
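
As a minimal sketch of step S110, the following Python snippet builds a query context vector by averaging word embeddings for the query keywords and for the context terms, then blending the two. The toy embedding table, the function names, and the `context_weight` blending parameter are illustrative assumptions; the text names Word2Vec and GloVe as possible embedding sources but does not specify how keyword and context vectors are combined.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy embedding table; a real system would load Word2Vec or GloVe vectors.
EMBEDDINGS: Dict[str, Vector] = {
    "shoes": [1.0, 0.0, 0.0],
    "running": [0.0, 1.0, 0.0],
    "price": [0.0, 0.0, 1.0],
}


def mean_vector(tokens: List[str], table: Dict[str, Vector]) -> Vector:
    """Average the embeddings of the tokens found in the table."""
    dim = len(next(iter(table.values())))
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        # No known token: fall back to the zero vector.
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def query_context_vector(
    keywords: List[str],
    context_terms: List[str],
    table: Dict[str, Vector],
    context_weight: float = 0.5,  # assumed blending factor, not from the source
) -> Vector:
    """Blend the keyword vector with the user-context vector."""
    keyword_vec = mean_vector(keywords, table)
    context_vec = mean_vector(context_terms, table)
    return [
        (1.0 - context_weight) * k + context_weight * c
        for k, c in zip(keyword_vec, context_vec)
    ]


# A user who recently searched "running shoes" now types "shoes".
vector = query_context_vector(["shoes"], ["running", "shoes"], EMBEDDINGS)
```

In this sketch the search history pulls the vector for "shoes" toward "running", which mirrors the running-shoes example in paragraph [0034].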

[0036] In step S120, if the query keyword does not belong to the category words in the platform service category system, then the extended synonym set that matches the query keyword is determined from the pre-built domain thesaurus.

[0037] Once the system obtains the query keywords, the next step is to determine whether the keywords belong to a category within the platform's service category system. If the query keywords are keywords within a predefined category, a direct search and matching can be performed. However, for keywords that do not belong to a category, the system can search for extended synonyms from a pre-built domain thesaurus.

[0038] Here, the domain thesaurus contains multiple domain scenarios and corresponding domain thesaurus tables. Each domain thesaurus table is used to record multiple sets of mapping relationships between non-category words and extended synonyms. Extended synonyms are generated by semantically expanding category words that match non-category words with contextual information of the corresponding domain scenario.
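
The category check and thesaurus lookup of step S120 can be sketched as follows. The nested-dictionary layout, the sample entries, and the function name `find_extended_synonyms` are assumptions made for illustration; the text only states that each domain scenario has a table mapping non-category words to extended synonyms, and it does not say how the active scenario is selected, so the scenario is passed in explicitly here.

```python
from typing import Dict, List, Set

# Hypothetical category words from the platform service category system.
CATEGORY_WORDS: Set[str] = {"sports shoes", "leather shoes"}

# Hypothetical domain thesaurus: scenario -> (non-category word -> extended synonyms).
DOMAIN_THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "students": {
        "student shoes": [
            "student leather shoes",
            "sports student shoes",
            "casual student shoes",
            "running shoes",
        ],
    },
    "outdoor travel": {
        "student shoes": ["hiking shoes"],
    },
}


def find_extended_synonyms(keyword: str, scenario: str) -> List[str]:
    """Return extended synonyms for a non-category keyword in one scenario."""
    if keyword in CATEGORY_WORDS:
        # Category words are searched directly, so no expansion is needed.
        return []
    table = DOMAIN_THESAURUS.get(scenario, {})
    # Copy the list so callers cannot mutate the thesaurus.
    return list(table.get(keyword, []))


candidates = find_extended_synonyms("student shoes", "students")
```

Keeping one table per scenario means the same non-category word can expand differently in different scenarios without the tables interfering with each other.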

[0039] It should be noted that the meaning of "domain scenario" can be diverse. Specifically, on the one hand, it can be a predefined or pre-configured explicit scenario type, referring to various product categories or user demand scenarios on the platform, such as tagged user groups or thematic usage scenarios like sneakers, students, business professionals, working from home, outdoor travel, and pet-owning families. On the other hand, it can also be defined or updated through various feature pattern analysis algorithms to support adaptive updates to new domains or new feature patterns. Therefore, when the query terms are ambiguous, semantic expansion can provide more matching terms.

[0040] Furthermore, by managing synonyms from different domain scenarios in the thesaurus, it is possible to ensure that the synonym expansion and query matching in each domain scenario can be optimized according to its unique context, without information interference between different domain scenarios, thus guaranteeing the efficiency and accuracy of synonym expansion and query matching.

[0041] In some implementations, to ensure that extended synonyms can be better recognized and effectively executed by search engines, extended synonyms may also include category terms. Furthermore, each extended synonym is not only intended for standardization of non-category terms, but also for better adaptation to specific domain scenarios. Taking the category "sports shoes" as an example of a domain scenario, "running shoes," "jogging shoes," and "training shoes" can all be considered different extended synonyms within the same category. Although they differ, they all belong to the category of sports shoes in this domain scenario, and therefore can be considered valid extended terms in that scenario.

[0042] Furthermore, by matching query fields, a set of synonyms with semantic similarity to the query keywords can be found from the domain thesaurus, thereby providing the search system with more matching options and enhancing the accuracy of the search.

[0043] For example, when a user searches for "versatile student shoes," the system first determines that "versatile" is a modifier, while "student shoes" or "shoes" as the primary keyword does not directly match specific category terms in the category system. Therefore, semantic matching is used to find various extended synonyms related to "student shoes" or "shoes," such as "student leather shoes," "sports student shoes," "casual student shoes," and "running shoes." Thus, when handling queries with incomplete matches, semantic expansion allows the query keywords to be extended to more possible synonymous expressions.

[0044] In step S130, for each extended synonym in the extended synonym set, the relevance score between the extended synonym and the query context vector is calculated, and the corresponding weighted extended synonym is obtained through weighted processing.

[0045] Here, after obtaining the extended synonym set, it is necessary to perform matching calculations between these synonyms and the query context vector to ensure that extended synonyms unrelated to the user preferences indicated by the user context are filtered out. The relevance score is used to express the degree of fit between the extended synonyms and the user context. It can employ various metrics, such as cosine similarity, Euclidean distance, and dot product, to quantify the relevance between each extended synonym and the user's query intent. A higher relevance score means a better match between the corresponding extended synonym and the user's query intent.

[0046] Then, based on the relevance score of each extended synonym, the system weights the extended synonyms. Specifically, extended synonyms with higher relevance scores are given higher weights, while extended synonyms with lower relevance scores are given lower weights, thus avoiding interference from redundant or irrelevant synonyms in the final search results.

[0047] Regarding the implementation details of step S130, in some examples of embodiments of this application, a gated network is used to calculate the relevance score between the extended synonyms and the query context vector. If the relevance score is less than or equal to a preset relevance threshold, the extended synonyms are filtered out; and if the relevance score exceeds the relevance threshold, the extended synonyms are weighted according to the relevance score to obtain corresponding weighted synonyms. Thus, the gated network evaluates the degree of fit between each extended synonym and the user's input query, filters out irrelevant synonyms, and simultaneously selects the most relevant extended synonyms and weights them according to relevance.

[0048] It should be noted that gating networks have the ability to selectively update and transmit information. They can adjust the weights of each extended synonym according to the input feature information to optimize the matching effect.

[0049] Both extended synonyms and query context vectors are mapped to the same vector space, and cosine similarity can be used to measure their similarity.

[0050] $$\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

[0051] In the formula, a is the query context vector, b is the extended synonym vector, and ||a|| and ||b|| are the magnitudes of vectors a and b, respectively.
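
The cosine similarity between a query context vector a and an extended synonym vector b can be computed as in the short sketch below. The function name and the choice to return 0.0 for a zero-length vector are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector carries no direction, so score it 0.
        return 0.0
    return dot / (norm_a * norm_b)


score = cosine_similarity([1.0, 1.0], [1.0, 0.0])
```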

[0052] The relevance score is obtained by calculating the cosine similarity between the query context vector and the extended synonym vector. This score reflects the degree of matching between the query context and the extended synonym: the higher the score, the better the match.

[0053] Based on the relevance score, extended synonyms are filtered and weighted using a gating network. Specifically, a gating value is calculated to control the contribution of extended synonyms to the final result. This gating value is jointly determined by the query context vector and the relevance scores of the extended synonyms, and a gating value in the range [0,1] is generated using the Sigmoid function.

[0054] $$g(x) = \frac{1}{1 + e^{-\alpha x}}$$

[0055] In the formula, g(x) is the gating value. x is the relevance score between the extended synonym and the query context vector, which can be given by the corresponding cosine similarity. α is the adjustment parameter of the gating network, used to control the sensitivity of the gating mechanism.

[0056] Furthermore, the gate value g(x) can be used as the corresponding relevance score and compared with the preset relevance threshold to complete the screening of extended synonyms.

[0057] Specifically, the system pre-sets a relevance threshold θ (e.g., 0.7). If the relevance score of an extended synonym to the query context is less than or equal to the pre-set threshold, the synonym is considered to have a low degree of fit with the query intent and should be removed from the extended synonym set to avoid interfering with the search results. In this case, the corresponding extended synonyms will be filtered out, thereby effectively removing irrelevant or low-relevance extended synonyms, reducing the impact of irrelevant information on the final search results, and ensuring that the most relevant one or more synonyms can participate in search matching first.
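
The gating and threshold filtering described in paragraphs [0053] to [0057] can be sketched as follows. The sketch assumes the gate is a Sigmoid of the scaled relevance score, 1 / (1 + exp(-alpha * x)), and that the gate value itself is used as the weight of a retained synonym; the value alpha = 4.0 is a hypothetical setting, while the threshold 0.7 is the example value given in the text.

```python
import math
from typing import List, Tuple


def gate(score: float, alpha: float = 4.0) -> float:
    """Sigmoid gate over a relevance score; alpha controls sensitivity (assumed form)."""
    return 1.0 / (1.0 + math.exp(-alpha * score))


def filter_and_weight(
    scored_synonyms: List[Tuple[str, float]],
    alpha: float = 4.0,
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Drop synonyms whose gate value is <= threshold; weight the rest by the gate value."""
    weighted = []
    for synonym, score in scored_synonyms:
        gate_value = gate(score, alpha)
        if gate_value <= threshold:
            # Low fit with the query intent: filter out.
            continue
        weighted.append((synonym, gate_value))
    return weighted


# Hypothetical cosine similarities for two candidate synonyms.
kept = filter_and_weight([("student leather shoes", 0.0), ("running shoes", 1.0)])
```

Under this assumed form, a score of 0 always maps to a gate value of 0.5, so with a threshold of 0.7 only synonyms with a clearly positive similarity survive; a larger alpha makes the cut-off sharper.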

[0058] Therefore, the weighting process takes into account the impact of user context information on the query. For example, if a user has searched for a specific category of goods (such as sneakers) multiple times in their historical search records, the relevant synonyms of that category will be weighted higher, while irrelevant extended synonyms (such as student leather shoes) can be filtered out, thereby ensuring that the relevance of the search results matches the user's long-term interests and behaviors.

[0059] In step S140, an extended query index for search matching is generated based on each weighted extended synonym and query keyword.

[0060] Here, the extended query index will include all weighted synonyms and their weights, which will be used as query conditions for the search engine to perform search matching. The generated extended query index ensures that when matching product information, the search engine not only considers the original query keywords, but also searches based on the semantic information expanded by the thesaurus, ensuring that the search engine can fully and accurately understand the user's query.

[0061] As a result, search engines can perform efficient product matching based on extended query indexes. Even when users enter keywords that are trending online or keywords that do not belong to a category, search engines can combine more context-related extended synonyms for search matching, greatly improving the coverage and accuracy of the search, providing users with more accurate search results, and enhancing the user experience.

[0062] In some examples of embodiments of this application, the weighted extended synonyms are sorted in descending order of their relevance scores to form a weighted extended synonym set. Then, the query keywords are concatenated with the weighted extended synonym set to generate an extended query index for search matching.

[0063] Here, each weighted extended synonym is sorted in descending order of its relevance score to ensure that the most relevant synonyms have a higher priority during the search process.

[0064] Specifically, the weighted extended synonym set includes each extended synonym and its corresponding relevance score. The set is sorted in descending order of these relevance scores, so that synonyms with higher relevance appear at the top. For example, the weighted extended synonym set after sorting is: {("casual student shoes", 0.92), ("sports student shoes", 0.87), ("running shoes", 0.80)}.

[0065] With descending-order sorting, the system prioritizes extended synonyms that are highly relevant to the query intent over synonyms with lower relevance. This ensures that the system returns products that best meet the user's query needs when matching queries.

[0066] After sorting, the system concatenates the query keyword "versatile student shoes" with the sorted weighted extended synonym set. This ensures that both the query term and all extended synonyms are considered during the search matching process. The final extended query index generated is as follows:

[0067] "versatile student shoes" + {("casual student shoes", 0.92), ("sports student shoes", 0.87), ("running shoes", 0.80)}

[0068] The concatenated query index not only includes the query term "versatile student shoes" but also multiple weighted extended synonyms. This ensures the search system fully considers the diversity and semantic expansion of query terms, allowing it to cover more potential product descriptions and categories, increasing the likelihood of search matching, and thus optimizing the coverage of search results. For example, when a user queries "versatile student shoes," it is a non-standard category term, which may prevent the search engine from fully understanding the user's query intent. However, by simultaneously considering extended synonyms such as "casual student shoes," "sports student shoes," and "running shoes," the comprehensiveness and relevance of search results are improved. Furthermore, the concatenated query index can also be sorted in reverse order based on weighted scores, ensuring that products retrieved using synonyms most relevant to the user's context are prioritized, improving the intelligence level of the search system.
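
A minimal sketch of step S140 follows, using the example values from the text. The dictionary layout of the index and the function name `build_extended_query_index` are assumptions; the text specifies only that the query keywords are concatenated with the weighted extended synonyms ordered from highest to lowest relevance score.

```python
from typing import Dict, List, Tuple, Union

ExtendedQueryIndex = Dict[str, Union[str, List[Tuple[str, float]]]]


def build_extended_query_index(
    query: str, weighted_synonyms: List[Tuple[str, float]]
) -> ExtendedQueryIndex:
    """Sort weighted synonyms by score, highest first, and attach them to the query."""
    ranked = sorted(weighted_synonyms, key=lambda pair: pair[1], reverse=True)
    return {"query": query, "expansions": ranked}


index = build_extended_query_index(
    "versatile student shoes",
    [
        ("running shoes", 0.80),
        ("casual student shoes", 0.92),
        ("sports student shoes", 0.87),
    ],
)
```

A search engine consuming such an index could match on the original query first and then on each expansion in order, using the scores as boost factors; how the scores are applied at retrieval time is not specified in the text.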

[0069] It should be noted that the domain thesaurus in this embodiment differs from a regular thesaurus. A regular thesaurus typically provides only one-to-one or one-to-many keyword replacements, whereas a domain thesaurus is built for different domain scenarios (such as specific categories or contexts). It can provide more accurate extended synonyms for a specific product category, user group, or usage scenario. The domain thesaurus not only expands the number of synonyms but also takes into account the specific expression needs of the domain scenario. By comparing and matching against user context information, it can better meet user query needs, improve the relevance and accuracy of search results, and avoid the matching deviations and misunderstandings that arise when an ordinary thesaurus neglects context and domain scenario.

[0070] Furthermore, matching searches that use "keyword matching + user profile" rely primarily on historical user behavior to infer preferences. That approach is typically limited to expressions already known from the user's history and matches poorly on queries encountered for the first time or seen only rarely. Matching based on a domain thesaurus instead proactively expands the scope of the input query and combines it with context for more accurate semantic understanding, obtaining more comprehensive search results without over-reliance on historical data from user profiles. It therefore has stronger dynamic adaptability and can respond in real time to changes in vocabulary and context, whereas traditional user profile-based matching relies on long-term accumulated data and cannot quickly adapt to new user needs.

[0071] The construction of a domain-specific thesaurus can be achieved in two ways. First, it can be a thesaurus and domain-specific scene tags created manually or maintained periodically by the operator based on needs. This ensures the thesaurus accurately reflects market demands and user habits. For example, operators can regularly add and adjust synonyms related to specific product categories or user groups based on market trends and user behavior analysis, and manage tags according to different domain scenarios (such as sneakers, electronic products, etc.), thereby continuously updating and optimizing the thesaurus. Second, it can utilize adaptive machine learning algorithms and natural language processing technology to automatically mine new words and industry terms from user behavior, product descriptions, social media, or other external data sources, and expand and adjust them in conjunction with existing synonyms. Thus, adaptive updates help the thesaurus respond promptly to emerging vocabulary, internet slang, and domain-specific expressions, thereby improving the dynamic adaptability and long-term effectiveness of the domain-specific thesaurus.

[0072] Figure 2 shows a flowchart of an example of adaptively updating a domain thesaurus according to an embodiment of this application.

[0073] As shown in Figure 2, in step S210, a set of potential non-category word-category word matching pairs is extracted from the multi-source dataset.

[0074] In some implementations, the multi-source dataset can be a complete set of data samples to support a full update of the domain thesaurus. However, a full update can lead to excessive resource consumption and cannot support real-time dynamic iteration, resulting in some popular internet terms not being quickly included in the domain thesaurus. Preferably, the multi-source dataset can be incremental data, such as data added from the previous day or week, thereby supporting rapid updates to the domain thesaurus.

[0075] Here, the multi-source dataset contains information from different dimensions, providing a more comprehensive context to enhance the matching degree between queries and product information. Specifically, the multi-source dataset includes at least one of the following dimensions: product title, user query history, and user review history. Product titles typically contain the core characteristics of the product, such as brand, type, and purpose. User query history reflects their common expressions and potential needs, and often contains a large amount of natural language expressions, such as "lightweight running shoes." User review history may use some special vocabulary or expressions, and can also provide important clues for potential non-category terms; for example, a review might mention "comfortable running shoes."

[0076] Regarding the details of word pair extraction, the text in the multi-source dataset can first be preprocessed, including stop word removal, word segmentation, and lemmatization, to extract useful words from the original text. Then, the platform's service category system is used to identify whether each word is a category word or a non-category word (such as a user-entered custom query term).

[0077] In some implementations, a potential matching pair can refer to a non-category word and a category word that appear together in the same query or context. For example, if a user's query is "versatile student shoes," and the query ultimately returns the user's preferred product category as "sports shoes," then "versatile student shoes" - "sports shoes" can be considered a potential non-category word-category word matching pair.

[0078] In some examples of embodiments of this application, at least one popular non-category word is extracted from a multi-source dataset. A popular non-category word is a word whose frequency of occurrence within a predetermined time period exceeds a preset popular frequency threshold and is not included in the search standard vocabulary.

[0079] In some implementations, frequency statistics are performed on each valid word from multiple data sources (such as product titles, user query records, user reviews, etc.) to identify which words appear frequently within a specific time period, and these words may be words that users frequently use but have not yet been included in a standardized vocabulary.

[0080] For example, a trending term like "lazy shoes" might frequently appear in users' search histories without yet being included in the platform's standard vocabulary. To ensure the quality of popular non-category terms, the system sets a frequency threshold and selects, as popular non-category terms, those terms whose frequency of occurrence within a predetermined time period exceeds this threshold. This ensures that the system can select truly representative and potentially valuable terms from a large amount of data, avoiding interference from noisy terms.
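The frequency-based selection of popular non-category words can be sketched as follows. This is a minimal illustration that assumes the tokens have already been extracted from a single predetermined time window; the function name `popular_non_category_words`, the toy token list, and the threshold value are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def popular_non_category_words(
    tokens: Iterable[str],
    standard_vocabulary: Set[str],
    frequency_threshold: int,
) -> List[str]:
    """Return words whose count exceeds the threshold and that are not
    part of the standard vocabulary (sorted for a stable result)."""
    counts = Counter(tokens)
    return sorted(
        word
        for word, count in counts.items()
        if count > frequency_threshold and word not in standard_vocabulary
    )


# Toy tokens assumed to come from one time window (illustrative data).
tokens = ["lazy shoes"] * 5 + ["running shoes"] * 7 + ["retro kicks"] * 2
popular = popular_non_category_words(
    tokens,
    standard_vocabulary={"running shoes", "casual shoes"},
    frequency_threshold=3,
)
```

Here "running shoes" is frequent but already standard, and "retro kicks" is non-standard but too rare, so only "lazy shoes" is selected.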

[0081] Furthermore, based on the semantic similarity between popular non-category words and the vocabulary of each category in the platform's service category system, at least one non-category word-category word matching pair with a corresponding semantic similarity exceeding a preset semantic threshold is selected to generate a potential set of non-category word-category word matching pairs.

[0082] Here, the semantic similarity calculation results are used to roughly screen out which popular non-category words have a sufficiently high semantic relevance to the platform's category words, thus serving as potential matching pairs. For example, "lazy shoes" and "casual shoes" may be highly similar words, while "lazy shoes" and "running shoes" may have a low semantic similarity.

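[0083] $$\mathrm{Sim}(N_i, C_j) = \frac{N_i \cdot C_j}{\lVert N_i \rVert \, \lVert C_j \rVert}$$

[0084] In the formula, $N_i$ and $C_j$ are the word vectors of the popular non-category word $N_i$ and the category word $C_j$, respectively, and $\mathrm{Sim}(N_i, C_j)$ is the cosine similarity between them.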

[0085] Furthermore, by using semantic thresholds to filter pairings between non-category words and category words, and treating them as potential matching pairs, it is possible to ensure that only semantically highly matched non-category words and category words are combined together, thus avoiding the introduction of low-relevance word pairs.
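The semantic screening step can be illustrated with a short sketch that computes cosine similarity between word vectors and keeps only the pairs above a semantic threshold. The two-dimensional vectors, the threshold of 0.8, and the names `cosine_similarity` and `potential_pairs` are assumptions made for illustration; in practice the vectors would come from an embedding model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def potential_pairs(
    non_category_vecs: Dict[str, Sequence[float]],
    category_vecs: Dict[str, Sequence[float]],
    semantic_threshold: float,
) -> List[Tuple[str, str]]:
    """Keep (non-category word, category word) pairs above the threshold."""
    pairs = []
    for n_word, n_vec in non_category_vecs.items():
        for c_word, c_vec in category_vecs.items():
            if cosine_similarity(n_vec, c_vec) > semantic_threshold:
                pairs.append((n_word, c_word))
    return pairs


# Toy 2-dimensional vectors standing in for real word embeddings.
non_category_vecs = {"lazy shoes": [1.0, 0.2]}
category_vecs = {"casual shoes": [0.9, 0.3], "running shoes": [0.1, 1.0]}
pairs = potential_pairs(non_category_vecs, category_vecs, semantic_threshold=0.8)
```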

[0086] In step S220, the co-occurrence frequency between non-category words and category words in each potential non-category word-category word matching pair is counted, and a subset of candidate word pairs whose co-occurrence frequency exceeds a preset co-occurrence frequency threshold is selected.

[0087] Here, co-occurrence frequency is calculated based on the simultaneous appearance of non-category and category terms in the same query from multiple source datasets. Co-occurrence frequency refers to the number of times a non-category term and a category term appear together in the same query or context. Specifically, if a user repeatedly uses "versatile white sneakers" and "casual shoes" in multiple queries, and both frequently appear in the title or reviews of the same product, then the co-occurrence frequency between "versatile white sneakers" and "casual shoes" will be relatively high.

[0088] Specifically, assuming that the number of times a non-category term N and a category term C co-occur in the same query is f(N,C), then the co-occurrence frequency CF(N,C) can be expressed as:

[0089] $$CF(N, C) = \frac{f(N, C)}{M_{\text{total}}}$$

[0090] In the formula, $M_{\text{total}}$ is the total number of queries in the entire reference dataset, and $f(N, C)$ is the number of times that non-category word N and category word C appear together.

[0091] After calculating the co-occurrence frequency, a subset of valid candidate word pairs can be selected based on a set co-occurrence frequency threshold. These candidate word pairs are those that appear frequently in the multi-source dataset and have a strong correlation, typically representing the actual needs and connections between users and products. For example, if the co-occurrence frequency of "versatile white sneakers" and "casual shoes" is higher than a certain preset threshold, they can be considered valid matching pairs.

[0092] Specifically, by comparing the co-occurrence frequency with a pre-set threshold T, if the co-occurrence frequency CF(N,C) of a non-category word and a category word is greater than the threshold, the word pair is included in the candidate set. If the co-occurrence frequency is lower than the threshold, the word pair is ignored. This ensures the quality of the candidate word pair subset and avoids the noise influence of low-frequency word pairs.
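The threshold comparison can be sketched as below, assuming that the co-occurrence frequency is the co-occurrence count f(N,C) divided by the total number of queries, as the definitions of f(N,C) and the total query count suggest. Each query is modeled as a set of already-extracted terms; that representation, the function names, and the threshold T = 0.3 are illustrative assumptions.

```python
from typing import List, Set, Tuple


def co_occurrence_frequency(
    queries: List[Set[str]], non_category: str, category: str
) -> float:
    """Share of queries in which both terms appear together."""
    if not queries:
        return 0.0
    together = sum(
        1 for terms in queries if non_category in terms and category in terms
    )
    return together / len(queries)


def select_candidates(
    queries: List[Set[str]],
    pairs: List[Tuple[str, str]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep only the pairs whose co-occurrence frequency exceeds the threshold."""
    return [
        (n, c)
        for n, c in pairs
        if co_occurrence_frequency(queries, n, c) > threshold
    ]


# Toy query log; each query is the set of terms associated with it.
queries = [
    {"versatile white sneakers", "casual shoes"},
    {"versatile white sneakers", "casual shoes"},
    {"versatile white sneakers", "running shoes"},
    {"lazy shoes"},
]
pairs = [
    ("versatile white sneakers", "casual shoes"),
    ("versatile white sneakers", "running shoes"),
]
candidates = select_candidates(queries, pairs, threshold=0.3)
```

In this toy log the first pair co-occurs in 2 of 4 queries (0.5) and is kept, while the second co-occurs in 1 of 4 (0.25) and is ignored.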

[0093] In step S230, each word pair in the candidate word pair subset is fused for multiple domain scenarios, and the word pairs are semantically expanded based on the context information of each domain scenario to generate corresponding extended synonyms.

[0094] Here, the candidate non-category word-category word matching pairs are semantically expanded using contextual information from the domain scenario, ensuring that the generated synonyms better fit the corresponding domain scenario. Furthermore, the domain scenario can be defined based on specific product categories (such as "sports shoes" or "running shoes") or user demand scenarios (such as "business shoes" or "home shoes"). The selection of the domain scenario can be achieved through manual annotation or adaptive generation using a data-driven approach.

[0095] In some implementations, various deep learning models, such as Word2Vec (a word vector model) and BERT (Bidirectional Encoder Representations from Transformers), can be used to map word pairs and contextual information of the domain scenario to the same semantic space through semantic embedding techniques, thereby performing semantic extension. This comprehensively considers the semantic information of the words themselves and the domain scenario to generate more extended synonyms that conform to the domain scenario.

[0096] In step S240, a mapping relationship between each extended synonym and its corresponding non-category word is constructed to update the domain thesaurus.

[0097] Specifically, by analyzing the semantic relationship between each extended synonym and non-category word, the system establishes a mapping relationship for each pair. Furthermore, based on the domain scenario used by the extended synonym, it is further subdivided into corresponding domain synonym lists to enrich the thesaurus. Moreover, by subdividing the domain synonym lists, the system can dynamically adjust the thesaurus, ensuring it not only has broad coverage but also provides more accurate semantic matching for specific domain scenarios.
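One possible in-memory layout for the subdivided domain synonym lists is a nested mapping from domain scenario to non-category word to a set of extended synonyms. The sketch below is only an assumption about how such a mapping could be stored and updated; the type alias `DomainThesaurus` and the function `update_thesaurus` are hypothetical names.

```python
from typing import Dict, Iterable, Set

# scenario -> non-category word -> set of extended synonyms (assumed layout)
DomainThesaurus = Dict[str, Dict[str, Set[str]]]


def update_thesaurus(
    thesaurus: DomainThesaurus,
    scenario: str,
    non_category: str,
    synonyms: Iterable[str],
) -> None:
    """Add the synonyms to the synonym list of the given domain scenario,
    creating the scenario table and word entry when they do not exist."""
    table = thesaurus.setdefault(scenario, {})
    table.setdefault(non_category, set()).update(synonyms)


thesaurus: DomainThesaurus = {}
update_thesaurus(
    thesaurus, "sports", "versatile student shoes",
    ["running shoes", "sports student shoes"],
)
update_thesaurus(
    thesaurus, "leisure", "versatile student shoes", ["casual student shoes"]
)
# A later incremental update merges into the existing entry.
update_thesaurus(
    thesaurus, "sports", "versatile student shoes",
    ["running shoes", "fitness shoes"],
)
```

Using sets makes repeated incremental updates idempotent: a synonym that is mined again does not create a duplicate entry.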

[0098] This application's embodiments, by combining user behavior and product characteristics, accurately identify and expand the semantic relationships between potential non-category words and category words. By statistically analyzing co-occurrence frequencies, the system can filter out high-frequency and actually relevant word pairs, ensuring the quality and effectiveness of synonym expansion. These candidate word pairs are further fused across multiple domain scenarios, making the expanded synonyms more consistent with the contexts of different domains, thereby improving the accuracy of semantic expansion. Finally, by constructing a mapping relationship between expanded synonyms and non-category words and updating the thesaurus, the system can continuously adaptively optimize the thesaurus content, improving the search engine's responsiveness and accuracy to diverse queries, ensuring that users' search needs are met more precisely.

[0099] Figure 3 shows a flowchart of an example of updating context information of a domain scenario according to an embodiment of this application.

[0100] As shown in Figure 3, in step S310, the text data in the multi-source dataset is transformed into semantic vectors based on the BERT model.

[0101] The BERT model is a deep learning model that captures contextual information through pre-training, thereby providing high-quality semantic vector representations. Text data typically contains rich semantic information, and BERT considers all words in the context through bidirectional encoding, generating a vector representation of each word in the sentence, thus capturing more semantic layers.

[0102] When text data is input into the BERT model, BERT generates a contextual embedding representation, i.e., a semantic vector, for each word. Assume the input text is a sentence $X = \{x_1, x_2, \ldots, x_n\}$. Through its multi-layer encoder, the BERT model maps each word $x_i$ to a fixed-dimensional semantic vector $v_i$, that is:

[0103] $$v_i = \mathrm{BERT}(x_i) \tag{5}$$

[0104] In the formula, $v_i$ is the embedded representation of the word $x_i$ in the given context.

[0105] Therefore, the BERT model generates a corresponding semantic vector for each piece of text data (such as product title, query record, or comment), which represents the deep semantic features of the corresponding text data.

[0106] In step S320, the incremental DBSCAN clustering algorithm is used to cluster the transformed semantic vectors to update or generate at least one cluster, each cluster uniquely corresponding to a domain scenario.

[0107] Here, the incremental DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm is used to cluster the semantic vectors generated by BERT, thereby grouping semantically similar text data into the same cluster, with each cluster representing a domain scenario. Unlike traditional K-Means clustering, which requires pre-specifying the total number of clusters, DBSCAN can automatically handle noisy data and adaptively generate different numbers of clusters based on density. Specifically, the DBSCAN algorithm clusters data by calculating the density between data points. In the data space, regions with higher density are clusters, while regions with lower density are considered noise points. Each cluster consists of a sufficient number of densely connected points.

[0108] More specifically, for each semantic vector $v_i$, incremental DBSCAN calculates the density relationships between it and the other semantic vectors, and clusters them based on a set minimum number of points (MinPts) and a radius threshold (ε). ε determines the neighborhood radius around each point, and MinPts determines the minimum number of data points that must fall within that neighborhood for the region to count as dense. Based on these parameters, DBSCAN clusters data points by calculating their density in space. If a data point has a sufficient number of other data points in its neighborhood, these points are assigned to the same cluster. Otherwise, the point is considered noise.
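To make the roles of ε and MinPts concrete, the following is a compact pure-Python sketch of plain batch DBSCAN. It is not the incremental variant used in this embodiment: the logic for merging newly arriving vectors into existing clusters is not shown. The two-dimensional points stand in for semantic vectors, and the parameter values are illustrative assumptions.

```python
import math
from typing import List, Sequence

NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Sequence[float]], idx: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[idx], p) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> list:
    """Return one cluster label per point (0, 1, ...) or NOISE."""
    labels: list = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point, not expanded further
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point
    return labels


# Toy 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

The two dense groups become two clusters and the isolated point is labeled as noise, without the number of clusters being specified in advance.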

[0109] Furthermore, compared to the traditional DBSCAN, the incremental DBSCAN algorithm can progressively add data points for clustering in a data flow environment without recalculating all points. As new data flows in, incremental DBSCAN can update clusters based on the newly added semantic vectors without recalculating the entire dataset, handling constantly changing data and meeting the needs of adaptive dynamic updates. Therefore, when new text data is converted into semantic vectors, the system only needs to compare this new data with existing clusters and update the corresponding clusters.

[0110] In step S330, the feature information of the cluster center of each cluster is extracted to update the context information of the corresponding domain scene.

[0111] The cluster center represents the core features of the cluster and is the essence of the clustering result. It provides a semantic summary of each domain scenario.

[0112] For example, the features of the cluster center can be the mean vector of all points in the cluster, reflecting the overall semantic features of the cluster. Assume cluster $C_k$ contains $n$ semantic vectors $v_1, v_2, \ldots, v_n$. The cluster center $v_{\text{center}}$ is calculated as:

[0113] $$v_{\text{center}} = \frac{1}{n}\sum_{s=1}^{n} v_s$$

[0114] In the formula, $v_s$ is each semantic vector in cluster $C_k$, and $n$ is the size of the cluster.
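The mean-vector computation of the cluster center described above can be sketched in a few lines. The function name `cluster_center` and the toy vectors are illustrative assumptions.

```python
from typing import List, Sequence


def cluster_center(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the semantic vectors in one cluster."""
    if not vectors:
        raise ValueError("a cluster must contain at least one vector")
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# Three toy 2-dimensional semantic vectors belonging to the same cluster.
center = cluster_center([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```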

[0115] The cluster center represents the core features of the scenario in that domain. The system can extract the contextual information of the scenario by analyzing the semantic vector of the cluster center, which can reflect the user's focus on a certain product category or demand scenario.

[0116] Therefore, by extracting the feature information of the cluster center and updating the context information of the domain scenario, we can accurately reflect the semantic changes of different domain scenarios, ensuring that the domain scenario can be updated adaptively and dynamically respond to changes in market and user behavior.

[0117] Regarding the implementation details of generating extended synonyms in step S230, in some examples of embodiments of this application, it can also be implemented in conjunction with a large language model. Through the powerful natural language understanding and generation capabilities of the large language model, the system can more intelligently generate extended synonyms based on specific domain scenarios, and ensure that the generated synonyms meet the corresponding constraints through prompt word constraint engineering.

[0118] In some implementations, candidate word pairs and contextual information of various domain scenarios are combined with preset scenario expansion prompt templates to construct scenario expansion prompt words. The scenario expansion prompt words are then input into a large language model to output various expanded synonyms through chain reasoning according to each constraint. The platform service category system is used as a knowledge base to support the large language model in generating expanded synonyms.

[0119] More specifically, the scenario expansion prompt template includes scenario goal constraints, semantic consistency constraints, and category word constraints. Scenario goal constraints define that the semantic expansion direction of synonyms should conform to the contextual information of the corresponding domain scenario. For example, in a "sports scenario," expanded synonyms should be words related to "sports," such as "running shoes" or "training shoes." Semantic consistency constraints define that the generated synonyms should be consistent with the core semantic attributes of non-category words in the candidate word pair. For example, for "thin laptop," the generated synonyms should be associated with "notebook," not with other product semantics such as "thin clothing." Category word constraints define that the generated synonyms should include category words from the platform's service category system. For example, the expanded synonyms for "versatile student shoes" should be related to the "student shoes" category on the e-commerce platform, thus better meeting the platform's retrieval needs. These constraints are effectively controlled through the prompt template input from the large language model, ensuring that the generated synonyms both meet semantic requirements and maintain consistency with the platform's service system.

[0120] The scenario expansion prompt template can also be supplemented with various expansion examples to facilitate learning by the large language model. A simplified example of the scenario expansion prompt template is shown below:

[0121] {

[0122] Generate extended synonyms related to "{non-category term}". The following constraints apply when generating extended synonyms:

[0123] 1. Scenario target constraint: Extended synonyms must be related to "{contextual information of the domain scenario}" and conform to the semantic features of the scenario.

[0124] 2. Semantic consistency constraint: Extended synonyms should be consistent with the core attributes of "{non-category term}".

[0125] 3. Category term constraint: The extended synonyms should contain at least one category term from "{Platform Service Category System}".

[0126] Please generate a set of expanded synonyms based on these requirements, while maintaining vocabulary diversity and accuracy.

[0127] }
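As a rough sketch of how the template above might be filled in before being sent to a large language model, the snippet below uses Python string formatting. The placeholder names (`non_category_term`, `scenario_context`, `category_terms`), the function `build_scenario_prompt`, and the choice to render the category system as a comma-separated list are assumptions for illustration; no model call is made.

```python
from typing import Iterable

# Simplified template mirroring the three constraints in the text.
# The placeholder names are illustrative assumptions.
SCENARIO_EXPANSION_TEMPLATE = (
    'Generate extended synonyms related to "{non_category_term}". '
    "The following constraints apply when generating extended synonyms:\n"
    "1. Scenario target constraint: Extended synonyms must be related to "
    '"{scenario_context}" and conform to the semantic features of the scenario.\n'
    "2. Semantic consistency constraint: Extended synonyms should be consistent "
    'with the core attributes of "{non_category_term}".\n'
    "3. Category term constraint: The extended synonyms should contain at least "
    'one category term from "{category_terms}".\n'
    "Please generate a set of expanded synonyms based on these requirements, "
    "while maintaining vocabulary diversity and accuracy."
)


def build_scenario_prompt(
    non_category_term: str,
    scenario_context: str,
    category_terms: Iterable[str],
) -> str:
    """Fill the template for one candidate word and one domain scenario."""
    return SCENARIO_EXPANSION_TEMPLATE.format(
        non_category_term=non_category_term,
        scenario_context=scenario_context,
        category_terms=", ".join(category_terms),
    )


prompt = build_scenario_prompt(
    "versatile student shoes", "sports", ["running shoes", "student shoes"]
)
```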

[0128] In some implementations, the large language model can employ a chain-like reasoning structure to reason step by step according to the constraints in the input prompt.

[0129] Specifically, the model first parses the scene target constraints to identify the domain scene for which extended synonyms need to be generated (such as "sports scene" or "leisure scene"). Based on the attributes of the scene, the model infers the characteristics that the extended synonyms should possess, such as "comfort", "sports performance" or "lightweight".

[0130] For example, for the non-category term "versatile student shoes", if the context is related to "sports", the model will infer that the synonyms should include related categories such as "running shoes" and "fitness shoes", and these synonyms should have the characteristics of being suitable for sports.

[0131] During the reasoning process for maintaining semantic consistency constraints, the system checks whether the generated synonyms are consistent with the core semantics of the original non-category words. It ensures that the extended synonyms match the meaning of the original words and do not deviate from or change the core attributes of the original query (e.g., product attributes and functional attributes). For example, it confirms that these synonyms still conform to the core feature of "versatile student shoes".

[0132] During the reasoning process for category term constraints, based on the platform's product category system, the model ensures that the extended synonyms contain one or more category terms. Here, the aim is to verify the extended synonyms. For example, if the synonyms are "beach shoes" or "slip-on shoes" that do not conform to the platform's category system, then the generated synonyms should include extended synonyms containing platform category terms, such as "beach casual shoes," "beach sandals," "lightweight casual shoes," and "comfortable casual shoes," to achieve effective invocation of the platform's search engine through matching category terms.
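The verification step for the category term constraint can be approximated outside the model with a simple post-check. The sketch below assumes that "contains a category term" means a case-sensitive substring match and that "casual shoes" and "sandals" are category terms on the platform; both are assumptions for illustration, as are the function names.

```python
from typing import Iterable, List, Set


def contains_category_term(synonym: str, category_terms: Set[str]) -> bool:
    """True if at least one category term occurs inside the synonym."""
    return any(term in synonym for term in category_terms)


def filter_by_category_constraint(
    synonyms: Iterable[str], category_terms: Set[str]
) -> List[str]:
    """Keep only generated synonyms that satisfy the category term constraint."""
    return [s for s in synonyms if contains_category_term(s, category_terms)]


# Hypothetical category terms and model output.
category_terms = {"casual shoes", "sandals"}
generated = [
    "beach shoes",
    "slip-on shoes",
    "beach casual shoes",
    "beach sandals",
    "lightweight casual shoes",
]
kept = filter_by_category_constraint(generated, category_terms)
```

A check like this could serve as a safeguard after generation, so that only synonyms that can invoke the platform's search engine through a category term enter the thesaurus.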

[0133] By introducing a chain-like reasoning structure, the large language model can progressively follow various constraints and systematically generate expanded synonyms that meet actual needs. Utilizing structured reasoning ensures the accuracy of expanded synonyms at multiple levels, thereby improving the relevance and accuracy of search matching.

[0134] It should be understood that the large language model can take diverse forms, such as a general-purpose large language model (GPT series, Qwen series, etc.) or a specialized large model, and can also be obtained by appropriately fine-tuning a general-purpose large language model.

[0135] As a further optimization of the implementation, a comprehensive loss function can be designed to fine-tune the general-purpose large language model, enabling it to simultaneously consider semantic consistency, scenario adaptability, and category word matching. This ensures that the large language model can understand and learn the various prompt constraints through training and fine-tuning. The fine-tuned model exhibits stronger semantic understanding and scenario adaptability when generating synonyms, and the generated extended synonyms are more in line with user intent, improving the accuracy of the search system and the user experience.

[0136] Figure 4 shows a schematic diagram of an example interface interaction for managing datasets used to fine-tune a large language model according to an embodiment of this application.

[0137] As shown in Figure 4, the dataset management interface contains information about multiple datasets, including fields such as dataset name, import status, publication status, data source, creator, creation time, and modification time. Users can view detailed information about existing datasets on this interface and upload new datasets by clicking the "Create Dataset" button, making it easy to manage multiple datasets. Furthermore, the data source for a dataset is not limited to local uploads; data from the e-commerce platform's backend can also be uploaded automatically and incrementally on a periodic basis. This ensures that the content of the datasets remains consistent with actual business and user needs, thereby improving the timeliness and accuracy of the model during fine-tuning.

[0138] Figure 5 shows a schematic diagram of an example interface interaction for a fine-tuning task that fine-tunes a large language model according to an embodiment of this application.

[0139] As shown in Figure 5, in the management interface for fine-tuning tasks, users can create new fine-tuning tasks and configure them accordingly. Users can select the appropriate general-purpose large language model (such as the Qwen model) and training method (such as LoRA or full update), and set other task parameters as needed. Furthermore, the fine-tuning task interface can display performance comparisons of different large language models, helping users choose the most suitable base model for fine-tuning. In this way, users can select the most appropriate model and training method based on the specific needs of the task, improving the efficiency and accuracy of the fine-tuning task and ensuring that the final trained model better adapts to the specific application scenarios of the e-commerce platform.

[0140] Below, the thesaurus expansion effect of the semantic expansion matching method based on the domain thesaurus provided in the embodiments of this application is compared with that of three current e-commerce search semantic matching methods, namely "keyword matching-based search technology," "expansion technology based on a static thesaurus," and "word vector model trained on a general corpus."

[0141] Specifically, keyword-based search technology matches user queries with product text using keyword matching. In the static thesaurus-based expansion technology, a pre-built thesaurus expands the query terms with product information. In the word vector model trained on a general corpus, the Word2Vec general word vector model is used for query expansion.

[0142] The specific experimental tasks are as follows:

[0143] Synonym expansion quality. A comparison is made between the proposed solution and traditional methods in terms of accuracy and recall in synonym expansion.

[0144] Adaptability to Emerging Vocabulary. Evaluate the ability of the proposed technical solution to identify and expand upon emerging vocabulary, particularly in rapidly changing scenarios such as trending terms on e-commerce platforms and terms used in new product categories, to verify the solution's adaptability.

[0145] Query matching accuracy. This test evaluates the performance of four methods in handling e-commerce search queries, assessing metrics such as retrieval accuracy, recall, and precision.

[0146] Regarding the dataset, the comparative experiments used real datasets from e-commerce platforms, including:

[0147] Product title data: Contains approximately 10,000 product titles.

[0148] Product descriptions and functional information: Approximately 5,000 detailed product descriptions.

[0149] User behavior data includes user clicks, purchases, query history, and other behavioral data.

[0150] User review data: Approximately 3,000 user reviews, including product ratings and comment content.

[0151] All data has undergone standardized preprocessing.

[0152] The selected synonym expansion quality evaluation indicators are as follows:

[0153] Accuracy: The ratio of queries that match exactly to all queries.

[0154] Recall: The ratio of relevant queries that the system can recall to all relevant queries.

[0155] Precision: The ratio of correctly matched queries to all expanded match results in an expanded query.

[0156] F1-score: The harmonic mean of precision and recall.
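For reference, precision, recall, and F1-score can be computed from sets of expanded and relevant terms as sketched below. This uses the standard set-based definitions; the exact evaluation protocol of the comparative experiments is not specified here, so the function name and the toy sets are assumptions.

```python
from typing import Set, Tuple


def precision_recall_f1(expanded: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and their harmonic mean (F1-score)."""
    true_positives = len(expanded & relevant)
    precision = true_positives / len(expanded) if expanded else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of the 4 expanded terms are relevant, out of 3 relevant terms.
expanded = {"casual shoes", "running shoes", "canvas shoes", "leather boots"}
relevant = {"casual shoes", "running shoes", "sports shoes"}
precision, recall, f1 = precision_recall_f1(expanded, relevant)
```

In this toy case precision is 1/2, recall is 2/3, and the F1-score is 4/7.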

[0157] Table 1. Comparison of the quality of synonym expansion

[0158]

| Method | Accuracy | Recall | Precision | F1-score |
| --- | --- | --- | --- | --- |
| Keyword matching | 0.72 | 0.65 | 0.68 | 0.66 |
| Based on static thesaurus | 0.78 | 0.70 | 0.75 | 0.72 |
| Based on the general word vector model | 0.80 | 0.76 | 0.79 | 0.77 |
| Technical solution of this application | 0.85 | 0.83 | 0.82 | 0.82 |

[0159] As shown in Table 1, the proposed solution outperforms the other three methods on every metric, especially in terms of accuracy and recall. This demonstrates that, by combining a dynamically updated domain thesaurus with contextual information, the proposed solution generates more accurate synonyms and avoids the problem that a static thesaurus cannot cover emerging words and changing scenarios.

[0160] Figure 6 illustrates an example of simulated experimental comparison results for the quality of synonym expansion using different methods. The figure compares the performance of the four methods in terms of synonym expansion quality, evaluated using four metrics: accuracy, recall, precision, and F1-score. The horizontal axis represents the four different methods.

[0161] As shown in Figure 6, the semantic expansion matching method based on a domain thesaurus presented in this application outperforms the other methods across all metrics, particularly in accuracy and recall, demonstrating higher query matching precision and more comprehensive semantic expansion capabilities. By combining a dynamically updated domain thesaurus with contextual information, it can effectively handle synonyms, abbreviations, and emerging terms, providing more accurate matching and broader adaptability, further enhancing search capabilities and user experience in rapidly changing environments such as e-commerce platforms.

[0162] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of combined actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application. In the above embodiments, the descriptions of each embodiment have their own emphasis; for parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0163] Figure 7 shows a structural block diagram of an example of a semantic expansion matching system based on a domain thesaurus according to an embodiment of this application.

[0164] As shown in Figure 7, the semantic expansion matching system 700 based on a domain thesaurus includes an acquisition unit 710, a synonym set determination unit 720, a context analysis unit 730, and an extended query generation unit 740.

[0165] The acquisition unit 710 is used to acquire the query keywords input by the user and generate a query context vector based on the query keywords and user context information.

[0166] The synonym set determination unit 720 is used to determine an extended synonym set that matches the query keyword from a pre-built domain synonym library if the query keyword does not belong to the category words in the platform service category system. The domain synonym library contains multiple domain scenarios and corresponding domain synonym tables. Each domain synonym table is used to record multiple sets of mapping relationships between non-category words and extended synonyms. The extended synonyms are synonyms generated by semantically expanding the category words that match the non-category words in combination with the context information of the corresponding domain scenario. The category words are standard vocabulary in the platform service category system.

[0167] The context analysis unit 730 is used to calculate the relevance score between each extended synonym in the extended synonym set and the query context vector, and obtain the corresponding weighted extended synonyms through weighted processing.
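
The scoring and weighting performed by the context analysis unit 730 can be sketched as follows. This is an assumption-laden illustration: cosine similarity is used here as a simple stand-in for the relevance scorer (the claims name a gating network, whose structure is not specified in this passage), and the function names, vectors, and threshold value are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weight_synonyms(
    query_context_vector: Sequence[float],
    synonym_vectors: Dict[str, Sequence[float]],
    relevance_threshold: float,
) -> List[Tuple[str, float]]:
    """Score each extended synonym and keep those above the threshold.

    Stand-in scorer: cosine similarity (an assumption, not the gating
    network named in the claims). The weight is the relevance score.
    """
    weighted: List[Tuple[str, float]] = []
    for synonym, vector in synonym_vectors.items():
        score = cosine_similarity(query_context_vector, vector)
        # Scores less than or equal to the threshold are filtered out.
        if score <= relevance_threshold:
            continue
        weighted.append((synonym, score))
    return weighted
```

For example, with a query context vector `[1.0, 0.0]` and a threshold of `0.5`, a synonym whose vector is `[1.0, 0.0]` is kept with weight `1.0`, while a synonym whose vector is `[0.0, 1.0]` is filtered out.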

[0168] The extended query generation unit 740 is used to generate an extended query index for search matching based on each of the weighted extended synonyms and the query keywords.
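
A minimal sketch of the extended query generation unit 740 is given below, under stated assumptions: the weighted extended synonyms are ordered by descending relevance score and then concatenated after the original query keyword. The function name `build_extended_query` and the fixed weight of `1.0` assigned to the original keyword are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def build_extended_query(
    query_keyword: str,
    weighted_synonyms: List[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Combine the query keyword with its weighted extended synonyms.

    Returns a list of (term, weight) pairs that a downstream search
    engine could use for matching.
    """
    # Order the synonyms from most to least relevant.
    ranked = sorted(weighted_synonyms, key=lambda pair: pair[1], reverse=True)
    # Assumption: the original keyword always leads with full weight 1.0.
    return [(query_keyword, 1.0)] + ranked
```

Keeping the weights attached to each term lets the search backend boost matches on the more relevant synonyms; how the weights are consumed depends on the search engine in use.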

[0169] In some embodiments, this application provides a non-volatile computer-readable storage medium storing one or more programs including execution instructions. The execution instructions can be read and executed by an electronic device (including but not limited to a computer, server, or network device) to perform the steps of any of the semantic expansion matching methods based on a domain thesaurus described above.

[0170] In some embodiments, this application also provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to perform the steps of any of the above-described semantic expansion matching methods based on a domain thesaurus.

[0171] In some embodiments, this application also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform steps of a semantic expansion matching method based on a domain thesaurus.

[0172] The above-described product can perform the methods provided in the embodiments of this application, and has the corresponding functional modules and beneficial effects for performing the methods. Technical details not described in detail in this embodiment can be found in the methods provided in the embodiments of this application.

[0173] The electronic devices in this application can exist in various forms, including but not limited to: mobile communication devices, ultra-mobile personal computer devices, portable entertainment devices, or other airborne electronic devices with data interaction functions.

[0174] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0175] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software plus a general-purpose hardware platform, or, of course, using hardware. Based on this understanding, the above technical solutions, in essence or in the parts that contribute to the related technology, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or in some parts of the embodiments.

[0176] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A semantic expansion matching method based on a domain thesaurus, characterized in that the method includes: obtaining the query keywords input by the user, and generating a query context vector based on the query keywords and user context information; if the query keyword does not belong to the category words in the platform service category system, determining an extended synonym set matching the query keyword from the pre-built domain synonym library, wherein the domain synonym library contains multiple domain scenarios and corresponding domain synonym tables, each domain synonym table is used to record multiple sets of mapping relationships between non-category words and extended synonyms, and the extended synonyms are synonyms generated by semantically expanding the category words that match the non-category words in combination with the context information of the corresponding domain scenario; for each extended synonym in the extended synonym set, calculating the relevance score between the extended synonym and the query context vector, and obtaining the corresponding weighted extended synonym through weighted processing; and generating, based on the weighted extended synonyms and the query keywords, an extended query index for search matching; wherein the update of the domain thesaurus includes: extracting a set of potential non-category word-category word matching pairs from a multi-source dataset, wherein the multi-source dataset contains at least one of the following dimensions: product title, user history query records, and user history comment information; statistically analyzing the co-occurrence frequency between the non-category word and the category word in each potential non-category word-category word matching pair, and selecting a subset of candidate word pairs whose co-occurrence frequency exceeds a preset co-occurrence frequency threshold, wherein the co-occurrence frequency is calculated based on the non-category words and category words that appear simultaneously in the same query by the user in the multi-source dataset; fusing each word pair in the candidate word pair subset for multiple domain scenarios, and semantically expanding the word pairs based on the context information of each domain scenario to generate corresponding extended synonyms; and constructing a mapping relationship between each of the extended synonyms and the corresponding non-category words to update the domain thesaurus; wherein the process of fusing each word pair in the candidate word pair subset for multiple domain scenarios and semantically expanding the word pairs based on the context information of each domain scenario to generate corresponding extended synonyms includes: combining the candidate word pairs and the context information of each domain scenario with a preset scenario expansion prompt template to construct scenario expansion prompt words, and inputting the scenario expansion prompt words into a large language model so that each extended synonym is output according to each constraint through chain reasoning, wherein the platform service category system serves as a knowledge base to support the large language model in generating the extended synonyms; the scenario expansion prompt template includes scenario target constraints, semantic consistency constraints, and category word constraints; the scenario target constraints define that the semantic expansion direction of synonyms should conform to the context information of the corresponding domain scenario; the semantic consistency constraints define that the extended synonyms should be consistent with the core semantic attributes of the non-category words in the candidate word pair; and the category word constraints define that the extended synonyms should include category words in the platform service category system.

2. The method according to claim 1, wherein the extraction of a set of potential non-category word-category word matching pairs from a multi-source dataset includes: extracting at least one popular non-category word from the multi-source dataset, wherein the popular non-category word is a word whose frequency of occurrence exceeds a preset popular frequency threshold within a predetermined time period and which is not included in the search standard vocabulary; and, based on the semantic similarity between the popular non-category words and the category words in the platform service category system, selecting at least one non-category word-category word matching pair whose semantic similarity exceeds a preset semantic threshold to generate the set of potential non-category word-category word matching pairs.

3. The method according to claim 1, wherein, before fusing each word pair in the candidate word pair subset for multiple domain scenarios and semantically expanding the word pairs based on the context information of each domain scenario to generate corresponding extended synonyms, the method further includes: transforming the text data in the multi-source dataset into semantic vectors based on the BERT model; clustering the transformed semantic vectors using the incremental DBSCAN clustering algorithm to update or generate at least one cluster, wherein each cluster uniquely corresponds to a domain scenario; and extracting the feature information of the cluster center of each cluster to update the context information of the corresponding domain scenario.

4. The method according to claim 1, wherein calculating the relevance score between the extended synonym and the query context vector, and obtaining the corresponding weighted extended synonym through weighted processing, includes: calculating the relevance score between the extended synonym and the query context vector using a gating network; if the relevance score is less than or equal to a preset relevance threshold, filtering out the extended synonym; and if the relevance score exceeds the relevance threshold, weighting the extended synonym according to the relevance score to obtain the corresponding weighted extended synonym.

5. The method according to claim 1, wherein generating an extended query index for search matching based on each of the weighted extended synonyms and the query keywords includes: sorting the weighted extended synonyms by an inverted index according to their relevance scores to form a weighted extended synonym set; and concatenating the query keywords with the weighted extended synonym set to generate the extended query index for search matching.

6. A semantic expansion matching system based on a domain thesaurus, characterized in that the system is used to implement the method as described in any one of claims 1-5, and the system comprises: an acquisition unit, used to acquire the query keywords input by the user and generate a query context vector based on the query keywords and user context information; a synonym set determination unit, used to determine an extended synonym set matching the query keyword from a pre-built domain synonym library if the query keyword does not belong to the category words in the platform service category system, wherein the domain synonym library contains multiple domain scenarios and corresponding domain synonym tables, each domain synonym table is used to record multiple sets of mapping relationships between non-category words and extended synonyms, the extended synonyms are synonyms generated by semantically expanding the category words that match the non-category words in combination with the context information of the corresponding domain scenario, and the category words are standard vocabulary in the platform service category system; a context analysis unit, used to calculate the relevance score between each extended synonym in the extended synonym set and the query context vector, and obtain the corresponding weighted extended synonyms through weighted processing; and an extended query generation unit, used to generate an extended query index for search matching based on each of the weighted extended synonyms and the query keywords.