Text similarity matching method, device, equipment and medium

By using word segmentation, syntactic analysis, and conditional random field tagging of word segmentation attributes, the problem of distinguishing between semantic and key similarity in text similarity matching is solved, improving the accuracy and applicability of text similarity matching in complex scenarios.

CN121542777B · Active · Publication Date: 2026-06-23 · CHINA TELECOM CORP LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CORP LTD
Filing Date: 2026-01-19
Publication Date: 2026-06-23

IPC: G06F18/22; G06F18/25; G06F40/30; G06F40/284; G06F40/211

AI Tagging

Application Domain

Semantic analysis

Technology Topics

Conditional random field; Algorithm

Technical Efficacy Phrases

Improve robustness; improve accuracy

AI Technical Summary

Technical Problem

Existing text similarity matching technologies struggle to distinguish between structural similarity and entity semantic similarity at the semantic level when identifying user query intent, leading to misjudgments. In particular, their accuracy is insufficient in cases of sentence interference or terminology variations, limiting their applicability and robustness in complex scenarios.

Method used

By using word segmentation, syntactic analysis, and Conditional Random Field (CRF) to label the syntactic attributes and key tags of the segmented words, semantic similarity and key similarity are calculated. Based on the fusion threshold and weights, the text similarity is finally calculated.

Benefits of technology

It improves the accuracy and robustness of text similarity matching, effectively distinguishing sentence structure and entity semantics in complex business scenarios, and reducing misjudgments.

Abstract

Embodiments of the present disclosure provide a text similarity matching method and device, equipment, medium and program product, relating to the technical field of computer. The method comprises: obtaining a first text and a second text; on the one hand, a syntax analyzer is used to mark the syntax attribute of each word segmentation, and then the semantic similarity between the two texts is calculated; on the other hand, a conditional random field is used to mark the key label of each word segmentation, and then the key similarity between the two texts is calculated; according to the comparison of the key similarity and the fusion threshold, the first fusion weight and the second fusion weight are determined; based on the first fusion weight, the second fusion weight, the semantic similarity and the key similarity, the similarity between the first text and the second text is calculated. The method separates the calculation of semantic similarity and key similarity, solves the misjudgment problem caused by confusing sentence structure and entity semantics, and improves the robustness and accuracy of text similarity matching in complex business scenarios.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and more specifically, to a text similarity matching method, apparatus, electronic device, storage medium, and computer program product.

Background

[0002] In applications such as customer service systems and business intelligence analytics, the accuracy of text similarity matching is crucial for identifying user query intent. However, existing text matching technologies commonly confuse "structural similarity" with "entity semantic similarity," making it difficult to achieve fine-grained semantic distinction and leading to frequent misjudgments in critical business scenarios. For example, when query statements targeting different business objects (such as table A and table B) are highly similar in sentence structure, traditional methods easily misjudge them as semantically similar due to structural consistency; conversely, when the expressions differ but the core terms are equivalent (such as "CPU" and "central processing unit"), they are misclassified because the semantic relationship between the terms is not identified. Such misjudgments seriously affect the accuracy of system decisions and user experience. Current mainstream text similarity models typically treat text as a whole for vectorization and similarity calculation, lacking distinction between the semantic roles of function words and entity words. Therefore, when faced with sentence interference or terminology variations, traditional methods often fail to accurately assess semantic similarity, limiting their applicability and robustness in complex real-world scenarios.

Summary of the Invention

[0003] This disclosure provides a text similarity matching method, apparatus, electronic device, storage medium, and computer program product.

[0004] Other features and advantages of this disclosure will become apparent from the following detailed description, or may be learned in part from practice of this disclosure.

[0005] According to one aspect of this disclosure, a text similarity matching method is provided, the method comprising: acquiring a first text and a second text; performing word segmentation on the first text and the second text to obtain a first word segmentation set and a second word segmentation set, respectively, the first word segmentation set including a plurality of first word segments and the second word segmentation set including a plurality of second word segments; performing syntactic analysis on the first text and the second text respectively using a syntactic analyzer, and marking the first word segments and the second word segments with corresponding syntactic attributes; calculating the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first word segments and the second word segments; performing keyness analysis on the first word segments and the second word segments using a conditional random field (CRF) based on a preset domain knowledge base, and labeling the first word segments and the second word segments with corresponding key tags; calculating the key similarity between the first text and the second text based on the key tags corresponding to each of the first word segments and the second word segments; determining, based on a comparison between the key similarity and a fusion threshold, a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity; and calculating the similarity between the first text and the second text based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity.

[0006] In an exemplary embodiment, the step of performing syntactic analysis on the first text and the second text respectively using a syntactic analyzer, and marking the first and second segments with corresponding syntactic attributes, includes: performing syntactic analysis on the first text and the second text respectively using a syntactic analyzer to obtain the dependency relation types corresponding to each first and second segment, as well as the headword index and dependency word index of the first and second texts; filtering the first and second segment sets according to the dependency relation types; and marking the first and second segments in the filtered first and second segment sets with corresponding syntactic attributes according to the dependency relation types, headword index, and dependency word index.

[0007] In an exemplary embodiment, calculating the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first and second segmented words includes: generating a first feature vector corresponding to the first text based on the syntactic attributes of each of the first segmented words and the first text; generating a second feature vector corresponding to the second text based on the syntactic attributes of each of the second segmented words and the second text; and calculating the semantic similarity between the first text and the second text based on the first feature vector and the second feature vector.

[0008] In an exemplary embodiment, the method further includes: performing syntactic structure analysis on the first text and the second text using a syntactic structure analysis module to obtain a first syntactic structure feature and a second syntactic structure feature, respectively; generating a first feature vector corresponding to the first text based on the syntactic attributes of each of the first segmented words, the first syntactic structure feature, and the first text; and generating a second feature vector corresponding to the second text based on the syntactic attributes of each of the second segmented words, the second syntactic structure feature, and the second text.

[0009] In an exemplary embodiment, performing keyness analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and labeling the first and second word segments with corresponding key tags, includes: performing keyness analysis on the first and second word segments using the conditional random field based on the preset domain knowledge base, and labeling the key tag of each of the first and second word segments as a core word or an ordinary word; and, in response to a first word segment and a second word segment satisfying a synonym mapping relationship in the domain knowledge base, labeling the key tags of that first word segment and second word segment as synonyms; wherein the domain knowledge base stores a set of synonym mapping relationships.

[0010] In an exemplary embodiment, the step of calculating the key similarity between the first text and the second text based on the key tags corresponding to each of the first and second segmented words includes: constructing a plurality of segmentation pairs based on the correspondence between each segmented word in the first and second segmentation sets; each segmentation pair includes a first segmented word and / or a second segmented word that are in a corresponding relationship; determining key weights based on the key tags corresponding to the segmented words in the segmentation pairs; calculating the similarity between the segmented words in the segmentation pairs to obtain segmentation similarity; calculating the segmentation pair similarity of the segmentation pairs based on the key weights and segmentation similarity; and determining the key similarity between the first text and the second text based on the segmentation pair similarity of each of the segmentation pairs.

[0011] In an exemplary embodiment, determining the key weight based on the key tags corresponding to the word segments in the word segmentation pair includes: determining the key weight as a first weight in response to the key tag corresponding to the word segment in the word segmentation pair being a core word; determining the key weight as a second weight in response to the key tag corresponding to the word segment in the word segmentation pair being a common word; and determining the key weight as a third weight in response to the key tag corresponding to the word segment in the word segmentation pair being a synonym; wherein the first weight is greater than the second weight, which is greater than the third weight.

[0012] In an exemplary embodiment, the step of determining the key weight as a third weight in response to the key tag corresponding to the word segment in the word segmentation pair being a synonym includes: determining the synonym weight corresponding to the word segment in the word segmentation pair based on the synonym mapping relationship in the domain knowledge base; and determining the third weight based on the synonym weight.

[0013] In an exemplary embodiment, determining the key similarity between the first text and the second text based on the segmentation pair similarity of each of the segmentation pairs includes: performing a weighted calculation based on the segmentation pair similarity of each of the segmentation pairs to obtain the key similarity between the first text and the second text.

[0014] In an exemplary embodiment, determining the key similarity between the first text and the second text based on the segmentation pair similarity of each of the segmentation pairs includes: determining the key similarity between the first text and the second text based on the maximum value among the segmentation pair similarities of the segmentation pairs.

[0015] In an exemplary embodiment, determining a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison of the key similarity and the fusion threshold includes: determining that the first fusion weight is less than the second fusion weight in response to the key similarity being greater than the fusion threshold; and determining that the first fusion weight is greater than the second fusion weight in response to the key similarity being less than the fusion threshold.

[0016] In an exemplary embodiment, the method further includes: determining the proportion of shared words based on the first word segmentation set and the second word segmentation set; and determining the fusion threshold based on the proportion of shared words.
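The threshold-dependent fusion in paragraphs [0015] and [0016] can be illustrated with a minimal Python sketch. The disclosure does not state how the shared-word proportion is computed, how it maps to the threshold, or what the weight values are, so the Jaccard-style ratio, the linear mapping in `fusion_threshold`, and the `high` weight of 0.7 are all assumptions made for illustration. The case where the key similarity equals the threshold is also unspecified; this sketch treats it like the "less than" case.

```python
def shared_word_ratio(first_words, second_words):
    """Proportion of shared word segments (Jaccard overlap; an assumed definition)."""
    a, b = set(first_words), set(second_words)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fusion_threshold(ratio, base=0.5, scale=0.2):
    """Hypothetical linear mapping from the shared-word ratio to the fusion threshold."""
    return base + scale * ratio


def fuse(semantic_sim, key_sim, threshold, high=0.7):
    """Combine the two similarities; `high` is the larger fusion weight (assumed value)."""
    if key_sim > threshold:
        # Key similarity above threshold: first (semantic) weight < second (key) weight.
        w_semantic, w_key = 1.0 - high, high
    else:
        # Otherwise: first (semantic) weight > second (key) weight.
        w_semantic, w_key = high, 1.0 - high
    return w_semantic * semantic_sim + w_key * key_sim


ratio = shared_word_ratio(["query", "table", "A"], ["query", "table", "B"])
threshold = fusion_threshold(ratio)
score = fuse(semantic_sim=0.9, key_sim=0.2, threshold=threshold)
```

Keeping the two weights complementary (summing to 1) is a design choice of this sketch so that the fused score stays in the same range as its inputs.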

[0017] According to another aspect of this disclosure, a text similarity matching apparatus is provided, comprising: a text acquisition module configured to acquire a first text and a second text; a word segmentation module configured to perform word segmentation on the first text and the second text to obtain a first word segmentation set and a second word segmentation set, respectively, the first word segmentation set including a plurality of first word segments and the second word segmentation set including a plurality of second word segments; a syntactic analysis module configured to perform syntactic analysis on the first text and the second text respectively using a syntactic analyzer, and to mark the first word segments and the second word segments with corresponding syntactic attributes; a semantic similarity module configured to calculate the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first word segments and the second word segments; a keyness analysis module configured to perform keyness analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and to label the first and second word segments with corresponding key tags; a key similarity module configured to calculate the key similarity between the first and second texts based on the key tags corresponding to each of the first and second word segments; a fusion weight module configured to determine a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison of the key similarity with a fusion threshold; and a similarity calculation module configured to calculate the similarity between the first and second texts based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity.

[0018] According to another aspect of this disclosure, an electronic device is provided, comprising: one or more processors; and a storage device configured to store one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the text similarity matching method as described in the above embodiments.

[0019] According to another aspect of this disclosure, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program that, when executed by a processor, implements the text similarity matching method as described in the above embodiments.

[0020] According to another aspect of this disclosure, a computer program product is provided, including a computer program/instructions, characterized in that, when the computer program/instructions are executed by a processor, the text similarity matching method as described in the above embodiments is implemented.

[0021] The text similarity matching method provided in this disclosure obtains a first text and a second text. On one hand, it uses a syntactic analyzer to mark the syntactic attributes of each word segment, and from these calculates the semantic similarity between the two texts. On the other hand, it uses a conditional random field to mark the key tag of each word segment, and from these calculates the key similarity between the two texts. Based on a comparison between the key similarity and a fusion threshold, a first fusion weight and a second fusion weight are determined. Based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity, the similarity between the first text and the second text is calculated. By calculating semantic similarity and key similarity separately, this method avoids the misjudgments caused by confusing sentence structure with entity semantics, thus improving the robustness and accuracy of text similarity matching in complex business scenarios.

[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Brief Description of the Drawings

[0023] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.

[0024] Figure 1 shows a schematic diagram of an exemplary system architecture according to an embodiment of this disclosure;

[0025] Figure 2 shows a flowchart of a text similarity matching method according to an embodiment of this disclosure;

[0026] Figure 3 shows a flowchart of a syntactic attribute tagging method according to an embodiment of this disclosure;

[0027] Figure 4 shows a flowchart of a semantic similarity matching method according to an embodiment of this disclosure;

[0028] Figure 5 shows a flowchart of a key tag labeling method according to an embodiment of this disclosure;

[0029] Figure 6 shows a flowchart of a key similarity matching method according to an embodiment of this disclosure;

[0030] Figure 7 shows a flowchart of a similarity fusion calculation method according to an embodiment of this disclosure;

[0031] Figure 8 shows a schematic diagram of the structure of a text similarity matching device according to an embodiment of this disclosure;

[0032] Figure 9 shows a schematic diagram of the structure of an electronic device suitable for implementing exemplary embodiments of this disclosure.

Detailed Description

[0033] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, they are provided so that this disclosure will be more comprehensive and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

[0034] Furthermore, the accompanying drawings are merely illustrative of this disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and therefore repeated descriptions of them will be omitted. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0035] It should be noted that the ordinal numbers such as "first" and "second" mentioned in the embodiments of this disclosure are used to distinguish multiple objects, and are not used to limit the order, timing, priority or importance of multiple objects. Furthermore, the descriptions of "first" and "second" do not limit the objects to necessarily being different.

[0036] Figure 1 shows a schematic diagram of an exemplary system architecture according to an embodiment of this disclosure.

[0037] As shown in Figure 1, the system architecture may include server 101, network 102, terminal device 103, terminal device 104, and terminal device 105. Network 102 serves as the medium for providing a communication link between terminal device 103, terminal device 104, or terminal device 105 and server 101. Network 102 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0038] Server 101 can be a server that provides various services, such as a back-end management server that supports the devices operated by users using terminal devices 103, 104, or 105. The back-end management server can analyze and process received requests and other data, and feed back the processing results to terminal devices 103, 104, or 105.

[0039] Terminal devices 103, 104, and 105 can be smartphones, tablets, laptops, desktop computers, smart speakers, wearable smart devices, virtual reality devices, augmented reality devices, etc., but are not limited to these.

[0040] It should be understood that the numbers of terminal devices 103, 104, and 105, networks 102, and servers 101 shown in Figure 1 are merely illustrative. Server 101 can be a single physical server, a server cluster consisting of multiple servers, or a cloud server. Depending on actual needs, the architecture can include any number of terminal devices, networks, and servers.

[0041] The steps of the method in the exemplary embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings and examples.

[0042] Figure 2 shows a flowchart of a text similarity matching method according to an embodiment of this disclosure. As shown in Figure 2, this text similarity matching method may include the following steps.

[0043] In step S210, the first text and the second text are obtained.

[0044] In related technologies, text similarity calculation suffers from at least three problems: First, in structural similarity judgment, it cannot effectively distinguish the semantic roles of function words and entity words, leading to misjudgments in similarity calculations for sentences with similar structures but different meanings (such as "How many tables are there in category A?" versus "How many tables are there in category B?"). Second, in terms of term equivalence recognition, existing algorithms lack the introduction and integration of domain knowledge, causing semantically equivalent term pairs (such as "CPU" and "central processing unit") to be judged as dissimilar due to large word vector distances, resulting in the fragmentation of synonym queries and missed retrievals. Furthermore, traditional methods use a fixed weight mechanism when fusing structural and entity features, failing to dynamically adjust the importance ratio according to the actual scenario, resulting in weak overall generalization ability. These shortcomings collectively limit the accuracy and robustness of existing technologies in real-world complex environments.

[0045] In this embodiment of the disclosure, a first text and a second text to be compared are obtained. This step may also integrate some basic text cleaning and standardization procedures. For example, this may include removing irrelevant special characters, standardizing character encoding formats (such as converting full-width characters to half-width characters), correcting obvious spelling errors, or uniformly converting the text to lowercase to eliminate interference from capitalization. Related text preprocessing procedures should also be considered within the scope of protection of this disclosure.
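The cleaning steps listed in paragraph [0045] could look like the following Python sketch. It is only an illustration: the function name `normalize_text` and the rule for what counts as an "irrelevant special character" (anything that is neither a word character nor whitespace) are assumptions, and spelling correction is left out because it would require a dictionary or model that the text does not describe.

```python
import re
import unicodedata


def normalize_text(text):
    """Basic cleaning: width folding, lowercasing, special-character removal."""
    # NFKC folds full-width letters, digits, punctuation and spaces to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase to remove interference from capitalization.
    text = text.lower()
    # Assumed rule: replace anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_text("Ｑｕｅｒｙ　ＣＰＵ！！")
```

Because `\w` is Unicode-aware for Python strings, Chinese characters survive this cleaning step unchanged.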

[0046] In step S220, the first text and the second text are segmented to obtain a first segmentation set and a second segmentation set, respectively; the first segmentation set includes a number of first segments; the second segmentation set includes a number of second segments.

[0047] In this embodiment, word segmentation algorithms (such as dictionary-based matching algorithms, statistical models, or deep learning models) are used to segment the input first and second texts into a series of meaningful word units, thereby generating a first word segmentation set and a second word segmentation set. The first word segmentation set includes several first words, and the second word segmentation set includes several second words. Each word segment is the smallest word unit that expresses a certain semantic meaning. The relevant word segmentation techniques are conventional techniques in this field and will not be described in detail here.

[0048] For example, the text "Query the 5G chip performance parameters of brand AA product BB in Q3 2023" can be split into a word segmentation set {query, Q3 2023, brand AA, product BB, 5G chip, performance parameters} using a word segmentation algorithm.
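As one illustration of the dictionary-based matching mentioned above, the sketch below applies greedy forward longest-match segmentation to the English rendering of this example. The lexicon, the two-token phrase limit, and the choice to drop tokens that are absent from the lexicon are all assumptions. A production system would more likely use an existing segmenter, and this sketch returns word segments in text order rather than in the order listed in the example.

```python
# Hypothetical domain lexicon, stored in lowercase for case-insensitive lookup.
LEXICON = {
    "query", "q3 2023", "brand aa", "product bb",
    "5g chip", "performance parameters",
}
MAX_PHRASE_TOKENS = 2  # longest lexicon entry, counted in whitespace tokens


def segment(text):
    """Greedy forward longest-match segmentation over whitespace tokens."""
    tokens = text.split()
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in LEXICON:
                segments.append(candidate)
                i += n
                break
        else:
            # Token is not in the lexicon (e.g. "the", "of", "in"): skip it.
            i += 1
    return segments


words = segment(
    "Query the 5G chip performance parameters of brand AA product BB in Q3 2023"
)
```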

[0049] In step S230, the first text and the second text are syntactically analyzed by the syntactic analyzer, and the first and second word segments are marked with their corresponding syntactic attributes.

[0050] In this embodiment of the disclosure, steps S230 and S240 compare the semantic similarity between texts by analyzing their syntactic structure. During this similarity calculation, a syntactic analyzer (e.g., Stanford CoreNLP, HanLP, etc.) is introduced to perform syntactic analysis on the first and second texts. The core task of syntactic analysis is to identify the grammatical relationships between words in a sentence, such as subject-verb, verb-object, and attributive-head relationships. Specifically, the analyzer outputs the syntactic attributes of each word segment. These attributes typically include the word's dependency relation type in the sentence, its corresponding headword index, and its own index as a dependent word, thereby constructing the dependency tree of the entire sentence.

[0051] In this embodiment of the disclosure, based on the syntactic attributes of each word in the first and second texts obtained by the syntactic analyzer, the corresponding syntactic attributes are marked for each first and second word respectively. For example, in the sentence "query the number of Class A tables", the syntactic analyzer will mark "query" as the root node (core verb), "number" as its object, and "Class A tables" as the modifier of "number". By marking the syntactic attributes of each word, the importance of each word in the overall semantic expression can be determined.

[0052] In an exemplary embodiment, some non-core word segments (e.g., articles, particles, etc.) can also be identified based on the syntactic attributes marked by the syntactic analyzer for each word segment. By filtering these non-core word segments out of the word segmentation set, the set of relations that make up the main text structure is preserved.
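The tagging and filtering described in paragraphs [0050] to [0052] can be sketched in Python on a hand-written parse of the example sentence "query the number of Class A tables". No real parser is called here: the `Token` record, the Universal Dependencies style relation labels, and the `NON_CORE` set are assumptions chosen for illustration, and an actual analyzer may emit different labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int   # position of this word as a dependent (1-based)
    word: str
    head: int    # index of the headword; 0 marks the root
    deprel: str  # dependency relation type


# Hand-written parse standing in for the syntactic analyzer's output.
PARSE = [
    Token(1, "query", 0, "root"),
    Token(2, "the", 3, "det"),
    Token(3, "number", 1, "obj"),
    Token(4, "of", 5, "case"),
    Token(5, "Class A tables", 3, "nmod"),
]

# Relation types treated as non-core (assumed list).
NON_CORE = {"det", "case", "punct", "aux"}


def filter_and_tag(tokens):
    """Drop non-core word segments and attach (relation, headword) attributes."""
    words_by_index = {t.index: t.word for t in tokens}
    return [
        (t.word, t.deprel, words_by_index.get(t.head, "ROOT"))
        for t in tokens
        if t.deprel not in NON_CORE
    ]


tagged = filter_and_tag(PARSE)
```

Here `tagged` keeps "query" as the root, "number" as its object, and "Class A tables" as a modifier of "number", matching the description in paragraph [0051].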

[0053] In step S240, the semantic similarity between the first text and the second text is calculated based on the syntactic attributes corresponding to each of the first and second segmented words.

[0054] In this embodiment, based on the aforementioned syntactic attributes corresponding to each first and second segment, the first text and the second text are respectively transformed into corresponding first and second feature vectors. Specifically, a high-dimensional feature vector is generated for the entire text based on the syntactic attributes of each segment and its position in the text. Because the syntactic attributes corresponding to each segment are introduced during the feature vector transformation process, the feature vector, while reflecting the original text information features, further highlights the feature expression of the core segment, thus better representing the semantic features of the text.

[0055] In this embodiment of the disclosure, the semantic similarity between the first and second texts is obtained by calculating the distance or angle between the first and second feature vectors in the vector space. For example, the cosine similarity between two feature vectors is calculated. This semantic similarity reflects the degree of similarity between the two texts in sentence structure and grammatical function.
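A minimal Python sketch of this step follows. The disclosure does not specify how the feature vector is built, so the sketch substitutes a sparse bag of (word, relation) features whose values come from a hypothetical `ROLE_WEIGHT` table. Only the cosine computation itself is taken from the text.

```python
import math
from collections import Counter

# Hypothetical emphasis per dependency relation; unknown relations get 0.3.
ROLE_WEIGHT = {"root": 1.0, "obj": 0.8, "nmod": 0.6}


def feature_vector(tagged_words):
    """Sparse vector keyed by (word, relation), weighted by syntactic role."""
    vector = Counter()
    for word, deprel in tagged_words:
        vector[(word, deprel)] += ROLE_WEIGHT.get(deprel, 0.3)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


first = feature_vector([("query", "root"), ("number", "obj"), ("Class A tables", "nmod")])
second = feature_vector([("query", "root"), ("number", "obj"), ("Class B tables", "nmod")])
semantic_sim = cosine(first, second)
```

With these assumed weights the two queries score about 0.82 even though they target different tables, which illustrates why a structure-oriented similarity cannot separate such queries on its own.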

[0056] In step S250, the first and second word segments are analyzed for key characteristics using a Conditional Random Field (CRF) based on a preset domain knowledge base, and the first and second word segments are labeled with corresponding key tags.

[0057] In this embodiment, steps S250 and S260 compare the key similarity between texts by analyzing the keyness of word segments. To address the problem of fragmented synonym queries in existing technologies, this similarity calculation process evaluates the importance of each word segment from a business perspective, integrates a predefined domain knowledge base, and employs a Conditional Random Field (CRF) to perform keyness analysis on each first and second word segment. The CRF is a discriminative probabilistic model used for sequence labeling, which combines contextual information to predict the most suitable label (such as entity type or key category) for each word in the text. By utilizing the contextual features of words, the CRF identifies entities belonging to a specific domain in the text, such as product names, component names, and brand names. The domain knowledge base predefines the types of these entities and the synonym mapping relationships between them (such as "CPU" and "central processing unit").

[0058] In this embodiment of the disclosure, based on the CRF recognition results and knowledge base queries, the system assigns key tags to each token in the first and second word segments. These tags are typically divided into several categories, such as "core words" identifying core business entities, "common words" identifying general vocabulary, and "synonyms" identifying words with synonymous relationships. Through this step, the algorithm no longer treats all word segments equally, but instead tags the keyness of word segments at the granularity of word segmentation.

[0059] In step S260, the key similarity between the first text and the second text is calculated based on the key tags corresponding to each of the first and second segmented words.

[0060] In this embodiment of the disclosure, based on the key tags assigned to each word segment by the conditional random field, different weight values are assigned to word segments of different keyness, and on this basis the key similarity between the first text and the second text is calculated. This key similarity reflects the similarity between corresponding word segments in the two texts. Unlike the aforementioned semantic similarity, which focuses on the syntactic structure of the text, the key similarity focuses on the similarity between corresponding word segments.

[0061] In an exemplary embodiment, several word segmentation pairs are constructed based on the correspondence between the words in the first word segmentation set and the second word segmentation set. Each word segmentation pair includes a first word and / or a second word that are in a corresponding relationship. When the two word segmentation sets contain two words that are in a corresponding relationship, these two words constitute a word segmentation pair. When a word in one word segmentation set has no corresponding word in the other word segmentation set, this word, together with a blank, constitutes a word segmentation pair.

[0062] When a word segmentation pair includes corresponding words from two word segmentation sets, the key similarity of the word segmentation pair is calculated based on the similarity between the two word segments and their corresponding key tags.

[0063] When a word segmentation pair consists of a word segment and a blank, the key similarity of the word segmentation pair is calculated based on the similarity between the word segment and the blank, as well as the key tag of the word segment.

[0064] Based on the key similarity corresponding to each of the above word segmentation pairs, the key similarity between the first text and the second text is determined.

[0065] In step S270, a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity are determined based on the comparison between the key similarity and the fusion threshold.

[0066] In this embodiment, the semantic similarity between the texts is obtained in steps S230 and S240 by analyzing the syntactic structure of the texts, and the key similarity between the texts is obtained in steps S250 and S260 by analyzing the keyness of the word segments. The semantic similarity and the key similarity thus measure the similarity between the two texts in terms of syntactic structure and corresponding word segments, respectively. Therefore, the semantic similarity and the key similarity need to be dynamically fused according to the actual context; that is, a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity are determined, and the final similarity is calculated from them.

[0067] In this embodiment of the disclosure, the fusion weights of the two similarities are determined based on a comparison between the key similarity and the fusion threshold. When the key similarity is greater than the fusion threshold, the two texts are highly matched in core word segments, and therefore a higher second fusion weight should be assigned to the key similarity. Conversely, when the key similarity is less than the fusion threshold, the two texts differ significantly in core word segments, and therefore a higher first fusion weight should be assigned to the semantic similarity.

[0068] This weighting mechanism gives the method strong adaptability to different scenarios. For example, in a reporting system that needs to accurately distinguish different business objects, high key similarity will trigger the "word segmentation-driven" mode, effectively avoiding misjudgments caused by similar sentence structures; while in open-ended question answering, when word segmentation matching is not high, the system will switch to the "syntactic-driven" mode to better capture semantically related generalized queries.

[0069] In step S280, the similarity between the first text and the second text is calculated based on the first fusion weight, the second fusion weight, semantic similarity, and key similarity.

[0070] In this embodiment of the disclosure, the similarity between the first text and the second text is calculated by weighted fusion based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity determined above.

[0071] Specifically, the semantic similarity is weighted by the first fusion weight, the key similarity is weighted by the second fusion weight, and the two weighted values are summed to obtain the similarity.

[0072] In an exemplary embodiment, the similarity can be calculated using the following formula:

[0073] $S = w_1 \cdot S_{\text{sem}} + w_2 \cdot S_{\text{key}}$

[0074] where $S$ is the similarity between the first and second texts; $S_{\text{sem}}$ is the semantic similarity; $S_{\text{key}}$ is the key similarity; $w_1$ is the first fusion weight; and $w_2$ is the second fusion weight.
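
The weighted fusion of step S280 described above can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation; the function name `fuse_similarity`, its parameter names, and the sample values are assumptions introduced here.

```python
def fuse_similarity(semantic_sim: float, key_sim: float,
                    w_semantic: float, w_key: float) -> float:
    """Weighted fusion of the two similarities.

    w_semantic plays the role of the first fusion weight (applied to the
    semantic similarity); w_key plays the role of the second fusion weight
    (applied to the key similarity).
    """
    return w_semantic * semantic_sim + w_key * key_sim


# Hypothetical input values, chosen only for illustration.
score = fuse_similarity(semantic_sim=0.5, key_sim=1.0,
                        w_semantic=0.2, w_key=0.8)
print(score)
```

With these sample weights the key similarity dominates the result, which corresponds to the case where the key similarity exceeds the fusion threshold.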

[0075] The text similarity matching method provided in this disclosure obtains a first text and a second text. On the one hand, it uses a syntactic analyzer to label each word segment with syntactic attributes, and calculates the semantic similarity between the two texts from them. On the other hand, it uses a conditional random field to label each word segment with a key tag, and calculates the key similarity between the two texts from them. Based on a comparison between the key similarity and a fusion threshold, a first fusion weight and a second fusion weight are determined. Based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity, the similarity between the first text and the second text is calculated. By calculating the semantic similarity and the key similarity separately, this method solves the problem of misjudgment caused by confusing sentence structure with entity semantics, thus improving the robustness and accuracy of text similarity matching in complex business scenarios.

[0076] Figure 3 shows a flowchart of a syntactic attribute tagging method according to an embodiment of the present disclosure. In this embodiment, based on the text similarity matching method shown in Figure 2, step S230 may include the following steps.

[0077] In step S310, the first text and the second text are syntactically analyzed by the syntactic analyzer to obtain the dependency relation types corresponding to each of the first and second segments, as well as the headword index and dependency word index of the first and second texts.

[0078] In this embodiment of the disclosure, a syntactic analyzer (such as Stanford CoreNLP) is used to perform deep syntactic analysis on the first and second texts, transforming the linear word sequence into a structured dependency tree.

[0079] In an exemplary embodiment, the parsing process aims to identify the syntactic relationships between the word segments in the text, and its output may be a set of triples $(r, h, d)$, where $r$ represents a specific type of dependency relationship (such as the subject-verb relation nsubj or the verb-object relation dobj), $h$ is the headword index, and $d$ is the dependency word index.

[0080] Through this step, the system gains a deep understanding of the text's grammatical structure, going beyond simple word order and bag-of-words models. The headword index and dependency word index provide precise coordinates for subsequent structured processing and vectorization, enabling the algorithm to track the position and role of each grammatical component in the original sentence.

[0081] In step S320, the first word segmentation set and the second word segmentation set are filtered according to the dependency relationship type.

[0082] In this embodiment, the first and second word segmentation sets are filtered based on the dependency relationship types obtained in step S310. The aim is to identify, within a complex syntactic network, the core components that constitute the sentence's main structure, and to eliminate decorative and auxiliary words that contribute little to the overall semantic framework. The filtering rules are typically based on a predefined "core relation set", for example {nsubj, dobj, root, attr}, where nsubj represents the subject, dobj represents the object, root represents the root node, and attr represents the modifier; other key dependency relationship types may also be included.

[0083] This process effectively achieves "semantic isolation," stripping away rhetorical elements and redundant information, allowing subsequent processing to focus on the core structure of the sentence, such as subject, verb, and object. This purified structural representation greatly reduces the interference of non-keywords in similarity calculation.
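
The filtering of steps S310 and S320 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the triples are written by hand instead of being produced by a parser, and the names `Dependency`, `CORE_RELATIONS`, and `filter_core_tokens`, as well as the rule of keeping a token when it is the dependent of a core relation, are illustrative choices that the text does not specify.

```python
from typing import List, NamedTuple, Set


class Dependency(NamedTuple):
    relation: str   # dependency relation type, e.g. "nsubj"
    head: int       # index of the head word in the token list
    dependent: int  # index of the dependent word in the token list


# Assumed core relation set, built from the relation types named in the text.
CORE_RELATIONS: Set[str] = {"nsubj", "dobj", "root", "attr"}


def filter_core_tokens(tokens: List[str], deps: List[Dependency],
                       core: Set[str] = CORE_RELATIONS) -> List[str]:
    """Keep tokens that are the dependent of a core relation, in sentence order."""
    keep = {d.dependent for d in deps if d.relation in core}
    return [tok for i, tok in enumerate(tokens) if i in keep]


# Hand-written parse for illustration; a real parser would supply the triples.
tokens = ["please", "query", "the", "parameters"]
deps = [
    Dependency("discourse", 1, 0),
    Dependency("root", -1, 1),  # -1 marks the artificial root (assumption)
    Dependency("det", 3, 2),
    Dependency("dobj", 1, 3),
]
core_tokens = filter_core_tokens(tokens, deps)
print(core_tokens)
```

In this toy sentence the discourse marker and the determiner are dropped, leaving only the root verb and its object for later processing.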

[0084] In step S330, based on the dependency relation type, the headword index, and the dependency word index, the first and second segments in the filtered first and second segment sets are respectively labeled with corresponding syntactic attributes.

[0085] In this embodiment of the disclosure, based on the determined dependency relationship type, headword index, and dependency word index, each word segment is labeled with its corresponding syntactic attributes in the syntactic structure. These syntactic attributes include not only its own dependency relations (such as "subject" and "object"), but may also include its depth in the syntactic tree, its distance from the root node, and its association with other core components. By labeling the corresponding syntactic attributes, the rich structured contextual information of each word in the text is reflected.

[0086] For example, for the first text "Query the 5G chip performance parameters of brand AA product BB in Q3 2023", after syntactic analysis by the syntactic analyzer, each word segment is marked with syntactic attributes [action: query; time: Q3 2023; subject: product BB; attribute: 5G chip performance parameters].

[0087] For example, for the second text "Get the specifications of the fifth-generation mobile communication processor of product BB in the third quarter", after syntactic analysis by the syntactic analyzer, each word segment is marked with syntactic attributes [action: get; time: third quarter; subject: product BB; attribute: specifications of the fifth-generation mobile communication processor].

[0088] Figure 4 shows a flowchart of a semantic similarity matching method according to an embodiment of the present disclosure. In this embodiment of the present disclosure, based on the text similarity matching method shown in Figure 2, step S240 may include the following steps.

[0089] In step S410, a first feature vector corresponding to the first text is generated based on the syntactic attributes of each of the first word segments and the first text.

[0090] In this embodiment, based on the syntactic attributes of each first segmentation word and the first text, relevant semantic features are extracted to generate a first feature vector corresponding to the first text. The input includes two parts: the syntactic attributes corresponding to each first segmentation word and the original first text. By combining the syntactic attributes marked on each first segmentation word, features are extracted from the first text to generate the first feature vector corresponding to the first text.

[0091] In an exemplary embodiment, a BERT model fine-tuned on a syntactic tree dataset is used as the encoder. Through this fine-tuned BERT model, the system can better capture the functional semantics of various syntactic structures and encode this structured syntactic information into a dense vector of fixed dimension (e.g., 768 dimensions). The syntactic skeleton relations of the first text are taken as the input to the model, and the first feature vector corresponding to the first text is obtained. The above feature vector transformation process can be represented by the following formula.

[0092]

[0093] In an exemplary embodiment, syntactic structure features can be further introduced during the feature vector generation process. The syntactic attributes of each of the aforementioned first word segments reflect the syntactic attributes of each word segment in the text. Syntactic structure features, on the other hand, reflect the structural features of the text in terms of syntactic structure. The first text can be analyzed syntactically using a syntactic structure analysis module to obtain the first syntactic structure features. This syntactic structure analysis module is used to analyze the syntactic structure of the text. Based on the aforementioned embodiment, a multi-head self-attention mechanism can be used to introduce syntactic structure features and the syntactic attributes of word segments during the feature vector generation process to extract features from the first text and generate a first feature vector corresponding to the first text. The above feature vector transformation process can be represented by the following formula.

[0094]

[0095] In step S420, a second feature vector corresponding to the second text is generated based on the syntactic attributes of each of the second segmented words and the second text.

[0096] Referring to the method for generating the first feature vector in step S410 above, a second feature vector corresponding to the second text can be generated based on the syntactic attributes of each second word and the second text. The relevant processing procedures have been described in the preceding steps and will not be repeated here.

[0097] In step S430, the semantic similarity between the first text and the second text is calculated based on the first feature vector and the second feature vector.

[0098] In this embodiment of the disclosure, based on the aforementioned first feature vector and second feature vector, a cosine similarity algorithm is used to calculate the cosine similarity between the two feature vectors, which is taken as the semantic similarity between the first text and the second text. Specifically, it can be expressed as the following formula.

[0099] $S_{\text{sem}} = \cos(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$

[0100] where $S_{\text{sem}}$ is the semantic similarity; $v_1$ is the first feature vector; and $v_2$ is the second feature vector.
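
The cosine computation of step S430 can be sketched in Python as follows. The function name `cosine_similarity` and the convention of returning 0 for a zero vector are assumptions made for this illustration; the feature vectors themselves would come from the encoder described above.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here for zero vectors (assumption, not from the text).
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors; real feature vectors would be e.g. 768-dimensional.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

Because the measure depends only on direction, two vectors that differ only in length are treated as identical in meaning.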

[0101] Figure 5 shows a flowchart of a key labeling method according to an embodiment of this disclosure. In this embodiment, based on the text similarity matching method shown in Figure 2, step S250 may include the following steps.

[0102] In step S510, based on a preset domain knowledge base, the first and second word segments are analyzed for key characteristics using a conditional random field, and the key tags of each of the first and second word segments are marked as core words or ordinary words.

[0103] In this embodiment, the Conditional Random Field (CRF) leverages its powerful sequence labeling capabilities to analyze the contextual features of each segment, accurately identifying words in the text that belong to specific domain entities, such as product names, component names, and brand names. It then assigns an initial key label to each identified first and second segment. This label primarily distinguishes between "core words" and "ordinary words," and its determination directly relies on a predefined set of entities in the domain knowledge base.

[0104] For example, in the telecommunications field, words such as "brand AA", "product BB", and "5G chip" will be identified and marked as "core words" by CRF, while general words such as "query", "quantity", and "parameter" will be marked as "ordinary words".

[0105] In step S520, in response to the first word segmentation and the second word segmentation satisfying the synonym mapping relationship in the domain knowledge base, the key tags of the first word segmentation and the second word segmentation are marked as synonyms; the domain knowledge base stores a set of synonym mapping relationships.

[0106] In this embodiment, based on the aforementioned labeling of "core words" and "ordinary words," the key tags of words with synonym mapping relationships in the first and second word segmentation sets are further labeled as synonyms based on the synonym mapping relationships in the domain knowledge base. By labeling synonyms, words that appear different but are semantically equivalent are associated and given special labels. This ensures that in subsequent similarity calculations, text pairs such as "CPU specifications" and "central processing unit specifications" will not be misjudged as having low similarity due to differences in the word forms of their core entities.

[0107] In an exemplary embodiment, the domain knowledge base K may include a domain entity set and a synonym mapping relationship. The domain entity set includes the terminology entities used for terminology identification. The synonym mapping relationship includes the mapping relationships between synonyms in the domain, for example, between "CPU" and "central processing unit". The synonym mapping relationship may further include synonym weights, which quantify the semantic relevance between synonyms.
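
The knowledge base structure and the two labeling stages of steps S510 and S520 can be sketched in Python as follows. This is only an illustration under loud assumptions: the conditional random field is replaced by a plain lookup against the entity set, and the names `DomainKnowledgeBase`, `tag_segments`, the tag strings, and the synonym weight of 1.0 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass
class DomainKnowledgeBase:
    # Terminology entities of the domain.
    entities: Set[str] = field(default_factory=set)
    # Unordered synonym pair -> synonym weight (semantic relevance).
    synonyms: Dict[FrozenSet[str], float] = field(default_factory=dict)

    def are_synonyms(self, a: str, b: str) -> bool:
        return a != b and frozenset((a, b)) in self.synonyms


def tag_segments(first: List[str], second: List[str],
                 kb: DomainKnowledgeBase) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign a key tag ("core", "common" or "synonym") to every segment."""
    # Stage 1: stand-in for the CRF, a simple entity-set lookup (assumption).
    tags_1 = {w: "core" if w in kb.entities else "common" for w in first}
    tags_2 = {w: "core" if w in kb.entities else "common" for w in second}
    # Stage 2: relabel cross-text pairs found in the synonym mapping.
    for a in first:
        for b in second:
            if kb.are_synonyms(a, b):
                tags_1[a] = "synonym"
                tags_2[b] = "synonym"
    return tags_1, tags_2


kb = DomainKnowledgeBase(
    entities={"brand AA", "product BB", "5G chip",
              "fifth-generation mobile communication processor"},
    synonyms={frozenset(("5G chip",
                         "fifth-generation mobile communication processor")): 1.0},
)
first = ["query", "brand AA", "product BB", "5G chip"]
second = ["get", "product BB", "fifth-generation mobile communication processor"]
tags_1, tags_2 = tag_segments(first, second, kb)
print(tags_1)
print(tags_2)
```

Storing each synonym pair as an unordered `frozenset` makes the lookup symmetric, so the order in which the two texts are compared does not matter.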

[0108] Through the above steps S510 and S520, the process of performing key analysis on the first and second word segments using conditional random fields can be expressed as the following formula.

[0109]

[0110] where the symbols denote, respectively: each word segment; the key tag defined by the domain knowledge base, which is one of core word, common word, or synonym; and the position of the word segment in the text.

[0111] For example, for the first text "Query the 5G chip performance parameters of brand AA product BB in Q3 2023", after criticality analysis using a conditional random field, key tags are assigned to each word segment, as follows:

[0112]

[0113] For example, for the second text "Retrieve the specification data of the fifth-generation mobile communication processor of product BB in the third quarter", after criticality analysis using a conditional random field, key tags are assigned to each word segment, as follows:

[0114]

[0115] Figure 6 shows a flowchart of a key similarity matching method according to an embodiment of this disclosure. In this embodiment, based on the text similarity matching method shown in Figure 2, step S260 may include the following steps.

[0116] In step S610, several word pairs are constructed based on the correspondence between the words in the first word set and the second word set; the word pairs include the first word and / or the second word that are in a corresponding relationship.

[0117] In this embodiment of the disclosure, a preprocessing process is included before the critical similarity calculation, which aims to systematically establish the correspondence between the words in the first word segmentation set and the second word segmentation set. Specifically, this process is accomplished by constructing "word segmentation pairs", that is, pairing the first word in the first text with the corresponding second word in the second text to form several word segmentation pairs.

[0118] In an exemplary embodiment, the word segmentation pair includes a first word and / or a second word that are in a corresponding relationship. When two word segmentation sets include two words that are in a corresponding relationship, then the two words constitute a word segmentation pair. When one word in one word segmentation set does not have a corresponding word in the other word segmentation set, then that word and a blank constitute a word segmentation pair.

[0119] Taking the first and second texts mentioned above as an example, based on the correspondence between the words in their first and second word segmentation sets, the word segmentation pairs {Brand AA, ∅}, {Product BB, Product BB}, and {5G chip, fifth-generation mobile communication processor} are constructed, where ∅ denotes a blank.
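
The pair construction of step S610 can be sketched in Python as follows. The text does not define how correspondence is decided, so this sketch assumes that two words correspond when they are identical or listed as synonyms; `build_pairs`, `corresponds`, and `SYNONYMS` are hypothetical names, and a blank is represented by `None`.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # None stands for a blank


def build_pairs(first: List[str], second: List[str],
                corresponds: Callable[[str, str], bool]) -> List[Pair]:
    """Pair each first-set word with a corresponding second-set word, or a blank."""
    pairs: List[Pair] = []
    used = set()  # indices of second-set words that are already paired
    for a in first:
        match = next((j for j, b in enumerate(second)
                      if j not in used and corresponds(a, b)), None)
        if match is None:
            pairs.append((a, None))
        else:
            used.add(match)
            pairs.append((a, second[match]))
    # Second-set words left without a partner are paired with a blank.
    pairs.extend((None, b) for j, b in enumerate(second) if j not in used)
    return pairs


# Assumed synonym table for the running example.
SYNONYMS = {frozenset(("5G chip",
                       "fifth-generation mobile communication processor"))}


def corresponds(a: str, b: str) -> bool:
    return a == b or frozenset((a, b)) in SYNONYMS


first = ["brand AA", "product BB", "5G chip"]
second = ["product BB", "fifth-generation mobile communication processor"]
pairs = build_pairs(first, second, corresponds)
print(pairs)
```

The greedy first-match strategy is the simplest option; a production system might prefer an optimal assignment when several candidates correspond to the same word.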

[0120] In step S620, the key weights are determined based on the key tags corresponding to the word segments in the word segmentation pair.

[0121] In this embodiment of the disclosure, based on the construction of word segmentation pairs, a key weight is determined according to the key tags corresponding to the words in the word segmentation pair. This weight directly reflects the importance of the word segmentation pair in the overall similarity assessment.

[0122] In an exemplary embodiment, the key weight allocation strategy is based on preset rules: In response to the key tag corresponding to a word segment in the word segmentation pair being a core word, the key weight is determined as a first weight; in response to the key tag corresponding to a word segment in the word segmentation pair being a common word, the key weight is determined as a second weight; in response to the key tag corresponding to a word segment in the word segmentation pair being a synonym, the key weight is determined as a third weight. This can be expressed by the following formula:

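[0123] $w_k = \begin{cases} w_c, & \text{key tag is core word} \\ w_o, & \text{key tag is common word} \\ w_s, & \text{key tag is synonym} \end{cases}$

[0124] where $w_k$ is the key weight of the word segmentation pair; $w_c$ is the first weight, corresponding to core words; $w_o$ is the second weight, corresponding to common words; and $w_s$ is the third weight, corresponding to synonyms.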

[0125] The first weight is greater than the second weight, which is greater than the third weight. For example, the first weight is 0.9, the second weight is 0.3, and the third weight is 0.1.

[0126] In an exemplary embodiment, the synonym mapping relationship in the domain knowledge base as described above may further include synonym weights between each synonym. Based on this, the third weight can be determined through the following steps.

[0127] Based on the synonym mapping relationship in the domain knowledge base, determine the synonym weights corresponding to the words in the word segmentation pair;

[0128] The third weight is determined based on the synonym weights.

[0129] The core idea of this mechanism is to transform domain knowledge into quantifiable influence parameters. Through differentiated weight allocation, the algorithm can "differentiate" different word matches when calculating similarity. Whether or not the core word is matched will have a decisive impact on the final result, while the matching of synonyms can contribute positive benefits without causing false positives. By reducing the weight of synonyms, the impact of similarity differences introduced by the differences in word segmentation between related synonyms on the final similarity calculation is reduced, thereby eliminating the impact of different expressions between synonyms on the similarity calculation.

[0130] In step S630, the similarity between the word segments in the word segmentation pair is calculated to obtain the word segmentation similarity.

[0131] In this embodiment of the disclosure, based on the aforementioned determined word segmentation pairs, the similarity between two words in the word segmentation pair is calculated to obtain the word segmentation similarity. This can be expressed by the following formula:

[0132] $s(a, b) = \mathrm{sim}(a, b)$

[0133] where $a$ and $b$ are the two word segments in a word segmentation pair, $A$ is the first word segmentation set, $B$ is the second word segmentation set, $a \in A$, and $b \in B$.

[0134] It should be noted that this word segmentation similarity is based solely on the semantic similarity between the two word segments and does not consider the influence of the aforementioned key tags. There are many methods for matching the similarity between related words, and this disclosure does not limit them.

[0135] In an exemplary embodiment, if the words in a word segmentation pair are synonyms, their word segmentation similarity can be a fixed value, such as 1. If the word segmentation pair consists of a word and a blank, its word segmentation similarity can be a fixed value, such as 0.

[0136] In step S640, the word pair similarity of the word segmentation pair is calculated based on the key weight and the word segmentation similarity.

[0137] In this embodiment, the results of the two preceding steps are combined to calculate the word pair similarity of each word segmentation pair. Specifically, the word pair similarity can be obtained by multiplying the key weight by the word segmentation similarity. This can be expressed by the following formula, in which $p(a, b)$ is the word pair similarity, $w_k(a, b)$ is the key weight, and $s(a, b)$ is the word segmentation similarity of the pair $\{a, b\}$:

[0138] $p(a, b) = w_k(a, b) \cdot s(a, b)$

[0139] For example, the word segmentation pair {brand AA, ∅} has a word segmentation similarity of 0 and a key weight (core word) of 0.9, so its word pair similarity is 0. The word segmentation pair {product BB, product BB} has a word segmentation similarity of 1 and a key weight (core word) of 0.9, so its word pair similarity is 0.9. The word segmentation pair {5G chip, fifth-generation mobile communication processor} has a word segmentation similarity (synonym) of 1 and a key weight (synonym) of 0.1, so its word pair similarity is 0.1.
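
Steps S620 to S640 can be sketched in Python as follows, using the example weights 0.9, 0.3, and 0.1 and the fixed similarity values given above. The exact-match fallback for pairs that are neither synonyms nor blanks is an assumption, since the text leaves the word-level similarity method open; `KEY_WEIGHTS`, `segment_similarity`, and `pair_similarity` are hypothetical names.

```python
from typing import Optional

# Example key weights from the text: core word, common word, synonym.
KEY_WEIGHTS = {"core": 0.9, "common": 0.3, "synonym": 0.1}


def segment_similarity(a: Optional[str], b: Optional[str], tag: str) -> float:
    """Word segmentation similarity; None stands for a blank."""
    if a is None or b is None:
        return 0.0  # fixed value for a word paired with a blank
    if tag == "synonym":
        return 1.0  # fixed value for synonyms
    # Fallback: exact string match (assumption; any word similarity could be used).
    return 1.0 if a == b else 0.0


def pair_similarity(a: Optional[str], b: Optional[str], tag: str) -> float:
    """Word pair similarity = key weight x word segmentation similarity."""
    return KEY_WEIGHTS[tag] * segment_similarity(a, b, tag)


for a, b, tag in [
    ("brand AA", None, "core"),
    ("product BB", "product BB", "core"),
    ("5G chip", "fifth-generation mobile communication processor", "synonym"),
]:
    print(a, b, tag, pair_similarity(a, b, tag))
```

Keeping the key weight separate from the word-level similarity means the word similarity method can be replaced without touching the domain-specific weighting.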

[0140] In step S650, the key similarity between the first text and the second text is determined based on the word pair similarity of each word pair.

[0141] In this embodiment of the disclosure, the overall key similarity between the first text and the second text is determined based on the word pair similarity of each word segmentation pair. It should be noted that there are many ways to aggregate the word pair similarities into the overall key similarity between two texts, and this disclosure does not limit such methods.

[0142] In an exemplary embodiment, the key similarity between the first text and the second text can be obtained by weighting the word segmentation similarity of each word segmentation pair. This can be expressed by the following formula:

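[0143] $S_{\text{key}} = \sum_{a \in A,\, b \in B} \lambda_{a,b} \cdot w_k(a, b) \cdot s(a, b)$

[0144] where $a$ and $b$ are the two word segments in a word segmentation pair, $A$ is the first word segmentation set, $B$ is the second word segmentation set, $s(a, b)$ is the word segmentation similarity, $w_k(a, b)$ is the key weight, and $\lambda_{a,b}$ is the weight assigned to the word segmentation pair.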

[0145] In an exemplary embodiment, the key similarity between the first text and the second text can be determined based on the maximum value of the word segmentation similarity among the respective word segmentation pairs. This can be expressed by the following formula:

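[0146] $S_{\text{key}} = \max_{a \in A,\, b \in B} \; w_k(a, b) \cdot s(a, b)$

[0147] where $a$ and $b$ are the two word segments in a word segmentation pair, $A$ is the first word segmentation set, $B$ is the second word segmentation set, $s(a, b)$ is the word segmentation similarity, and $w_k(a, b)$ is the key weight.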
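
The two aggregation options of step S650 can be sketched in Python as follows. The inputs are word pair similarities that already include the key weight. The uniform default for the pair weights and the function names `key_similarity_weighted` and `key_similarity_max` are assumptions introduced for illustration.

```python
from typing import Optional, Sequence


def key_similarity_weighted(pair_sims: Sequence[float],
                            pair_weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the word pair similarities."""
    if not pair_sims:
        return 0.0
    if pair_weights is None:
        # Uniform pair weights (assumption; the text does not fix their values).
        pair_weights = [1.0 / len(pair_sims)] * len(pair_sims)
    if len(pair_weights) != len(pair_sims):
        raise ValueError("one weight is required per word pair")
    return sum(w * s for w, s in zip(pair_weights, pair_sims))


def key_similarity_max(pair_sims: Sequence[float]) -> float:
    """Maximum word pair similarity; 0.0 when there are no pairs."""
    return max(pair_sims, default=0.0)


# Word pair similarities from the running example.
pair_sims = [0.0, 0.9, 0.1]
print(key_similarity_weighted(pair_sims))
print(key_similarity_max(pair_sims))
```

The weighted variant rewards texts that agree on many segments, while the maximum variant reacts to a single strong match on a core word.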

[0148] Figure 7 shows a flowchart of a similarity fusion calculation method according to an embodiment of the present disclosure. In this embodiment, based on the text similarity matching method shown in Figure 2, step S270 may include the following steps.

[0149] In step S710, in response to the key similarity being greater than the fusion threshold, it is determined that the first fusion weight is less than the second fusion weight.

[0150] In this embodiment, the fusion weights of the two similarities are determined by comparing the key similarity with a fusion threshold. When the key similarity is greater than the fusion threshold, the two texts are highly matched in core word segments, and therefore a higher second fusion weight should be assigned to the key similarity; that is, the first fusion weight is lower than the second fusion weight. For example, the first fusion weight is 0.2, and the second fusion weight is 0.8.

[0151] In an exemplary embodiment, the value of the fusion threshold can also be dynamically determined based on the conditions of the first word segmentation set and the second word segmentation set. Specifically, this may include the following steps.

[0152] The proportion of shared words is determined based on the first word segmentation set and the second word segmentation set;

[0153] The fusion threshold is determined based on the proportion of shared word segments.

[0154] In this embodiment, the fusion threshold is determined by statistically analyzing the proportion of shared words in the first and second word segments. This can be expressed by the following formula:

[0155]

[0156] where the quantities in the formula are, respectively: the fusion threshold; the initial value of the fusion threshold; a smoothing coefficient; the number of words shared between the two word segmentation sets; and the total number of words in the two word segmentation sets.
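
The proportion of shared words used for the dynamic threshold can be sketched in Python as follows. The sketch covers only the proportion itself and assumes that the "total number of words" counts each distinct word of the two sets once (their union); how the proportion is combined with the initial value and the smoothing coefficient is left to the formula of this embodiment. The name `shared_proportion` is hypothetical.

```python
from typing import Iterable


def shared_proportion(first: Iterable[str], second: Iterable[str]) -> float:
    """Share of distinct words that occur in both word segmentation sets."""
    first_set, second_set = set(first), set(second)
    total = first_set | second_set  # union of distinct words (assumption)
    if not total:
        return 0.0
    return len(first_set & second_set) / len(total)


first = ["query", "brand AA", "product BB", "5G chip"]
second = ["get", "product BB", "fifth-generation mobile communication processor"]
print(shared_proportion(first, second))
```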

[0157] In step S720, in response to the key similarity being less than the fusion threshold, it is determined that the first fusion weight is greater than the second fusion weight.

[0158] In this embodiment of the disclosure, when the key similarity is less than the fusion threshold, the two texts differ significantly in core word segments, and therefore a higher first fusion weight should be assigned to the semantic similarity. For example, the first fusion weight is 0.8, and the second fusion weight is 0.2.
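
The threshold comparison of steps S710 and S720 can be sketched in Python as follows, using the example weights 0.2 and 0.8 from the text. The function name `fusion_weights` and the equal split returned when the key similarity equals the threshold are assumptions, since the text does not describe that case.

```python
from typing import Tuple


def fusion_weights(key_sim: float, threshold: float,
                   high: float = 0.8, low: float = 0.2) -> Tuple[float, float]:
    """Return (first fusion weight, second fusion weight)."""
    if key_sim > threshold:
        # Core segments match well: the key similarity gets the higher weight.
        return low, high
    if key_sim < threshold:
        # Core segments differ: the semantic similarity gets the higher weight.
        return high, low
    # Equality is not covered by the text; an even split is assumed here.
    return 0.5, 0.5


print(fusion_weights(key_sim=0.9, threshold=0.5))
print(fusion_weights(key_sim=0.1, threshold=0.5))
```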

[0159] Based on the same inventive concept, this disclosure provides a text similarity matching device, as described in the following embodiments. Since the principle by which this device solves the problem is similar to that of the method embodiments described above, repeated details are omitted.

[0160] Figure 8 shows a schematic diagram of the structure of a text similarity matching device according to an embodiment of this disclosure. As shown in Figure 8, the text similarity matching device 800 may include: a text acquisition module 810, a word segmentation module 820, a syntactic analysis module 830, a semantic similarity module 840, a key analysis module 850, a key similarity module 860, a fusion weight module 870, and a similarity calculation module 880.

[0161] The text acquisition module 810 is configured to acquire the first text and the second text.

[0162] The word segmentation module 820 is configured to perform word segmentation processing on the first text and the second text to obtain a first word segmentation set and a second word segmentation set, respectively; the first word segmentation set includes a plurality of first words; the second word segmentation set includes a plurality of second words;

[0163] The syntactic analysis module 830 is configured to perform syntactic analysis on the first text and the second text respectively through a syntactic analyzer, and to mark the corresponding syntactic attributes on the first word segment and the second word segment respectively.

[0164] The semantic similarity module 840 is configured to calculate the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first and second segmented words;

[0165] The key analysis module 850 is configured to perform key analysis on the first and second word segments based on a preset domain knowledge base using a conditional random field, and to mark the first and second word segments with corresponding key tags respectively.

[0166] The key similarity module 860 is configured to calculate the key similarity between the first text and the second text based on the key tags corresponding to each of the first and second segmented words;

[0167] The fusion weight module 870 is configured to determine a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison between the key similarity and the fusion threshold.

[0168] The similarity calculation module 880 is configured to calculate the similarity between the first text and the second text based on the first fusion weight, the second fusion weight, semantic similarity, and key similarity.
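As a non-limiting illustration, the calculation performed by the similarity calculation module 880 may be sketched as a normalized weighted sum. The weighted-sum form, the function name `fuse_similarity`, and its parameter names are assumptions made for this sketch; this disclosure only states that the result is based on the two fusion weights and the two similarities.

```python
def fuse_similarity(semantic_sim: float, key_sim: float,
                    w_semantic: float, w_key: float) -> float:
    """Combine the two similarities with their fusion weights.

    Assumption: a normalized weighted sum. The source does not fix the formula.
    """
    total = w_semantic + w_key
    if total <= 0:
        raise ValueError("fusion weights must sum to a positive value")
    # Dividing by the weight total keeps the result in the range of the inputs.
    return (w_semantic * semantic_sim + w_key * key_sim) / total
```

For example, `fuse_similarity(0.8, 0.4, 0.3, 0.7)` weights the key similarity more heavily than the semantic similarity.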

[0169] In an exemplary embodiment, the syntactic analysis module 830 is further configured to: perform syntactic analysis on the first text and the second text respectively using a syntactic analyzer to obtain the dependency relation type corresponding to each first word segment and each second word segment, as well as the head word indexes and dependent word indexes of the first text and the second text; filter the first word segmentation set and the second word segmentation set according to the dependency relation types; and mark the first word segments and second word segments in the filtered first and second word segmentation sets with corresponding syntactic attributes according to the dependency relation types, head word indexes, and dependent word indexes.
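As a non-limiting illustration, the filtering and marking performed by the syntactic analysis module 830 may be sketched as follows. The `ParsedToken` structure, the set `FILTERED_DEP_TYPES`, and the shape of the returned attribute are hypothetical; this disclosure does not specify which dependency relation types are filtered out or how the syntactic attribute is encoded.

```python
from dataclasses import dataclass


@dataclass
class ParsedToken:
    index: int       # position of this (dependent) word in the sentence
    word: str
    dep_type: str    # dependency relation type reported by the parser
    head_index: int  # index of the head word; -1 for the root


# Hypothetical relation types to drop; the source does not list them.
FILTERED_DEP_TYPES = {"punct", "aux", "case"}


def mark_syntactic_attributes(tokens):
    """Filter tokens by relation type, then attach a syntactic attribute."""
    kept = [t for t in tokens if t.dep_type not in FILTERED_DEP_TYPES]
    return [
        {
            "word": t.word,
            "attribute": {
                "dep_type": t.dep_type,
                "head": t.head_index,
                "dependent": t.index,
            },
        }
        for t in kept
    ]
```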

[0170] In an exemplary embodiment, the semantic similarity module 840 is further configured to generate a first feature vector corresponding to the first text based on the syntactic attributes of each of the first segmented words and the first text; generate a second feature vector corresponding to the second text based on the syntactic attributes of each of the second segmented words and the second text; and calculate the semantic similarity between the first text and the second text based on the first feature vector and the second feature vector.
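As a non-limiting illustration, once the first and second feature vectors are available, the semantic similarity may be computed as a cosine similarity. Cosine similarity is an assumption made for this sketch; this disclosure only states that the semantic similarity is calculated from the two feature vectors, and the generation of the vectors themselves is outside the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)
```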

[0171] In an exemplary embodiment, the semantic similarity module 840 is further configured to perform syntactic structure analysis on the first text and the second text through the syntactic structure analysis module to obtain first syntactic structure features and second syntactic structure features respectively; generate a first feature vector corresponding to the first text based on the syntactic attributes of each of the first segmented words, the first syntactic structure features and the first text; and generate a second feature vector corresponding to the second text based on the syntactic attributes of each of the second segmented words, the second syntactic structure features and the second text.

[0172] In an exemplary embodiment, the key analysis module 850 is further configured to: perform key analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and mark the key tag of each first and second word segment as a core word or an ordinary word; and, in response to a first word segment and a second word segment satisfying a synonym mapping relationship in the domain knowledge base, mark the key tags of those word segments as synonyms. The domain knowledge base stores a set of synonym mapping relationships.
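As a non-limiting illustration, the synonym marking step may be sketched as a post-processing pass over tags that a conditional random field tagger has already produced. The tagger itself is not shown. The `SYNONYMS` set, the tag strings, and the function name are hypothetical and stand in for the preset domain knowledge base.

```python
# Hypothetical synonym mapping; a real domain knowledge base would supply it.
SYNONYMS = {frozenset({"broadband", "fiber internet"})}


def apply_synonym_tags(tags1, tags2, synonyms=SYNONYMS):
    """Override key tags with 'synonym' where a cross-text pair is mapped.

    tags1 and tags2 map each word segment to 'core' or 'ordinary',
    for example as produced by a CRF tagger (assumed, not shown).
    """
    out1, out2 = dict(tags1), dict(tags2)
    for w1 in tags1:
        for w2 in tags2:
            if w1 != w2 and frozenset({w1, w2}) in synonyms:
                out1[w1] = "synonym"
                out2[w2] = "synonym"
    return out1, out2
```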

[0173] In an exemplary embodiment, the key similarity module 860 is further configured to: construct several word segmentation pairs based on the correspondence between the word segments in the first word segmentation set and the second word segmentation set, each word segmentation pair including a first word segment and / or a second word segment that are in a corresponding relationship; determine key weights based on the key tags corresponding to the word segments in the word segmentation pair; calculate the similarity between the word segments in the word segmentation pair to obtain a word segmentation similarity; calculate the word segmentation pair similarity based on the key weight and the word segmentation similarity; and determine the key similarity between the first text and the second text based on the word segmentation pair similarity of each word segmentation pair.
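As a non-limiting illustration, the similarity of a single word segmentation pair may be sketched as the product of its key weight and the similarity between its two word segments. The product form and the character-level ratio from Python's `difflib.SequenceMatcher` are assumptions made for this sketch; this disclosure does not specify the word-level similarity measure. A pair that contains only one word segment (the "and / or" case) is treated here as having zero similarity, which is also an assumption.

```python
from difflib import SequenceMatcher


def word_similarity(w1, w2):
    """Character-level similarity in [0, 1]; 0.0 if either side is missing."""
    if w1 is None or w2 is None:
        return 0.0
    return SequenceMatcher(None, w1, w2).ratio()


def pair_similarity(w1, w2, key_weight):
    """Scale the word similarity by the key weight of the pair (assumed form)."""
    return key_weight * word_similarity(w1, w2)
```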

[0174] In an exemplary embodiment, the key similarity module 860 is further configured to: determine the key weight as a first weight in response to the key tag corresponding to the word segment in the word segmentation pair being a core word; determine the key weight as a second weight in response to the key tag corresponding to the word segment in the word segmentation pair being an ordinary word; and determine the key weight as a third weight in response to the key tag corresponding to the word segment in the word segmentation pair being a synonym; wherein the first weight is greater than the second weight, and the second weight is greater than the third weight.

[0175] In an exemplary embodiment, the key similarity module 860 is further configured to determine the synonym weights corresponding to the words in the word segmentation pair based on the synonym mapping relationship in the domain knowledge base; and to determine the third weight based on the synonym weights.
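As a non-limiting illustration, the mapping from key tags to key weights may be sketched as follows. The numeric values, the cap `THIRD_WEIGHT_CAP`, and the assumption that a synonym weight lies in the range 0 to 1 are illustrative only; this disclosure requires only that the first weight exceed the second weight, that the second weight exceed the third weight, and that the third weight be derived from the synonym weights.

```python
FIRST_WEIGHT = 1.0      # core words (illustrative value)
SECOND_WEIGHT = 0.5     # ordinary words (illustrative value)
THIRD_WEIGHT_CAP = 0.4  # keeps every third weight below SECOND_WEIGHT


def key_weight(tag, synonym_weight=1.0):
    """Return the key weight for a key tag.

    synonym_weight is assumed to come from the synonym mapping
    relationship in the domain knowledge base and to lie in [0, 1].
    """
    if tag == "core":
        return FIRST_WEIGHT
    if tag == "ordinary":
        return SECOND_WEIGHT
    if tag == "synonym":
        if not 0.0 <= synonym_weight <= 1.0:
            raise ValueError("synonym_weight must be in [0, 1]")
        # Scaling by the cap preserves first > second > third.
        return THIRD_WEIGHT_CAP * synonym_weight
    raise ValueError(f"unknown key tag: {tag!r}")
```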

[0176] In an exemplary embodiment, the key similarity module 860 is further configured to perform a weighted calculation based on the word pair similarity of each of the word pairs to obtain the key similarity between the first text and the second text.

[0177] In an exemplary embodiment, the key similarity module 860 is further configured to determine the key similarity between the first text and the second text based on the maximum value among the word pair similarities of the word pairs.
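As a non-limiting illustration, the two aggregation alternatives described above (a weighted calculation and a maximum) may be sketched as follows. The weighted-average form and the function names are assumptions made for this sketch; this disclosure does not specify which weights are used in the weighted calculation.

```python
def aggregate_weighted(pair_sims, weights):
    """Weighted average of the word pair similarities (assumed form)."""
    if len(pair_sims) != len(weights):
        raise ValueError("pair_sims and weights must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(pair_sims, weights)) / total


def aggregate_max(pair_sims):
    """Largest word pair similarity; 0.0 when there are no pairs."""
    return max(pair_sims, default=0.0)
```

The maximum variant lets one strongly matching pair dominate, while the weighted variant reflects all pairs.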

[0178] In an exemplary embodiment, the fusion weight module 870 is further configured to determine that the first fusion weight is less than the second fusion weight in response to the key similarity being greater than the fusion threshold; and to determine that the first fusion weight is greater than the second fusion weight in response to the key similarity being less than the fusion threshold.
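As a non-limiting illustration, the comparison performed by the fusion weight module 870 may be sketched as follows. The concrete values 0.7 and 0.3 and the equal split used when the key similarity equals the fusion threshold are assumptions made for this sketch; this disclosure specifies only the ordering of the two fusion weights on each side of the threshold.

```python
def fusion_weights(key_sim, threshold, high=0.7, low=0.3):
    """Return (first_fusion_weight, second_fusion_weight).

    The first weight applies to the semantic similarity and the second
    to the key similarity. high and low are illustrative values.
    """
    if key_sim > threshold:
        return low, high   # key similarity dominates
    if key_sim < threshold:
        return high, low   # semantic similarity dominates
    # Equality is not addressed in the source; split evenly (assumption).
    return 0.5, 0.5
```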

[0179] In an exemplary embodiment, the fusion weight module 870 is further configured to determine the proportion of shared words based on the first word segmentation set and the second word segmentation set; and to determine the fusion threshold based on the proportion of shared words.
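As a non-limiting illustration, the proportion of shared words and a fusion threshold derived from it may be sketched as follows. Measuring the proportion as the intersection over the union of the two word segmentation sets, the linear mapping, its direction, and the bounds 0.3 and 0.7 are all assumptions made for this sketch; this disclosure states only that the fusion threshold is determined based on the proportion of shared words.

```python
def shared_word_ratio(words1, words2):
    """Share of distinct word segments that occur in both sets."""
    s1, s2 = set(words1), set(words2)
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)


def fusion_threshold(ratio, low=0.3, high=0.7):
    """Map a ratio in [0, 1] linearly onto [low, high] (hypothetical rule)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return low + (high - low) * ratio
```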

[0180] Figure 9 shows a schematic diagram of the structure of an electronic device suitable for implementing exemplary embodiments of the present disclosure. An electronic device 900 according to this embodiment of the present invention is described below with reference to Figure 9. The electronic device 900 shown in Figure 9 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.

[0181] As shown in Figure 9, the electronic device 900 is presented in the form of a general-purpose computing device. The components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including storage unit 920 and processing unit 910), and a display unit 940.

[0182] Storage unit 920 may include readable media in the form of volatile storage units, such as random access memory (RAM) 9201 and / or cache memory 9202, and may further include read-only memory (ROM) 9203.

[0183] The storage unit 920 may also include a program / utility 9204 having a set of (at least one) program modules 9205. Such program modules 9205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.

[0184] Bus 930 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.

[0185] Electronic device 900 can also communicate with one or more external devices 970 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 900, and / or with any device that enables electronic device 900 to communicate with one or more other computing devices (e.g., router, modem, etc.). This communication can be performed via input / output (I / O) interface 950. Furthermore, electronic device 900 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 960. As shown, network adapter 960 communicates with other modules of electronic device 900 via bus 930. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0186] In exemplary embodiments of this disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the methods described above is stored.

[0187] In some possible implementations, various aspects of the present invention may also be implemented as a program product comprising program code that, when the program product is run on a terminal device, causes the terminal device to perform the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present invention.

[0188] According to embodiments of the present invention, a program product for implementing the above-described method may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system, apparatus, or device.

[0189] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0190] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium, capable of sending, propagating, or transmitting programs for use by or in conjunction with an instruction execution system, apparatus, or device.

[0191] The program code contained on the readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.

[0192] Program code for performing the operations of this invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0193] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0194] Furthermore, although the steps of the method in this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or a step may be broken down into multiple steps.

[0195] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, server, mobile terminal, or network device, etc.) to execute the method according to the embodiments of this disclosure.

[0196] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the appended claims.

[0197] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. A text similarity matching method, characterized in that, the method includes: acquiring a first text and a second text; performing word segmentation on the first text and the second text to obtain a first word segmentation set and a second word segmentation set, respectively, wherein the first word segmentation set includes several first word segments and the second word segmentation set includes several second word segments; performing syntactic analysis on the first text and the second text respectively through a syntactic analyzer, and marking the first word segments and the second word segments with corresponding syntactic attributes; calculating the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first and second word segments; performing key analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and marking the first and second word segments with corresponding key tags; calculating the key similarity between the first text and the second text based on the key tags corresponding to each of the first and second word segments; determining a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison between the key similarity and a fusion threshold; and calculating the similarity between the first text and the second text based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity; wherein the step of performing key analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and marking the first and second word segments with corresponding key tags, includes: performing key analysis on the first and second word segments using the conditional random field based on the preset domain knowledge base, and marking the key tag of each first and second word segment as a core word or an ordinary word; and, in response to a first word segment and a second word segment satisfying a synonym mapping relationship in the domain knowledge base, marking the key tags of those word segments as synonyms, the domain knowledge base storing a set of synonym mapping relationships; and wherein the step of calculating the key similarity between the first text and the second text based on the key tags corresponding to each of the first and second word segments includes: constructing several word segmentation pairs based on the correspondence between the word segments in the first word segmentation set and the second word segmentation set, each word segmentation pair including a first word segment and / or a second word segment that are in a corresponding relationship; determining key weights based on the key tags corresponding to the word segments in the word segmentation pair; calculating the similarity between the word segments in the word segmentation pair to obtain a word segmentation similarity; calculating the word segmentation pair similarity of the word segmentation pair based on the key weight and the word segmentation similarity; and determining the key similarity between the first text and the second text based on the word segmentation pair similarity of each word segmentation pair.

2. The method according to claim 1, characterized in that, the step of performing syntactic analysis on the first text and the second text respectively through a syntactic analyzer, and marking the first word segments and the second word segments with corresponding syntactic attributes, includes: performing syntactic analysis on the first text and the second text respectively through the syntactic analyzer to obtain the dependency relation type corresponding to each first word segment and each second word segment, as well as the head word indexes and dependent word indexes of the first text and the second text; filtering the first word segmentation set and the second word segmentation set according to the dependency relation types; and marking the first word segments and second word segments in the filtered first and second word segmentation sets with corresponding syntactic attributes according to the dependency relation types, the head word indexes, and the dependent word indexes.

3. The method according to claim 1, characterized in that, the step of calculating the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first and second word segments includes: generating a first feature vector corresponding to the first text based on the syntactic attributes of each first word segment and the first text; generating a second feature vector corresponding to the second text based on the syntactic attributes of each second word segment and the second text; and calculating the semantic similarity between the first text and the second text based on the first feature vector and the second feature vector.

4. The method according to claim 3, characterized in that, the method further includes: performing syntactic structure analysis on the first text and the second text through a syntactic structure analysis module to obtain first syntactic structure features and second syntactic structure features, respectively; generating the first feature vector corresponding to the first text based on the syntactic attributes of each first word segment, the first syntactic structure features, and the first text; and generating the second feature vector corresponding to the second text based on the syntactic attributes of each second word segment, the second syntactic structure features, and the second text.

5. The method according to claim 1, characterized in that, the step of determining key weights based on the key tags corresponding to the word segments in the word segmentation pair includes: in response to the key tag corresponding to the word segment in the word segmentation pair being a core word, determining the key weight as a first weight; in response to the key tag corresponding to the word segment in the word segmentation pair being an ordinary word, determining the key weight as a second weight; and in response to the key tag corresponding to the word segment in the word segmentation pair being a synonym, determining the key weight as a third weight; wherein the first weight is greater than the second weight, and the second weight is greater than the third weight.

6. The method according to claim 5, characterized in that, the step of determining the key weight as a third weight in response to the key tag corresponding to the word segment in the word segmentation pair being a synonym includes: determining the synonym weights corresponding to the word segments in the word segmentation pair based on the synonym mapping relationship in the domain knowledge base; and determining the third weight based on the synonym weights.

7. The method according to claim 1, characterized in that, the step of determining the key similarity between the first text and the second text based on the word segmentation pair similarity of each word segmentation pair includes: performing a weighted calculation based on the word segmentation pair similarity of each word segmentation pair to obtain the key similarity between the first text and the second text.

8. The method according to claim 1, characterized in that, the step of determining the key similarity between the first text and the second text based on the word segmentation pair similarity of each word segmentation pair includes: determining the key similarity between the first text and the second text based on the maximum value among the word segmentation pair similarities of the word segmentation pairs.

9. The method according to claim 1, characterized in that, the step of determining a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison between the key similarity and the fusion threshold includes: in response to the key similarity being greater than the fusion threshold, determining that the first fusion weight is less than the second fusion weight; and in response to the key similarity being less than the fusion threshold, determining that the first fusion weight is greater than the second fusion weight.

10. The method according to claim 9, characterized in that, the method further includes: determining the proportion of shared words based on the first word segmentation set and the second word segmentation set; and determining the fusion threshold based on the proportion of shared words.

11. A text similarity matching device, characterized in that, the device includes: a text acquisition module configured to acquire a first text and a second text; a word segmentation module configured to perform word segmentation on the first text and the second text to obtain a first word segmentation set and a second word segmentation set, respectively, wherein the first word segmentation set includes several first word segments and the second word segmentation set includes several second word segments; a syntactic analysis module configured to perform syntactic analysis on the first text and the second text respectively through a syntactic analyzer, and to mark the first word segments and the second word segments with corresponding syntactic attributes; a semantic similarity module configured to calculate the semantic similarity between the first text and the second text based on the syntactic attributes corresponding to each of the first and second word segments; a key analysis module configured to perform key analysis on the first and second word segments using a conditional random field based on a preset domain knowledge base, and to mark the first and second word segments with corresponding key tags; a key similarity module configured to calculate the key similarity between the first text and the second text based on the key tags corresponding to each of the first and second word segments; a fusion weight module configured to determine a first fusion weight corresponding to the semantic similarity and a second fusion weight corresponding to the key similarity based on a comparison between the key similarity and a fusion threshold; and a similarity calculation module configured to calculate the similarity between the first text and the second text based on the first fusion weight, the second fusion weight, the semantic similarity, and the key similarity; wherein the key analysis module is further configured to perform key analysis on the first and second word segments using the conditional random field based on the preset domain knowledge base, mark the key tag of each first and second word segment as a core word or an ordinary word, and, in response to a first word segment and a second word segment satisfying a synonym mapping relationship in the domain knowledge base, mark the key tags of those word segments as synonyms, the domain knowledge base storing a set of synonym mapping relationships; and wherein the key similarity module is further configured to: construct several word segmentation pairs based on the correspondence between the word segments in the first word segmentation set and the second word segmentation set, each word segmentation pair including a first word segment and / or a second word segment that are in a corresponding relationship; determine key weights based on the key tags corresponding to the word segments in the word segmentation pair; calculate the similarity between the word segments in the word segmentation pair to obtain a word segmentation similarity; calculate the word segmentation pair similarity of the word segmentation pair based on the key weight and the word segmentation similarity; and determine the key similarity between the first text and the second text based on the word segmentation pair similarity of each word segmentation pair.

12. An electronic device, characterized in that, the electronic device includes: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any one of claims 1 to 10.

13. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method as described in any one of claims 1 to 10 is implemented.

14. A computer program product, comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, the method as described in any one of claims 1 to 10 is implemented.