A patent matching method and apparatus based on user needs
By using a patent feature extraction model based on the Transformer architecture and the dynamic BM25 algorithm, the problems of low accuracy and efficiency in existing patent search technologies are solved, achieving efficient and accurate patent matching and outputting a high-quality collection of patent texts.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING ZHONGZHI SMART TECH CO LTD
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-30
AI Technical Summary
Existing patent search technologies rely on the completeness and accuracy of user input, making it difficult to understand the user's true technical needs. This results in a lot of noisy data and low accuracy in the search results, making it impossible to quickly obtain patents that highly match the user's needs, thus leading to low search efficiency.
A bidirectional encoder representation model based on the Transformer architecture (BERT) is adopted, combined with a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism, to generate a modular fusion feature vector matrix of patent texts. The patent feature library is constructed and accurately matched by dynamically adjusting the parameters of the BM25 algorithm.
It improves the accuracy and efficiency of patent searches, reduces missed and false detections, and outputs a collection of patent texts that precisely meet user needs, reducing the time cost for users to select patents and enhancing the user experience.
Figure CN122309723A_ABST
Abstract
Description
Technical Field
[0001] The embodiments of the present invention relate to the field of big data technology, and in particular to a patent matching method and apparatus based on user needs.
Background Technology
[0002] In patent information retrieval and matching scenarios, existing conventional retrieval methods typically rely solely on user-inputted technical requirements. They use general word segmentation tools for simple word segmentation, breaking down user needs into basic keywords before directly conducting a full-text search in patent databases. This retrieval model over-relies on the completeness and accuracy of user input, and is limited by the basic word segmentation tools' ability to recognize patent-related terminology, technical features, and contextual semantics. This makes it difficult to accurately understand the user's true technical demands and the core technical content of patent documents. The retrieval process is prone to issues such as keyword matching deviations and missing semantic understanding, resulting in a large amount of irrelevant patent information, high noise levels, and low accuracy in the returned results. Users often need to sift through massive amounts of search results, leading to low retrieval efficiency and an inability to quickly and reliably obtain patents that highly match their technical needs. This fails to meet the requirements of users in practical application scenarios for accurate and efficient patent retrieval.
[0003] The background section described above is merely a description made by the inventor based on his understanding, and the above content should not be regarded as evidence of prior art disclosed before the filing date of this application.
Summary of the Invention
[0004] This invention provides a patent matching method based on user needs to improve retrieval efficiency and the matching degree between patent retrieval results and user needs. The method includes: Historical patent texts are input into the patent feature extraction model to generate multiple module fusion feature vector matrices for each historical patent text. All module fusion feature vector matrices are vectorized and stored in the patent feature library to be matched. The patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism. After obtaining user needs, the feature word set of user needs is extracted. Based on the technical field and word frequency distribution corresponding to user needs, the parameters of the BM25 algorithm are dynamically adjusted. Using the BM25 algorithm with adjusted parameters, patent texts are screened from the patent feature library to be matched based on the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vector of the screened patent texts. Based on the patent feature extraction model, a module fusion feature vector matrix corresponding to user needs is generated. Based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
[0005] Another aspect of the present invention provides a patent matching device based on user needs to improve the matching degree between patent search results and user needs, the device comprising: The patent feature library construction module is used to input historical patent texts into the patent feature extraction model, generate multiple module fusion feature vector matrices for each historical patent text, vectorize all module fusion feature vector matrices, and store them in the patent feature library to be matched. The patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture, and integrates a word frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism. The coarse ranking module is used to extract the feature word set of user needs after obtaining user needs, dynamically adjust the parameters of the BM25 algorithm according to the technical field and word frequency distribution corresponding to the user needs, and use the parameter-adjusted BM25 algorithm to filter patent texts from the patent feature library to be matched according to the feature word set to obtain a coarse ranking patent text set. The coarse ranking patent text set includes the module fusion feature vector of the selected patent texts. The fine-ranking module is used to generate a module fusion feature vector matrix corresponding to user needs based on the patent feature extraction model. Based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
[0006] This invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned patent matching method based on user needs.
[0007] This invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned patent matching method based on user needs.
[0008] This invention also provides a computer program product, which includes a computer program that, when executed by a processor, implements the aforementioned patent matching method based on user needs.
[0009] In this embodiment of the invention, historical patent texts are input into a patent feature extraction model to generate multiple module fusion feature vector matrices for each historical patent text. All module fusion feature vector matrices are vectorized and stored in a patent feature library to be matched. The patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture, integrating a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism. After obtaining user requirements, a set of feature words corresponding to the user requirements is extracted. Based on the technical field and term frequency distribution corresponding to the user requirements, the parameters of the BM25 algorithm are dynamically adjusted. Using the parameter-adjusted BM25 algorithm, patent texts are filtered from the patent feature library to be matched based on the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vectors of the filtered patent texts. Based on the patent feature extraction model, a module fusion feature vector matrix corresponding to the user requirements is generated. Based on the module fusion feature vector matrix corresponding to the user requirements, patent texts are filtered from the coarse-ranked patent text set to obtain a fine-ranked patent text set.
[0010] Compared to existing technologies that decompose user needs into basic keywords and then directly perform full-text searches in patent databases, the solution proposed in this invention uses a patent feature extraction model based on the bidirectional encoder representation model (BERT) with a Transformer architecture. This overcomes the limitations of traditional feature extraction models that only focus on textual semantics and ignore domain characteristics and differences in text granularity. Through a term frequency-inverse document frequency mechanism, it weakens the interference of common redundant words, adapting to the characteristics of patent texts with dense technical terminology and high domain differentiation. Simultaneously, combined with a multi-granularity sentence distance supervision mechanism, it can accurately capture the semantic relationships and feature differences of different modules of the patent text (title, abstract, claims, etc.). The generated module fusion feature vector matrix can comprehensively and accurately represent the core technical features of each historical patent. This feature vector matrix is vectorized and stored in the patent feature library to be matched, ensuring that the patent features in the library have high recognizability and high completeness, providing reliable feature support for subsequent fast and accurate matching, and reducing the problems of missed detections and false detections caused by feature extraction bias. The invention also overcomes the shortcoming of fixed parameters in existing uses of the BM25 algorithm by dynamically adjusting the core parameters of the BM25 algorithm based on the technical field and word frequency distribution corresponding to user needs. 
It flexibly adjusts the parameters to adapt to different scenario requirements, taking into account the characteristics of patent texts in different technical fields (such as communications, machinery, and chemical engineering) and the differences in word frequency distribution of user demand feature word sets (such as different proportions of high-frequency core technical terms). This avoids problems such as low recall accuracy and excessive redundant patents caused by fixed parameters. Using the parameter-adjusted BM25 algorithm, combined with the user demand feature word set, to screen patents from the patent feature library to be matched, it can quickly filter out a set of coarse-ranked patent texts that initially match the user's technical field and core features. This ensures the recall rate of coarse-ranking (avoiding the omission of potential matching patents) while effectively controlling the size of the coarse-ranked set, eliminating a large number of patents irrelevant to user needs, significantly reducing the invalid workload of subsequent fine-ranking screening, and improving the overall efficiency of patent matching. Employing a patent feature extraction model consistent with historical patent feature extraction, a module-fused feature vector matrix corresponding to user needs is generated. This ensures complete consistency in the extraction logic and format between user need features and historical patent features, avoiding matching deviations caused by inconsistencies in feature dimensions and extraction rules. Simultaneously, using the module-fused feature vector matrix corresponding to user needs as the core search condition, precise filtering is performed from the coarsely ranked patent text set. This focuses on the core technical features of user needs, achieving a deep technical match between user needs and patent texts. 
Compared to existing fine-ranking methods based solely on keyword matching, this significantly improves the accuracy of the fine-ranked patent text set. The final output fine-ranked patent text set accurately matches the user's personalized technical needs, providing high-quality candidate solutions for subsequent patent transactions and technical references, reducing the time cost for users to screen patents, and enhancing the user experience. Through a complete process design of "patent feature database construction - coarse ranking and screening - fine ranking and screening," a complete patent matching closed loop is formed. The optimized design of the patent feature extraction model enables it to adapt to patent texts in different technical fields. The dynamic parameter adjustment of the BM25 algorithm allows it to adapt to the word frequency distribution and domain characteristics of different user needs. The dual-stage screening mode of coarse and fine ranking ensures both matching efficiency and accuracy. The entire technical solution requires no manual intervention in parameter adjustment and feature screening, has a high degree of automation, and can be stably applied to patent matching scenarios with multiple technical fields and diverse user needs. It solves the problems of weak adaptability, poor stability, and high manual costs of existing patent matching technologies, and has broad application prospects.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. In the drawings:
Figure 1 is a flowchart of a patent matching method based on user needs in an embodiment of the present invention;
Figure 2 is a schematic diagram of the structure of the patent feature extraction model in an embodiment of the present invention;
Figure 3 is a schematic diagram of the encoding layer structure of the patent feature extraction model in an embodiment of the present invention;
Figure 4 is a structural diagram of a patent matching device based on user needs in an embodiment of the present invention;
Figure 5 is a schematic diagram of a computer device in an embodiment of the present invention.
Detailed Implementation
[0012] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. Here, the illustrative embodiments of the present invention and their descriptions are used to explain the present invention, but are not intended to limit the present invention.
[0013] As mentioned above, existing technologies suffer from low retrieval efficiency and are unable to meet users' requirements for accuracy and efficiency in patent retrieval in practical application scenarios. To address this issue, this invention proposes a patent resource matching scheme.
[0014] Figure 1 is a flowchart of a patent matching method based on user needs in an embodiment of the present invention. As shown in Figure 1, the method includes: Step 101: Input the historical patent texts into the patent feature extraction model to generate multiple module fusion feature vector matrices for each historical patent text. Vectorize all the module fusion feature vector matrices and store them in the patent feature library to be matched. The patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism.
[0015] Step 102: After obtaining user requirements, extract the feature word set of user requirements. Based on the technical field and word frequency distribution corresponding to user requirements, dynamically adjust the parameters of the BM25 algorithm. Using the BM25 algorithm with adjusted parameters, filter patent texts from the patent feature library to be matched based on the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vectors of the selected patent texts.
[0016] Step 103: Based on the patent feature extraction model, generate a module fusion feature vector matrix corresponding to user needs. Based on the module fusion feature vector matrix corresponding to user needs, filter patent texts from the coarse-ranked patent text set to obtain a fine-ranked patent text set.
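The two-stage flow of steps 102 and 103 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: it uses the standard Okapi BM25 formula with parameters `k1` and `b` passed in by the caller (the rule for adjusting them by technical field and word frequency distribution is not reproduced here), it represents each patent by a single toy vector instead of a module fusion feature vector matrix, and it uses cosine similarity for the fine ranking, which is an assumption and not something the text above specifies. All function names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match(query_terms, query_vec, docs, doc_vecs, k1, b, coarse_k, fine_k):
    """Coarse-rank with BM25, then fine-rank the survivors by vector similarity."""
    scores = bm25_scores(query_terms, docs, k1, b)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    coarse = order[:coarse_k]
    fine = sorted(coarse, key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)[:fine_k]
    return coarse, fine

# Toy data (hypothetical): three tokenized patents and their feature vectors.
docs = [["gear", "shaft", "bearing"], ["antenna", "signal", "gear"],
        ["polymer", "catalyst"]]
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
coarse, fine = match(["gear", "shaft"], [0.9, 0.1], docs, doc_vecs,
                     k1=1.2, b=0.75, coarse_k=2, fine_k=1)
print(coarse, fine)
```

In this sketch the coarse stage only narrows the candidate set; the more expensive vector comparison then runs on the few patents that survive it.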
[0017] As can be seen from the process shown in Figure 1, compared with the existing technology that decomposes user needs into basic keywords and then directly performs full-text retrieval in the patent database, the present invention, through a patent feature extraction model based on the Transformer-based bidirectional encoder representation model (BERT), overcomes the limitations of traditional feature extraction models that only focus on text semantics and ignore domain characteristics and differences in text granularity. Through the term frequency-inverse document frequency mechanism, it can weaken the interference of common redundant words, adapting to the characteristics of patent texts with dense technical terminology and high domain differentiation. Simultaneously, combined with a multi-granularity sentence distance supervision mechanism, it can accurately capture the semantic relationships and feature differences of different modules of the patent text (title, abstract, claims, etc.), and the generated module fusion feature vector matrix can comprehensively and accurately represent the core technical features of each historical patent. This feature vector matrix is vectorized and stored in the patent feature library to be matched, ensuring that the patent features in the feature library have high recognizability and high completeness, providing reliable feature support for subsequent fast and accurate matching, and reducing the problems of missed detections and false detections caused by feature extraction bias. The invention also overcomes the shortcoming of existing technologies in which the BM25 algorithm parameters remain fixed. It dynamically adjusts the core parameters of the BM25 algorithm based on the technological field and word frequency distribution corresponding to user needs. 
For patent text characteristics in different technological fields (such as communications, machinery, and chemical engineering) and differences in word frequency distribution of user demand feature word sets (such as different proportions of high-frequency core technical terms), the parameters are flexibly adjusted to adapt to scenario requirements, avoiding problems such as low coarse-ranking recall accuracy and excessive redundant patents caused by fixed parameters. Using the parameter-adjusted BM25 algorithm, combined with the user demand feature word set, patents are screened from the patent feature library to be matched. This allows for the rapid selection of a coarse-ranked patent text set that initially matches the user's technological field and core features. This ensures high recall rate in the coarse-ranking (avoiding the omission of potential matching patents) while effectively controlling the size of the coarse-ranked set, eliminating a large number of patents irrelevant to user needs, significantly reducing the unnecessary workload of subsequent fine-ranking screening, and improving the overall efficiency of patent matching. Employing a patent feature extraction model consistent with historical patent feature extraction, a module-fused feature vector matrix corresponding to user needs is generated. This ensures complete consistency in the extraction logic and format between user need features and historical patent features, avoiding matching deviations caused by inconsistencies in feature dimensions and extraction rules. Simultaneously, using the module-fused feature vector matrix corresponding to user needs as the core search condition, precise filtering is performed from the coarsely ranked patent text set. This focuses on the core technical features of user needs, achieving a deep technical match between user needs and patent texts. 
Compared to existing fine-ranking methods based solely on keyword matching, this significantly improves the accuracy of the fine-ranked patent text set. The final output fine-ranked patent text set accurately matches the user's personalized technical needs, providing high-quality candidate solutions for subsequent patent transactions and technical references, reducing the time cost for users to screen patents, and enhancing the user experience. Through a complete process design of "patent feature database construction - coarse ranking and screening - fine ranking and screening," a complete patent matching closed loop is formed: the optimized design of the patent feature extraction model enables it to adapt to patent texts in different technical fields; the dynamic adjustment of the parameters of the BM25 algorithm enables it to adapt to the word frequency distribution and domain characteristics of different user needs; and the dual-stage screening mode of coarse and fine ranking ensures both matching efficiency and accuracy. The entire technical solution requires no manual intervention in parameter adjustment and feature screening, has a high degree of automation, and can be stably applied to patent matching scenarios with multiple technical fields and diverse user needs. It solves the problems of weak adaptability, poor stability, and high manual costs of existing patent matching technologies, and has broad application prospects.
[0018] In step 101, historical patent texts are input into the patent feature extraction model to generate multiple module fusion feature vector matrices for each historical patent text. All module fusion feature vector matrices are vectorized and stored in the patent feature library to be matched. The patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism.
[0019] Figure 2 is a schematic diagram of the structure of the patent feature extraction model in an embodiment of the present invention. In this embodiment, the patent feature extraction model in step 101 includes: The input layer 201 is used to generate an embedding vector matrix corresponding to each module text of the patent text based on the patent text, the patent terminology definition table, the technical field tags, and the preset patent structure tags.
[0020] Encoding layer 202 is used to fuse contextual semantics and module structure priority in the embedding vector matrix corresponding to each module text, and generate a module fusion feature vector matrix based on the domain fusion coefficient.
[0021] The task output layer 203 is used to perform mask position token prediction, sentence pair fine-grained distance prediction, and token module text prediction based on the module fusion feature vector matrix. It calculates the corresponding loss based on the predicted mask position token prediction result, sentence pair fine-grained distance, and token module text. It fuses all calculated losses to obtain the fusion loss, and updates the parameters of the patent feature extraction model through backpropagation based on the fusion loss.
[0022] In this embodiment of the invention, the core function of the input layer is to receive various types of raw and auxiliary data related to the patent, preprocess and transform these data, and finally generate an embedding vector matrix corresponding to each module text in the patent text, providing basic input for the feature processing of the encoding layer. The input data required by the input layer includes the full text of the patent (e.g., "A BERT-based patent retrieval method in the mechanical field, which relates to the field of patent retrieval technology..."), a patent terminology definition table (e.g., containing mechanical field-specific terms such as patent retrieval, BERT model, and mechanical structure), the technical field tag corresponding to the patent (e.g., mechanical field), and preset patent structure tags (e.g., technical field, abstract, claims, etc.). The encoding layer takes the embedding vector matrix corresponding to each module text output by the input layer as input and performs deep processing on the embedding vector matrix of each module text. On the one hand, it fuses the contextual semantic features within the module text; on the other hand, it fuses the structural priority features of each module of the patent. Simultaneously, it combines the fusion coefficient of the technical field to which the patent belongs (e.g., mechanical field fusion coefficient 0.7) to adaptively fuse the two types of features, ultimately generating a module fusion feature vector matrix corresponding to each module text. The task output layer takes the module fusion feature vector matrix output by the encoding layer as its core input. Based on this matrix, it sequentially performs three types of prediction tasks: mask position token prediction, sentence pair fine-grained distance prediction, and token-to-module text prediction. 
For each type of prediction task, the corresponding loss value is calculated, and the three types of loss values are then fused to obtain a fused loss value. Finally, based on this fused loss value, the backpropagation algorithm is used to update and optimize all parameters of the patent feature extraction model, thereby improving the model's feature extraction accuracy and patent domain adaptability.
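As a rough illustration of the loss fusion step, the three task losses could be combined as a weighted sum before backpropagation. The weighted-sum form and the default weights below are assumptions made for this sketch; the text above states only that the three losses are fused, not how. The function name is hypothetical.

```python
def fused_loss(mask_loss, distance_loss, module_loss, weights=(1.0, 1.0, 1.0)):
    """Fuse the three task losses into one scalar (assumed weighted sum).

    mask_loss:     loss of the mask position token prediction task
    distance_loss: loss of the sentence pair fine-grained distance prediction task
    module_loss:   loss of the token-to-module text prediction task
    weights:       hypothetical per-task weights, not taken from the source
    """
    w_mask, w_dist, w_mod = weights
    return w_mask * mask_loss + w_dist * distance_loss + w_mod * module_loss

# Example with made-up loss values.
total = fused_loss(0.5, 0.25, 0.25)
print(total)
```

In a real training loop the resulting scalar would be the quantity on which backpropagation is run, so that all three tasks update the shared encoder parameters together.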
[0023] In this embodiment, the process of generating an embedding vector matrix corresponding to each module text of the patent text based on the patent text, the patent terminology definition table, the technical field tags, and the preset patent structure tags in the input layer may further include: splitting each patent text into a module text sequence according to the preset patent structure tags, and forming a patent structure embedding vector; segmenting each module text in the module text sequence into a token sequence, and mapping the token sequence into a fixed-dimensional token embedding vector; encoding the position information of each token sequence in the module text sequence to form a position embedding vector; converting the technical field tags to which each patent text belongs into a domain embedding vector; mapping the patent terminology definitions in the patent terminology definition table into an external knowledge embedding vector; extracting three-dimensional technical features of function, behavior, and structure from each module text and forming them into triples to obtain a three-dimensional technical feature embedding vector; encoding the examination process and legal status of the patent text to form a lifecycle embedding vector; for each module text, obtaining a feature word importance feature vector based on the extracted term frequency-inverse document frequency features and the generated term frequency-inverse document frequency word weight vector; and concatenating the patent structure embedding vector, the token embedding vector, the position embedding vector, the domain embedding vector, the external knowledge embedding vector, the three-dimensional technical feature embedding vector, the lifecycle embedding vector, and the feature word importance feature vector by dimension to form the embedding vector matrix corresponding to each module text.
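The final concatenation step can be pictured with a small sketch. It assumes each of the eight embeddings is stored as a per-token matrix (one row per token) and that "concatenated by dimension" means joining the rows along the feature axis; the toy sizes and the helper name are hypothetical.

```python
def concat_embeddings(*parts):
    """Concatenate per-token embedding matrices along the feature dimension.

    Each part is a list of rows, one row per token. All parts must cover
    the same number of tokens; row widths may differ between parts.
    """
    n_tokens = len(parts[0])
    if any(len(p) != n_tokens for p in parts):
        raise ValueError("all embedding matrices must cover the same tokens")
    return [[x for p in parts for x in p[i]] for i in range(n_tokens)]

# Eight toy embeddings (structure, token, position, domain, external knowledge,
# 3D technical feature, lifecycle, feature word importance), each covering
# 2 tokens with 2 features. Real embeddings would be far wider.
parts = [[[float(k)] * 2 for _ in range(2)] for k in range(8)]
matrix = concat_embeddings(*parts)
print(len(matrix), len(matrix[0]))
```

With eight parts of width 2, each token row of the result has 16 features, which is the embedding vector matrix handed to the encoding layer in this simplified picture.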
[0024] In the above embodiments, when splitting each patent text into a module text sequence according to a preset patent structure tag and forming a patent structure embedding vector, a preset patent structure tag is required. In this embodiment of the invention, the preset patent structure tag includes technical field, abstract, claims (independent or dependent), background art, invention content, embodiments, etc. Then the entire patent text will be split into independent module texts such as technical field, abstract, claims (independent or dependent), background art, invention content, embodiments, etc. Each module text is tokenized separately to form a module text sequence, such as an abstract module text sequence and a claim 1 text sequence. Based on the module text sequence, a patent structure embedding vector can be formed.
[0025] Token embedding vectors are used to capture basic semantic features; positional embedding vectors can preserve textual word order features; domain embedding vectors are used to encode domain-specific features, such as mechanical, communication, and chemical engineering features; patent structure embedding vectors are used to encode module attribute features; based on dependency parsing technology, three-dimensional technical features of function, behavior, and structure are extracted from the patent text, and FBS triples are encoded into three-dimensional technical feature embedding vectors through GraphEncoder; the patent examination process and legal status (grant, rejection, invalidation) are encoded into lifecycle embedding vectors.
[0026] The term frequency-inverse document frequency (TNF) algorithm is used to extract features from the text of each module, extracting feature words for calculation (e.g., extracting 5000-dimensional feature words) to construct the TNF-inverse document frequency features for each module text. Based on the TNF-inverse document frequency features of each module text, an initial weight vector of TNF-inverse document frequency words is generated. Then, the feature word set is matched token by token with the token sequence of the module text. If the token exists in the feature word set, the initial weight value of the TNF-inverse document frequency corresponding to the token in the initial weight vector of the TNF-inverse document frequency words is retained; if the token does not exist in the feature word set, the initial weight value of the TNF-inverse document frequency of the token is assigned to 0. Finally, a token-weight mapping list with the same length as the token sequence of the current module text is obtained (example: if the token sequence is ["mechanical field", "patent search", "method"], then the token-weight mapping list is [0.85, 0.92, 0.0], where "method" is not in the feature word set and its initial weight is 0). Retrieve the pre-defined term frequency-inverse document frequency weighting coefficient table for each technical field (e.g., 1.2 for mechanical field, 1.15 for communication field, and 1.1 for chemical field). For the feature words in the token-weight mapping list that also appear in the patent terminology definition table, perform a weighted calculation: Weighted weight value = Original term frequency-inverse document frequency weight value × Weighting coefficient of the patent's field. For feature words that do not appear in the patent terminology definition table, their weight values remain unchanged (still the original value from step 2 or 0). Finally, obtain the weighted token-weight mapping list. 
Example: In mechanical field patents, the feature word "patent search" appears in the patent terminology definition table and its original term frequency-inverse document frequency (TF-IDF) weight value is 0.92, so the weighted weight value = 0.92 × 1.2 = 1.104; the feature word "mechanical field" does not appear in the patent terminology definition table, so its weight value remains 0.85. The weighted token-weight mapping list is then mapped to a fixed dimension (e.g., 768 dimensions) consistent with the other embedding vectors, resulting in a TF-IDF weight vector. Based on the extracted TF-IDF features and the generated TF-IDF weight vector, a feature word importance vector is obtained. Each token corresponds to an independent feature word importance vector, so the final set of feature word importance vectors has the same length as the token sequence of the module text and can be concatenated along the vector dimension, token by token, with the patent structure embedding vector, the token embedding vector, and so on.
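The token-weight mapping and field weighting described in this paragraph can be illustrated with a minimal Python sketch. The function names and the dictionary layout are assumptions made for illustration; the coefficients and weights are the example values from the text, and the terminology table holds only the one term used in the example.

```python
# Per-field weighting coefficients, using the example values from the text.
FIELD_COEFFICIENTS = {"mechanical": 1.2, "communication": 1.15, "chemical": 1.1}


def token_weight_mapping(tokens, tfidf_weights):
    """Return one weight per token; tokens outside the feature word set get 0.0."""
    return [tfidf_weights.get(token, 0.0) for token in tokens]


def apply_field_weighting(tokens, weights, terminology, field):
    """Scale the weights of tokens found in the terminology table by the field coefficient."""
    coefficient = FIELD_COEFFICIENTS[field]
    return [
        weight * coefficient if token in terminology else weight
        for token, weight in zip(tokens, weights)
    ]


tokens = ["mechanical field", "patent search", "method"]
tfidf_weights = {"mechanical field": 0.85, "patent search": 0.92}  # feature word set
terminology = {"patent search"}  # illustrative patent terminology definition table

initial = token_weight_mapping(tokens, tfidf_weights)
weighted = apply_field_weighting(tokens, initial, terminology, "mechanical")
print(initial)
print([round(w, 3) for w in weighted])
```

The sketch stops before the mapping to 768 dimensions, since the text does not specify how that mapping is performed.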
[0027] Figure 3 is a schematic diagram of the encoding layer of the patent feature extraction model in an embodiment of the present invention. In this embodiment, the encoding layer includes: a domain-adaptive masking layer 301, used to apply a first probability mask to preset patent core terms in each embedding vector matrix according to preset masking rules and a second probability mask to preset general words, wherein the first probability is greater than the second probability, and to perform semantic completion on each token masked by the first probability mask or the second probability mask to obtain the masked embedding vector matrix and the mask loss value of the module text; a dual-tower attention layer 302, used to perform self-attention operations on the masked embedding vector matrix and to introduce term frequency-inverse document frequency attention-guided weight calculation logic to generate a guided self-attention weight matrix and a guided structural attention weight matrix, wherein the guided self-attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module semantic feature vector that integrates contextual semantics, and the guided structural attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module structural feature vector that integrates module structural priority; and a cross-layer fusion gating layer 303, used to dynamically adjust the dual-tower fusion coefficient according to the domain fusion coefficient of the patent's technical field, to fuse the module semantic feature vector and the module structural feature vector token by token according to the dual-tower fusion coefficient to obtain a preliminary module fusion feature vector matrix, and to filter the preliminary module fusion feature vector matrix by the gating coefficient to obtain the module fusion feature vector matrix.
[0028] In the above embodiment, the domain-adaptive masking layer 301 serves as the bottom layer of the encoding layer. Its core function is to perform differential masking and preliminary semantic optimization on the embedding vector matrix corresponding to each input module text. The specific execution steps are as follows: First, the domain-adaptive masking layer receives the embedding vector matrix corresponding to the text of each module output by the input layer (example: 20×768 embedding vector matrix of module 1 of claim, 15×768 embedding vector matrix of module 1 of summary), and simultaneously obtains the preset masking rules, the preset set of core patent terms, and the preset set of general terms. Among them, the preset set of core patent terms is the core terms specific to the patent domain obtained by filtering based on the patent terminology definition table, including technical feature terms, patent legal terms, etc. (example: the core term set of this mechanical patent is [BERT, patent search method, tokenization processing, mechanical patent, independent claim]); the preset set of general terms consists of commonly used words that do not have patent domain-specific features, such as conjunctions and auxiliary words such as "including", "a", "described", and "the following".
[0029] Secondly, according to the preset masking rules, the vectors corresponding to each token in the embedded vector matrix are subjected to differential masking processing: for tokens belonging to preset core patent terms in the embedded vector matrix, a first probability masking operation is performed (example: the first probability is set to 45%, that is, core terms such as "patent retrieval method" and "tokenization processing" have a 45% probability of being masked); for tokens belonging to preset general terms in the embedded vector matrix, a second probability masking operation is performed (example: the second probability is set to 15%, that is, general terms such as "a" and "including" have a 15% probability of being masked); where the value of the first probability is greater than the value of the second probability, through this differential masking setting, the model's learning of core patent terms can be strengthened, the ineffective learning of general terms can be reduced, and the model's patent domain adaptability can be improved; for example: in the token sequence of claim 1 module, "patent retrieval method" (core term) has a 45% probability of being masked, and "including" (general term) has a 15% probability of being masked. In the end, "patent retrieval method" is selected for masking, and "including" is not selected.
[0030] Finally, for each token marked as a mask position after the first and second probabilistic masking operations, preliminary semantic completion processing is performed. For example, in the claim 1 module, "patent retrieval method" is marked as a mask position. The domain-adaptive masking layer predicts the original token for this mask position based on its context tokens "BERT-based mechanical field" and "characterized by including the following steps," assuming the prediction result is "patent matching method." Then, the predicted result "patent matching method" is compared with the real token "patent retrieval method," and the mask loss value is calculated. Simultaneously, the vector at the mask position is updated to the semantically completed vector, resulting in the masked embedding vector matrix corresponding to each module text (for example, the masked embedding vector matrix for the claim 1 module remains 20×768, only the row vector corresponding to "patent retrieval method" is updated). The domain-adaptive masking layer outputs the masked embedding vector matrix and the corresponding mask loss value for each module text, where the mask loss value is used for loss fusion in the subsequent task output layer.
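The differential masking rule of the domain-adaptive masking layer can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, tokens that belong to neither preset set are assumed never to be masked (the text does not say), and the semantic completion step is not modeled.

```python
import random


def select_mask_positions(tokens, core_terms, general_terms,
                          p_core=0.45, p_general=0.15, rng=None):
    """Return the positions chosen for masking under the two-probability rule."""
    if rng is None:
        rng = random.Random()
    positions = []
    for index, token in enumerate(tokens):
        if token in core_terms:
            probability = p_core       # first probability (core patent terms)
        elif token in general_terms:
            probability = p_general    # second probability (general terms)
        else:
            probability = 0.0          # assumption: other tokens are never masked
        if rng.random() < probability:
            positions.append(index)
    return positions


tokens = ["a", "patent retrieval method", "including", "tokenization processing"]
core_terms = {"patent retrieval method", "tokenization processing"}
general_terms = {"a", "including"}

# A seeded generator makes a run repeatable; the selected positions depend on the seed.
print(select_mask_positions(tokens, core_terms, general_terms, rng=random.Random(0)))
```

Over many draws, core terms are selected roughly three times as often as general terms with the example probabilities of 45% and 15%.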
[0031] The dual-tower attention layer 302, as the middle layer of the encoding layer, has the core function of extracting attention features along two dimensions from the masked embedding vector matrix output by the domain-adaptive masking layer: the contextual semantic features and the module structure priority features of the module text. At the same time, it introduces term frequency-inverse document frequency attention-guided weights to increase the share of attention given to core terms. Specifically, it includes two parallel attention sub-layers: Tower 1 (text sequence attention layer) and Tower 2 (patent structure attention layer). The specific execution steps are as follows: The first step is text sequence attention extraction (Tower 1): Tower 1 receives the masked embedding vector matrix corresponding to each module text output by the domain-adaptive masking layer (example: the 20×768 masked embedding vector matrix of the claim 1 module) and performs a self-attention operation on it; through the self-attention operation, the semantic association weight between each token and all other tokens in the token sequence of the module text is calculated as the self-attention weight. Then, the term frequency-inverse document frequency attention-guided weight calculation logic is introduced to optimize the self-attention weight. The specific calculation formula is as follows: guided self-attention weight = (self-attention weight + x) × term frequency-inverse document frequency weight, where x represents the influence rate of the term frequency-inverse document frequency weight. The larger the value of x, the greater the influence of the term frequency-inverse document frequency weight on the final attention weight. 
The value of x varies across technical fields and is preset as follows: in the mechanical field, x is 0.8; in the communication field, x is 0.69; other fields can preset corresponding values according to their characteristics. This yields the guided self-attention weights, which are organized into a guided self-attention weight matrix according to the correspondence between each token and its context tokens (example: the guided self-attention weight matrix of the claim 1 module is 20×20). Example: in this matrix, the guided self-attention weight of "tokenization processing" (the 18th token) with respect to "patent text in the mechanical field" (the 16th token) is 0.8 (the two are semantically strongly related, since the object of "tokenization" is "patent text in the mechanical field"), while its guided self-attention weight with respect to "a kind" (the 1st token) is 0.05 (the two are semantically very weakly related). Subsequently, the guided self-attention weight matrix is multiplied by the masked embedding vector matrix, which reweights the masked embedding vector matrix to strengthen the features of tokens with high semantic relevance and weaken those of tokens with low semantic relevance. Finally, the module semantic feature vector matrix, which fuses contextual semantic features, is obtained for each module text.
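The Tower 1 guidance formula, guided weight = (self-attention weight + x) × TF-IDF weight, followed by the multiplication with the embedding matrix, can be sketched on a toy 2×2 case. The text does not state which token's TF-IDF weight enters the product; this sketch assumes it is the weight of the attended-to (column) token. All function names and toy values are hypothetical.

```python
# Influence rates x per technical field, using the example values from the text.
INFLUENCE_RATE = {"mechanical": 0.8, "communication": 0.69}


def guide_attention(attention, tfidf_weights, field):
    """Apply (attention + x) * TF-IDF weight element-wise.

    Assumption: column j is scaled by the TF-IDF weight of attended-to token j.
    """
    x = INFLUENCE_RATE[field]
    return [
        [(value + x) * tfidf_weights[j] for j, value in enumerate(row)]
        for row in attention
    ]


def weight_embeddings(attention, embeddings):
    """Multiply an n x n attention matrix by an n x d embedding matrix."""
    dims = range(len(embeddings[0]))
    return [
        [sum(w * vector[d] for w, vector in zip(row, embeddings)) for d in dims]
        for row in attention
    ]


attention = [[0.2, 0.1], [0.0, 0.5]]   # toy 2 x 2 self-attention weights
tfidf_weights = [1.0, 0.5]             # toy per-token TF-IDF weights
embeddings = [[1.0, 0.0], [0.0, 2.0]]  # toy 2 x 2 masked embedding matrix

guided = guide_attention(attention, tfidf_weights, "mechanical")
semantic = weight_embeddings(guided, embeddings)
print(guided)
print(semantic)
```

In the embodiment the same operations would act on a 20×20 weight matrix and a 20×768 embedding matrix.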
[0032] The second step is patent structure attention extraction (Tower 2): Tower 2 receives the masked embedding vector matrix corresponding to each module text output by the domain-adaptive masking layer (example: the 20×768 masked embedding vector matrix of the claim 1 module) and simultaneously obtains the preset patent module weight matrix. This patent module weight matrix is a set of weights preset according to the functional priority of each module of the patent; for example, the weight of the claim module is higher than that of the abstract module, and the weight of the abstract module is higher than that of the embodiment module. Specifically, the weights can be set to 1.6 for the claim module, 1.2 for the abstract module, 0.8 for the embodiment module, 0.7 for the background technology module, and so on (example: the claim 1 module belongs to the claim module, so its weight is 1.6). Tower 2 performs a self-attention operation on the masked embedding vector matrix to generate an initial structural attention weight matrix (20×20), and then applies the same term frequency-inverse document frequency attention-guided weight calculation logic as in Tower 1 to optimize the initial structural attention weights, obtaining an optimized structural attention weight matrix. The optimized structural attention weight matrix is then weighted and fused with the patent module weight (1.6 in this example) to obtain the guided structural attention weight matrix (example: each element in the optimized weight matrix is multiplied by 1.6). Next, the guided structural attention weight matrix is multiplied with the masked embedding vector matrix to reweight it, strengthening the features of tokens in high-priority modules and weakening those of tokens in low-priority modules. 
Finally, the module structural feature vector matrix, which fuses the module structure priority features, is obtained for each module text (example: the module structural feature vector matrix of the claim 1 module is 20×768, and its feature strength is higher than that of the corresponding matrices of the summary module and the embodiment module).
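The module-priority scaling in Tower 2 amounts to multiplying every element of the optimized structural attention matrix by the weight of the module the text belongs to. A minimal sketch, with hypothetical names and a toy 1×2 matrix standing in for the 20×20 case:

```python
# Module priority weights, using the example values from the text.
MODULE_WEIGHTS = {"claim": 1.6, "abstract": 1.2, "embodiment": 0.8, "background": 0.7}


def guided_structural_attention(optimized_attention, module):
    """Scale every attention weight by the priority weight of the module."""
    weight = MODULE_WEIGHTS[module]
    return [[weight * value for value in row] for row in optimized_attention]


optimized = [[0.5, 0.25]]  # toy optimized structural attention weights
claim_scaled = guided_structural_attention(optimized, "claim")
background_scaled = guided_structural_attention(optimized, "background")
print(claim_scaled, background_scaled)
```

The same toy weights come out larger for the claim module than for the background technology module, which is the intended priority effect.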
[0033] The cross-layer fusion gate layer 303, as the top layer of the encoding layer, has the core function of performing domain-adaptive fusion of the two types of feature vector matrices output by the dual-tower attention layer and filtering redundant features. The specific execution steps are as follows: First, the cross-layer fusion gating layer receives the module semantic feature vector matrix and module structural feature vector matrix corresponding to the module text output by the dual-tower attention layer (example: the 20×768 module semantic feature vector matrix and 20×768 module structural feature vector matrix of the module in claim 1). Simultaneously, it obtains the domain fusion coefficient corresponding to the technical field to which the patent belongs. This domain fusion coefficient is pre-set according to the characteristics of different technical fields and is used to dynamically adjust the fusion ratio of module semantic features and module structural features. For example, if the domain fusion coefficient for the mechanical field is set to 0.7 and the domain fusion coefficient for the communication field is set to 0.6, and assuming the current patent belongs to the mechanical field, then the domain fusion coefficient α of this patent is 0.7.
[0034] Secondly, the dual-tower fusion coefficient is dynamically adjusted based on the domain fusion coefficient of the patent's technical field. The dual-tower fusion coefficient includes a tower 1 fusion coefficient and a tower 2 fusion coefficient. The tower 1 fusion coefficient equals the domain fusion coefficient, and the tower 2 fusion coefficient equals 1 minus the domain fusion coefficient (example: tower 1 fusion coefficient = 0.7, tower 2 fusion coefficient = 1 - 0.7 = 0.3). Subsequently, according to the dual-tower fusion coefficient, token-by-token feature fusion is performed on the module semantic feature vector matrix and the module structural feature vector matrix. That is, the token vectors at corresponding positions in the two matrices are multiplied by their respective fusion coefficients and then summed (example: in the claim 1 module, the module semantic feature vector of the 18th token, "tokenization processing", is [0.5, 0.6, ..., 0.7], and its module structural feature vector is [0.8, 0.9, ..., 1.0]. The fused vector = 0.7 × [0.5, 0.6, ..., 0.7] + 0.3 × [0.8, 0.9, ..., 1.0] = [0.59, 0.69, ..., 0.79]). In this way, the preliminary module fusion feature vector matrix corresponding to each module text is obtained (example: the preliminary module fusion feature vector matrix of the claim 1 module is 20×768).
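The token-by-token fusion can be checked with a short sketch that reproduces the worked example, using only the three vector components the text shows explicitly. The function name is hypothetical.

```python
def fuse_towers(semantic, structural, alpha):
    """Fuse two feature matrices token by token: alpha*semantic + (1 - alpha)*structural."""
    return [
        [alpha * s + (1 - alpha) * t for s, t in zip(sem_row, str_row)]
        for sem_row, str_row in zip(semantic, structural)
    ]


semantic = [[0.5, 0.6, 0.7]]    # module semantic feature vector of one token
structural = [[0.8, 0.9, 1.0]]  # module structural feature vector of the same token
alpha = 0.7                     # domain fusion coefficient for the mechanical field

fused = fuse_towers(semantic, structural, alpha)
print([round(v, 2) for v in fused[0]])
```

Because the two coefficients sum to 1, each fused component lies between the corresponding semantic and structural values.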
[0035] Finally, the cross-layer fusion gating layer introduces the sigmoid activation function to dynamically calculate a gating coefficient for each token vector according to its feature intensity (example: the value range of the gating coefficient is 0 to 1, so the maximum value of the gating coefficient is 1; in the claim 1 module, the token vector corresponding to "tokenization processing" has a high feature intensity, and its gating coefficient is g = 0.9; the token vector corresponding to the general word "of" has a low feature intensity, and its gating coefficient is g = 0.1). The preliminary module fusion feature vector matrix is then subjected to redundant feature filtering through this gating coefficient. The specific steps are as follows: multiply the gating coefficient by the token vector at the corresponding position in the preliminary module fusion feature vector matrix to obtain the first product; calculate the difference between the maximum value of the gating coefficient and the current gating coefficient, and multiply this difference by the original embedding vector of the token to obtain the second product; the sum of the first product and the second product is the final vector of the token. Example: the final vector of "tokenization processing" = 0.9 × [0.59, 0.69, ..., 0.79] + (1 - 0.9) × original embedding vector; the final vector of "of" = 0.1 × preliminary fusion vector + 0.9 × original embedding vector (weakening its fused features and filtering redundancy). Finally, the module fusion feature vector matrix corresponding to each module text is obtained (example: the module fusion feature vector matrix of the claim 1 module is 20×768). This matrix simultaneously contains the contextual semantic features, module structure priority features, and domain adaptation features of the module text, and provides the core input for the prediction tasks of the task output layer.
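The gating step can be sketched as a convex blend of the fused vector and the original embedding. The text does not define how feature intensity is measured, so the sketch takes the gating coefficient as given and shows the sigmoid only as the mapping from an arbitrary score to the range 0 to 1; the original embedding values are hypothetical.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a gating coefficient between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def gate_token(gate, fused, original, gate_max=1.0):
    """Blend vectors: gate * fused + (gate_max - gate) * original."""
    return [gate * f + (gate_max - gate) * o for f, o in zip(fused, original)]


fused = [0.59, 0.69, 0.79]   # preliminary fusion vector from the worked example
original = [1.0, 1.0, 1.0]   # hypothetical original embedding values

strong = gate_token(0.9, fused, original)  # token with high feature intensity
weak = gate_token(0.1, fused, original)    # token with low feature intensity
print([round(v, 3) for v in strong])
print([round(v, 3) for v in weak])
```

A gating coefficient near 1 keeps mostly the fused features, while a coefficient near 0 falls back to the original embedding.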
[0036] In the embodiment, the task output layer is used to: based on the module fusion feature vector matrix, predict the tokens at the masked positions in the first probability mask and the second probability mask, calculate the cross-entropy loss according to the prediction results of the tokens at the masked positions and the true tokens at the masked positions; based on the module fusion feature vector matrix, predict the fine-grained distances of the sentence pairs in each module content, and calculate the mean square error loss according to the predicted fine-grained distances of each group of sentence pairs and the fine-grained distance labels; where the fine-grained distances are adaptively divided according to the total number of sentences in the module text; based on the module fusion feature vector matrix, predict the module text to which each token in each module text belongs, and calculate the classification loss according to the predicted module text to which the token belongs and the patent structure embedding vector; according to the fused cross-entropy loss, mean square error loss and classification loss, optimize the parameters of the patent feature extraction model through backpropagation.
[0037] In the above embodiment, the mask position is the location of the replaced token. The task output layer performs three types of prediction tasks based on the module fusion feature vector matrix output by the encoding layer, calculates the corresponding loss values and fuses them, and finally optimizes the model parameters through backpropagation. The specific execution steps are as follows (continuing the claim 1 module example of the above mechanical field patent): The first step is to predict the mask position tokens and calculate the cross-entropy loss: the task output layer receives the module fusion feature vector matrix corresponding to each module text output by the encoding layer (example: the 20×768 module fusion feature vector matrix of the claim 1 module). Based on this matrix, the model predicts the token at each position masked by the first and second probability masks in the domain-adaptive masking layer. For example, in the claim 1 module, the mask position is the position corresponding to "patent retrieval method"; the model predicts the token for this position based on the module fusion feature vector, and the prediction result is "patent matching method". Then, the pre-stored real token corresponding to each mask position is retrieved. This real token is the token that originally occupied the mask position before the masking operation was performed on the module text (e.g., the real token at this mask position is "patent retrieval method"). The predicted token at each mask position is then compared with the corresponding real token, and the cross-entropy loss value is calculated using the cross-entropy loss formula. This loss value quantifies the model's prediction error for the mask position tokens. 
For example, because the predicted result "patent matching method" differs from the real token "patent retrieval method", the calculated cross-entropy loss value is 0.8 (the larger the loss value, the greater the prediction error).
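For a single masked position, cross-entropy reduces to the negative log of the probability the model assigned to the real token. The sketch below assumes the natural logarithm and a hypothetical two-candidate probability distribution; neither is given in the text. With a probability of 0.45 on the real token, the loss comes out near the 0.8 used in the example.

```python
import math


def cross_entropy(predicted_probs, true_token):
    """Negative log-probability assigned to the true token (natural log)."""
    return -math.log(predicted_probs[true_token])


# Hypothetical predicted distribution over two candidate tokens.
probs = {"patent matching method": 0.55, "patent retrieval method": 0.45}
loss = cross_entropy(probs, "patent retrieval method")
print(round(loss, 2))
```

The loss shrinks toward 0 as the probability on the real token approaches 1.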
[0038] The second step involves performing fine-grained distance prediction and mean squared error loss calculation for sentence pairs based on the module-fused feature vector matrix.
[0039] This invention proposes two fine-grained distance division rules. Rule 1: Number of sentences in the module < first quantity. The first quantity is a predefined value; for example, it can be defined as 10 sentences. For module text whose total number of sentences is less than the first quantity, a fine-grained interval annotation method is used to divide the fine-grained distance into multiple levels, such as the following 6 levels: 0: adjacent (no other sentences between the two sentences); 1: one sentence between the two sentences; 2: two sentences between the two sentences; 3: three sentences between the two sentences; 4: four sentences between the two sentences; ≥5: distant (5 or more sentences between the two sentences).
[0040] Rule 2: Number of sentences in the module > First quantity
[0041] For long text modules whose total number of sentences is greater than the first quantity, a two-tiered granularity annotation method of "intra-paragraph distance" and "cross-paragraph distance" is introduced, as follows: Within the same paragraph: a fine-grained annotation method with intervals of 0 to 10 sentences is used, i.e., distance levels of 0 (adjacent), 1 (1 sentence interval) ... 10 (10 sentences interval); if two sentences within the same paragraph are more than 10 sentences apart, they are marked as "≥11". Crossing paragraphs: the label is a compound label in the form of "crossing k paragraphs + spacing m sentences", where k is the number of paragraphs between the two sentences and m is the relative spacing between the two sentences within their respective paragraphs. Example: "crossing 1 paragraph + 5 sentences" means that the two sentences belong to two adjacent paragraphs and the relative spacing between them within their respective paragraphs is 5 sentences.
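The adaptive labeling rules for sentences in the same paragraph can be sketched as a capped gap count. Several points are assumptions: sentence indices are 1-based, the open-ended levels "≥5" and "≥11" are encoded as the integers 5 and 11, and a module with exactly the first quantity of sentences is treated under Rule 2 because the text leaves that case open. The compound cross-paragraph label is left out of this sketch.

```python
def distance_label(index_a, index_b, total_sentences, first_quantity=10):
    """Return the fine-grained distance level for two sentences in one paragraph."""
    if index_a == index_b:
        raise ValueError("a sentence pair needs two different sentences")
    gap = abs(index_a - index_b) - 1  # number of sentences strictly between them
    # Rule 1 caps at level 5 ("≥5"); Rule 2 (intra-paragraph) caps at level 11 ("≥11").
    cap = 5 if total_sentences < first_quantity else 11
    return min(gap, cap)


print(distance_label(3, 5, total_sentences=5))    # short module, one sentence between
print(distance_label(1, 9, total_sentences=9))    # short module, capped level
print(distance_label(1, 20, total_sentences=60))  # long module, capped level
```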
[0042] The task output layer, based on the module fusion feature vector matrix corresponding to each module text, performs fine-grained distance prediction on all sentence pairs within each module text. According to the total number of sentences in the module text, the corresponding adaptive granularity division rule is applied to obtain the predicted fine-grained distance for each sentence pair. Example 1 (short text module, number of sentences < 10): the claim 1 module text contains 5 sentences, with a 1-sentence gap between sentence 3 and sentence 5. The model predicts a fine-grained distance of 1 between them, and the corresponding true distance label is 1. Example 2 (long text module, number of sentences > 50): the invention content module contains 60 sentences, divided into 3 paragraphs. Sentence 8 in paragraph 1 and sentence 2 in paragraph 3 are separated by a 1-paragraph gap, with a relative gap of 6 sentences. The model predicts "crossing 1 paragraph + spacing 6 sentences", and the corresponding true label is "crossing 1 paragraph + spacing 6 sentences". Simultaneously, the pre-labeled fine-grained distance label of each sentence pair is retrieved. These labels are the actual distances between sentences, labeled according to the logical structure of the patent module text (example: sentence 1 and sentence 2 are adjacent sentences, so under Rule 1 the actual distance label is 0). The predicted fine-grained distance of each sentence pair is compared with the corresponding fine-grained distance label, and the mean squared error loss value is calculated using the mean squared error loss formula. For example, for the sentence pair in Example 1, if the predicted distance 1 is consistent with the actual distance label 1, the squared error of this sentence pair is 0; if the model predicts a distance of 2, the squared error is (2-1)²=1. Finally, the mean squared error loss value of this module is calculated to be 0.2.
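The mean squared error over the sentence pairs of a module can be illustrated as follows. The five predicted and true labels are hypothetical; they were chosen so that one pair out of five is off by one level, which gives the module-level value of 0.2 mentioned in the example.

```python
def mean_squared_error(predicted, labels):
    """Average of squared differences between predicted and true distance levels."""
    if not predicted or len(predicted) != len(labels):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, labels)) / len(predicted)


predicted = [1, 2, 0, 3, 1]  # hypothetical predicted distance levels
labels = [1, 1, 0, 3, 1]     # hypothetical true distance labels

module_loss = mean_squared_error(predicted, labels)
print(module_loss)
```

This treats the distance levels as numbers; how the compound cross-paragraph labels would enter the same formula is not specified in the text.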
[0043] The third step is to predict the module text to which each token belongs and calculate the classification loss: based on the module fusion feature vector matrix corresponding to each module text, the task output layer predicts, for every token within each module text, the module text to which it belongs. Example: in the claim 1 module, the token corresponding to "tokenization" is predicted by the model to belong to the "claims module". At the same time, the patent structure embedding vector corresponding to each token is retrieved. This vector contains the true attribute features of the module text to which the token belongs (example: the patent structure embedding vector corresponding to "tokenization" indicates that its true module text is the "claims module"). The predicted module text of each token is compared with the true module text represented by the corresponding patent structure embedding vector, and the classification loss value is calculated using the classification loss formula. This loss value quantifies the model's prediction error for the patent module text to which each token belongs. Example: if the model predicts that "tokenization" belongs to the "embodiment module", a classification error occurs, and the final calculated classification loss value for this module text is 0.3.
[0044] The fourth step is loss fusion and model parameter optimization: the task output layer fuses the calculated cross-entropy loss, mean squared error loss, and classification loss. The three loss values are weighted and summed according to preset loss fusion weights to obtain the fused loss value. These preset weights can be set based on model training requirements; for example, a weight of 0.3 for the cross-entropy loss, 0.5 for the mean squared error loss, and 0.2 for the classification loss. Example: fused loss value = 0.3 × 0.8 + 0.5 × 0.2 + 0.2 × 0.3 = 0.24 + 0.1 + 0.06 = 0.4. Finally, based on this fused loss value, the backpropagation algorithm is used to update all parameters of the patent feature extraction model and adjust the weight coefficients of each layer, reducing the overall prediction error of the model and improving both its accuracy in extracting patent text features and its patent domain adaptation performance. Example: because the fused loss value of 0.4 indicates remaining prediction error, the model adjusts the weight coefficients of the dual-tower attention in the encoding layer and the mapping parameters of the embedding vectors in the input layer through backpropagation, so that the loss value in the next training iteration is reduced.
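The weighted loss fusion can be reproduced with a few lines. The dictionary keys and the function name are hypothetical; the weights and loss values are the example figures from the text.

```python
# Preset loss fusion weights, using the example values from the text.
LOSS_WEIGHTS = {"cross_entropy": 0.3, "mean_squared_error": 0.5, "classification": 0.2}


def fuse_losses(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the three task losses."""
    return sum(weights[name] * losses[name] for name in weights)


losses = {"cross_entropy": 0.8, "mean_squared_error": 0.2, "classification": 0.3}
fused_loss = fuse_losses(losses)
print(round(fused_loss, 2))
```

In a training framework the fused value would be the scalar on which backpropagation is run; that part is outside this sketch.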
[0045] During training, the patent feature extraction model requires preparation of training samples. Taking the mechanical field as an example, a sufficient number of mechanical patent texts are selected as training samples. Each training sample retains the complete full-text patent text to ensure that the samples cover patents of different technical directions and structural types in the mechanical field, thus guaranteeing the diversity and representativeness of the training samples. At the same time, each training sample is manually labeled. The labeling content includes the technical field label corresponding to the patent text (example: "mechanical field"), the module text boundary corresponding to the preset patent structure label, the fine-grained distance label of sentence pairs, and the module text label to which the token belongs, providing a true label basis for subsequent loss calculation.
[0046] Supporting data preparation includes: preparing a patent terminology definition table, which is a dedicated terminology database built on the characteristics of the patent field and covers patent technical terms and patent legal terms in the mechanical field; presetting a patent structure tag set, including technical field tags, abstract tags, claim tags, background technology tags, invention content tags, and embodiment tags; setting the preset masking rules, including a preset core patent terminology set, a preset general terminology set, and the corresponding first probability mask value and second probability mask value; setting a patent module weight matrix, which assigns weights according to the functional priority of each patent module; setting the domain fusion coefficients corresponding to different technical fields; and setting the loss fusion weights for the subsequent fusion calculation of the three types of loss values.
[0047] Model initialization: Initialize all trainable parameters of the input layer, encoding layer, and task output layer of the patent feature extraction model, including embedding vector mapping parameters, self-attention weight parameters, gating layer coefficients, output layer prediction parameters, etc. All parameters are set to initial values that meet the model training requirements to ensure that the model can start training normally.
[0048] Then, the following single training iteration process begins: 1. Input layer data preprocessing and embedding vector matrix generation: The full text of a single mechanical field patent training sample, the patent terminology definition table, the corresponding technical field label, and the preset patent structure labels are input into the input layer of the patent feature extraction model. The input layer completes preprocessing and embedding vector matrix generation according to the following process to provide basic input for the encoding layer: (1) Modular splitting of patent text and generation of patent structure embedding vector: The input layer splits the full patent text of the training sample into multiple independent module texts according to the preset patent structure labels. The module texts form a module text sequence in their original order in the patent. At the same time, the preset patent structure label corresponding to each module text is converted into a fixed-dimensional vector to generate the patent structure embedding vector of that module text, which is used to characterize the attribute features of the module text.
[0049] (2) Token sequence generation and token embedding vector conversion: For each module text in the module text sequence, the input layer performs separate token segmentation processing according to the principle of patent terminology integrity to ensure that patent-specific terms such as "independent claim" and "dependent claim" are not split, forming a token sequence corresponding to each module text; then, each token in the token sequence is mapped to a fixed-dimensional vector to generate a token embedding vector corresponding to each token, which is used to capture the basic semantic features of each token and realize the numerical representation of the token.
[0050] (3) Position embedding vector generation: The input layer performs position encoding on the token sequence corresponding to each module text, encoding the specific position of each token in the token sequence of its module text and converting this position information into a fixed-dimensional vector to form the position embedding vector corresponding to each token, which is used to preserve the word order features inside the module text and ensure the logical coherence of the context.
[0051] (4) Domain embedding vector generation: The input layer receives the technical field label (example: "mechanical field") corresponding to the training sample, and converts the technical field label into a fixed-dimensional domain embedding vector to encode the exclusive features of the technical field to which the patent belongs, adapting to the text feature differences of mechanical field patents.
[0052] (5) External knowledge embedding vector generation: The input layer receives a patent terminology definition table and maps each patent term in the vocabulary to a fixed-dimensional external knowledge embedding vector, which is used to incorporate exclusive external knowledge in the patent field and improve the model's ability to understand the semantics of patent terms.
[0053] (6) Generation of three-dimensional technical feature embedding vector and life cycle embedding vector: extract functional, behavioral and structural three-dimensional technical features from each module text, form triples and encode them as three-dimensional technical feature embeddings; encode the examination process and legal status of the patent text to form life cycle embeddings.
[0054] (7) Generation of the term frequency-inverse document frequency (TF-IDF) word weight vector and the feature word importance feature vector: Following the four-step process of “generating the TF-IDF word weight vector → aligning feature words with tokens → domain-weighted calculation → dimension unification and standardization”, a 5000-dimensional TF-IDF word weight vector is generated, from which a feature word importance feature vector with the same dimension as the other embedding vectors is obtained.
[0055] (8) Concatenation of embedding vector matrix: The input layer concatenates the patent structure embedding vector corresponding to each module text, the token embedding vector corresponding to each token in the token sequence of the module text, the position embedding vector corresponding to each token, the domain embedding vector corresponding to the patent, the external knowledge embedding vector corresponding to each token, the three-dimensional technical feature embedding vector and the life cycle embedding vector, and the feature word importance feature vector according to the vector dimension to finally generate a unified embedding vector matrix corresponding to each module text. Each module text corresponds to an independent embedding vector matrix, providing comprehensive and high-quality input data for the feature processing of the encoding layer.
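The concatenation of step (8) can be sketched as follows. This is only an illustration under the assumption that the sub-vectors are concatenated along the feature dimension and that module-level vectors (such as the structure and domain embeddings) are repeated for every token; the function name, the dictionary keys, and the toy dimensions are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def build_embedding_matrix(
    per_token: List[Dict[str, Vector]],
    per_module: Dict[str, Vector],
    order: List[str],
) -> List[Vector]:
    """Return one row per token, with sub-vectors concatenated in `order`.

    Names found in the token's own dictionary are taken from it; all other
    names are taken from the module-level dictionary and thus repeated.
    """
    matrix: List[Vector] = []
    for token_vecs in per_token:
        row: Vector = []
        for name in order:
            source = token_vecs if name in token_vecs else per_module
            row.extend(source[name])
        matrix.append(row)
    return matrix


per_module = {"structure": [1.0, 0.0], "domain": [0.5]}
per_token = [
    {"token": [0.1, 0.2], "position": [0.0]},
    {"token": [0.3, 0.4], "position": [1.0]},
]
order = ["structure", "token", "position", "domain"]
matrix = build_embedding_matrix(per_token, per_module, order)
```

Each module text would yield its own matrix of this kind, with one row per token.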
[0056] 2. Feature encoding at the encoding layer and generation of feature vector matrix through module fusion
[0057] The embedding vector matrix corresponding to each module text generated by the input layer is sequentially input into the encoding layer of the patent feature extraction model. The encoding layer completes feature encoding and generates the module fusion feature vector matrix through three sub-layers: a domain-adaptive masking layer, a dual-tower attention layer, and a cross-layer fusion gating layer. The specific process is as follows: (1) Domain Adaptive Masking Layer Processing: The domain adaptive masking layer receives the embedding vector matrix corresponding to each module text output by the input layer, and simultaneously obtains the preset masking rules, the preset patent core terminology set, and the preset general term set; according to the preset masking rules, it performs a first probability masking operation on the tokens belonging to the preset patent core terms in each embedding vector matrix, and performs a second probability masking operation on the tokens belonging to the preset general terms; subsequently, it performs semantic completion processing on all tokens marked as mask positions, predicts the tokens at the mask positions and calculates the masking loss value, and simultaneously generates the masked embedding vector matrix corresponding to each module text, and outputs it to the dual-tower attention layer.
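The two-probability masking performed by the domain-adaptive masking layer can be sketched as below. The sketch operates on token strings rather than on embedding rows for readability; the probability values, the `[MASK]` placeholder, and the function name are assumptions, since the preset masking rules are not enumerated here.

```python
import random
from typing import Dict, List, Set, Tuple

MASK = "[MASK]"


def domain_adaptive_mask(
    tokens: List[str],
    core_terms: Set[str],
    general_terms: Set[str],
    p_core: float,
    p_general: float,
    rng: random.Random,
) -> Tuple[List[str], Dict[int, str]]:
    """Mask core terms with probability p_core and general terms with p_general.

    Returns the masked sequence and a map from masked position to the
    original token, which serves as the prediction target.
    """
    masked: List[str] = []
    targets: Dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if tok in core_terms:
            p = p_core
        elif tok in general_terms:
            p = p_general
        else:
            p = 0.0  # tokens in neither set are never masked in this sketch
        if rng.random() < p:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


tokens = ["the", "independent claim", "describes", "a", "gear"]
masked, targets = domain_adaptive_mask(
    tokens,
    core_terms={"independent claim", "gear"},
    general_terms={"the", "a"},
    p_core=1.0,      # extreme values chosen only to make the example deterministic
    p_general=0.0,
    rng=random.Random(0),
)
```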
[0058] (2) Feature extraction by the dual-tower attention layer: The dual-tower attention layer consists of two parallel attention sub-layers: a text sequence attention layer (tower 1) and a patent structure attention layer (tower 2). The text sequence attention layer performs a self-attention operation on the masked embedding vector matrix and, through term frequency-inverse document frequency (TF-IDF) attention, obtains the guided self-attention weight matrix and the guided patent module weight matrix. The module semantic feature vector matrix, which fuses context semantic features, is then obtained through matrix multiplication. The patent structure attention layer receives the masked embedding vector matrix and the guided patent module weight matrix, performs a self-attention operation, and fuses in the guided patent module weight matrix to generate a structure attention weight matrix. The module structure feature vector matrix, which fuses module structure priority features, is then obtained through matrix multiplication. The outputs of both sub-layers are input to the cross-layer fusion gating layer.
[0059] (3) Feature fusion by the cross-layer fusion gating layer: The cross-layer fusion gating layer receives the module semantic feature vector matrix and the module structure feature vector matrix, and obtains the domain fusion coefficient corresponding to the technical field to which the training sample belongs. It dynamically sets the dual-tower fusion coefficients according to the domain fusion coefficient (the fusion coefficient of tower 1 is equal to the domain fusion coefficient, and the fusion coefficient of tower 2 is equal to 1 minus the domain fusion coefficient), and performs token-by-token feature fusion on the two feature vector matrices according to these coefficients to generate a preliminary module fusion feature vector matrix. It then applies the sigmoid activation function to calculate the gate coefficient corresponding to each token, filters the redundant features in the preliminary module fusion feature vector matrix through the gate coefficients, and finally generates the module fusion feature vector matrix corresponding to each module text, which is output to the task output layer.
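The dual-tower fusion and gating described above can be sketched as follows. The fusion coefficients follow the text (tower 1 uses the domain fusion coefficient, tower 2 uses one minus it). How the input to the sigmoid is computed is not stated, so the sketch assumes a per-token scalar gate obtained from a linear projection with hypothetical parameters `gate_w` and `gate_b`.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_and_gate(
    semantic: Matrix,
    structural: Matrix,
    alpha: float,
    gate_w: List[float],
    gate_b: float,
) -> Matrix:
    """Fuse two token-by-feature matrices and apply a per-token sigmoid gate.

    alpha is the domain fusion coefficient: tower 1 is weighted by alpha and
    tower 2 by (1 - alpha). The gate projection is an assumption.
    """
    output: Matrix = []
    for sem_row, str_row in zip(semantic, structural):
        fused = [alpha * s + (1.0 - alpha) * t for s, t in zip(sem_row, str_row)]
        gate = sigmoid(sum(w * f for w, f in zip(gate_w, fused)) + gate_b)
        output.append([gate * f for f in fused])
    return output


fused = fuse_and_gate(
    semantic=[[1.0, 2.0]],
    structural=[[3.0, 4.0]],
    alpha=0.5,
    gate_w=[0.0, 0.0],  # zero projection gives a gate of sigmoid(0) = 0.5
    gate_b=0.0,
)
```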
[0060] 3. Calculation of prediction and fusion loss values for the task output layer
[0061] The module fusion feature vector matrix output from the encoding layer is input to the task output layer of the patent feature extraction model. The task output layer performs three types of prediction tasks based on this matrix, calculates the corresponding loss values for each, and fuses them to obtain the fusion loss value. The specific process is as follows: (1) Mask position token prediction and cross-entropy loss calculation: The task output layer performs token prediction on all mask positions marked in the domain adaptive mask layer based on the module fusion feature vector matrix, and obtains the token prediction result for each mask position; at the same time, it retrieves the real token corresponding to each mask position (i.e., the token that originally existed at the mask position before the patent module text was masked) stored in the system in advance; compares the token prediction result of each mask position with the corresponding real token, and calculates the cross-entropy loss value using the cross-entropy loss formula. This loss value is used to quantify the prediction error of the model for the mask position token.
[0062] (2) Sentence Pair Fine-Grained Distance Prediction and Mean Squared Error Loss Calculation: The task output layer performs fine-grained distance prediction on all sentence pairs within each module text based on the module fusion feature vector matrix. According to the total number of sentences in the module, the corresponding adaptive granularity partitioning rule is adopted to obtain the predicted fine-grained distance for each group of sentence pairs. At the same time, the pre-labeled fine-grained distance labels corresponding to each group of sentence pairs are retrieved (i.e., the actual distance between sentences labeled according to the logical structure of the patent module text). The predicted fine-grained distance of each group of sentence pairs is compared with the corresponding fine-grained distance labels, and the mean squared error loss value is calculated using the mean squared error loss formula. This loss value is used to quantify the prediction error of the model on the logical association distance between sentences within the patent module text.
[0063] (3) Prediction of the module text to which the token belongs and calculation of classification loss value: The task output layer predicts the module text to which all tokens belong within each module text based on the module fusion feature vector matrix, and obtains the predicted module text to which each token belongs; at the same time, it retrieves the patent structure embedding vector corresponding to each token (this vector contains the real attribute features of the module text to which the token belongs); compares the predicted module text to which each token belongs with the real module text attributes represented by the corresponding patent structure embedding vector, and calculates the classification loss value using the classification loss formula. This loss value is used to quantify the prediction error of the model for the module text to which the token belongs.
[0064] (4) Calculation of fusion loss value: The task output layer calculates the cross-entropy loss value, mean square error loss value and classification loss value obtained above by weighting and summing them according to the preset loss fusion weight, and obtains the fusion loss value of this training iteration. This fusion loss value is used to quantify the overall prediction error of the model in this iteration.
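The weighted summation of the three loss values can be sketched as below. The cross-entropy and mean squared error helpers use their textbook definitions; the loss fusion weights in the example are hypothetical placeholders, since the preset weights are not given.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Negative log-probability assigned to the true token."""
    return -math.log(probs[target_index])


def mean_squared_error(predicted: List[float], labels: List[float]) -> float:
    """Average squared difference between predicted and labeled distances."""
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)


def fused_loss(losses: List[float], weights: List[float]) -> float:
    """Weighted sum of the individual loss values."""
    return sum(w * loss for w, loss in zip(weights, losses))


ce = cross_entropy([0.5, 0.5], target_index=0)
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
cls = 1.0  # stand-in value for the classification loss
total = fused_loss([ce, mse, cls], weights=[0.5, 0.3, 0.2])  # assumed weights
```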
[0065] 4. Model parameter update and optimization
[0066] After the task output layer calculates the fusion loss value, it is propagated back to each sub-layer of the patent feature extraction model, including the input layer, encoding layer, and task output layer, through the backpropagation algorithm. Based on the error gradient obtained from backpropagation, all trainable parameters of the model are updated and optimized, including the embedding vector mapping parameters of the input layer, the self-attention weight parameters of the encoding layer, the gating layer coefficients, and the prediction parameters of the task output layer. By adjusting the parameters, the overall prediction error of the model is reduced, and the model's extraction accuracy and adaptability for patent text features in the mechanical field are improved, completing a single training iteration.
[0067] Repeat steps 1 to 4 of the above single training iteration process, sequentially inputting all prepared mechanical domain patent training samples into the model for training. After completing one full training cycle for all samples, proceed to the next training cycle, continuously optimizing model performance through sample input, feature processing, loss calculation, and parameter optimization. Preset training termination conditions include the maximum number of training iterations and the minimum fusion loss threshold. During training, the fusion loss value and training iteration count are monitored in real time for each training cycle. When the number of training iterations reaches the preset maximum number of training iterations, or the model's fusion loss value decreases to the preset minimum fusion loss threshold and remains stable (no significant decrease in fusion loss value over multiple consecutive cycles), model training is stopped. At this point, the parameters of the patent feature extraction model have reached their optimal state, enabling accurate extraction of semantic, structural, and domain features from mechanical domain patent texts, which can be used for subsequent patent retrieval, matching, and other related tasks.
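The two termination conditions can be sketched as a single check. The reading of "remains stable" as "the loss fell by less than a tolerance in each of the last `patience` cycles" is an assumption, as are the parameter names and the example values.

```python
from typing import List


def should_stop(
    loss_history: List[float],
    iteration: int,
    max_iterations: int,
    min_loss: float,
    patience: int,
    tolerance: float,
) -> bool:
    """Return True when training should end.

    Stops when the iteration budget is exhausted, or when the latest fusion
    loss is at or below min_loss and has not dropped by `tolerance` or more
    in any of the last `patience` consecutive cycles.
    """
    if iteration >= max_iterations:
        return True
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    below_threshold = recent[-1] <= min_loss
    stable = all(recent[i] - recent[i + 1] < tolerance for i in range(patience))
    return below_threshold and stable


plateaued = should_stop([0.5, 0.1, 0.1, 0.1], 4, 100, 0.1, 2, 1e-3)
still_falling = should_stop([0.5, 0.3, 0.1], 3, 100, 0.1, 2, 1e-3)
budget_spent = should_stop([0.9], 100, 100, 0.1, 2, 1e-3)
```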
[0068] After the patent feature extraction model has been trained on historical patent texts, the module fusion feature vector matrices of all historical patent texts are extracted and stored in the patent feature library to be matched, with the module fusion feature vector matrix of each patent text stored as multiple vectors.
[0069] In step 102, after obtaining the user's needs, the feature word set of the user's needs is extracted. Based on the technical field and word frequency distribution corresponding to the user's needs, the parameters of the BM25 algorithm are dynamically adjusted. Using the BM25 algorithm with adjusted parameters, patent texts are screened from the patent feature library to be matched based on the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vectors of the screened patent texts.
[0070] Users with patent transaction needs can complete the full entry of relevant company information and required key technical information on a pre-defined interactive page. This step is the core foundation for ensuring the accuracy of subsequent patent matching; the completeness and detail of the information directly determine the accuracy of semantic parsing and the relevance of the matching results. Specific content may include: Enterprise Information Filling: Users must truthfully fill in the basic information of the enterprise entity and transaction-related auxiliary information, including but not limited to the full name of the enterprise, unified social credit code, registered address, industry sector (must be selected from the system's preset industry classification list, consistent with the "technical field label" in the patent feature extraction model mentioned above, such as machinery, communications, chemical industry, etc.), business scope, patent transaction intention (clearly specifying whether it is to purchase, transfer, or license the patent, focusing on the scenario of purchasing the patent here), expected transaction budget, and description of the patent application scenario (briefly explaining how the required patent technology will be applied to the enterprise's product research and development, production process, or technology upgrade needs). The above information must be filled in one by one according to the system prompts to ensure that the information is true, complete, and without omissions or false information, providing a basic support for subsequent accurate matching of fields and precise positioning of needs.
[0071] Required Patent Technology Information: This section is the core content. Users need to fill in the relevant information of the required patent technology in detail, specifically including: the field to which the required patent technology belongs (it must strictly correspond to the "Technology Field Label" in the patent feature extraction model above. If the user's required technology involves multiple cross-fields, multiple field labels can be selected, and the priority weight of each field can be marked, such as 60% for the mechanical field and 40% for the electronic field), and an overview of the key technical points of the required technology. The overview of the key technical points should be controlled between 200 and 500 words, and should follow the principle of "the more words, the clearer the technical description, and the more accurate the matching result".
[0072] To guide users in filling in standardized and detailed key technical points, a filling guidance template can be provided (the template content is consistent with the patent module splitting logic in the patent feature extraction model mentioned above). This guides users to describe the technical problem from dimensions such as background, core technical solutions, technical implementation details, technical effects, and detailed application scenarios. For example, if a user needs patent search technology related to the mechanical field, they should describe in detail: "Our company needs an efficient search technology applicable to mechanical parts patents, capable of solving the problems of semantic ambiguity, low search accuracy, and high false negative rate in existing search technologies for mechanical patent terminology; the core technical solution must include a feature extraction mechanism based on modular splitting of patent text, capable of accurately extracting technical features from the claims and embodiments of mechanical patents; the technical implementation process must integrate a mechanical field-specific terminology database to enhance the identification of specific terms such as 'mechanical structure' and 'parts processing'; the technical effect must achieve a search recall rate of no less than 95% and an accuracy rate of no less than 90%, applicable to technical verification and infringement investigation scenarios after the company's mechanical patent procurement."
[0073] By guiding users to fill in key technical points according to the patent module logic, the user's needs are accurately connected with the "patent modularization" and "technical feature extraction" logic in the patent feature extraction model mentioned above. This lays the foundation for subsequent semantic parsing, feature extraction and matching, and reduces matching deviations caused by ambiguous technical descriptions.
[0074] The key technical point summary text filled in by the user is preprocessed. The processing flow refers to the preprocessing logic of the input layer of the patent feature extraction model mentioned above. Specifically, it includes: removing meaningless characters (such as punctuation marks, special symbols, blank characters, etc.) and redundant expressions (such as repeated technical descriptions, polite phrases unrelated to the core technology, etc.) from the text. After that, a coarsely sorted set of patent texts is obtained.
[0075] In this embodiment, a set of feature words representing user needs is extracted. Based on the technical field and word frequency distribution corresponding to the user needs, the parameters of the BM25 algorithm are dynamically adjusted. Using the parameter-adjusted BM25 algorithm, patent texts are screened from the patent feature library to be matched, based on the feature word set, to obtain a coarsely ranked patent text set. This includes: segmenting the token sequence corresponding to the user needs based on a patent terminology definition table, and using an n-gram algorithm to assist in segmentation, adding all segmented words to the feature word set; expanding the technical terms in the segmented words with synonyms and near-synonyms based on the patent terminology definition table, and adding them to the feature word set; dynamically adjusting the word frequency saturation coefficient and document length normalization coefficient of the BM25 algorithm based on the technical field and word frequency distribution corresponding to the user needs; using the parameter-adjusted BM25 algorithm, calculating the inverse document frequency of each feature word using different formulas based on whether each feature word in the feature word set exists in the patent feature library to be matched, and calculating the relevance score between the feature word set and each patent text in the patent feature library to be matched based on the inverse document frequency of each feature word; and selecting a first number of patent texts from high to low according to the relevance score to obtain a coarsely ranked patent text set.
[0076] In the above embodiment, the technical terms among the segmented words are first expanded with synonyms and near-synonyms based on the patent terminology definition table and added to the feature word set. For example, "computer" is expanded to "computer" together with its synonyms and near-synonyms such as "laptop". This widens the search scope and avoids missed detections caused by differences in terminology.
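The synonym expansion step can be sketched as a dictionary lookup. The synonym table below is a hypothetical stand-in for the patent terminology definition table, and the function name is illustrative only.

```python
from typing import Dict, List


def expand_feature_words(words: List[str], synonym_table: Dict[str, List[str]]) -> List[str]:
    """Append synonyms and near-synonyms of each word, without duplicates.

    The original order is preserved, with each word followed by its expansions.
    """
    expanded: List[str] = []
    seen = set()
    for word in words:
        for candidate in [word] + synonym_table.get(word, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


synonym_table = {"gear": ["cogwheel", "gear wheel"]}  # hypothetical entries
feature_words = expand_feature_words(["gear", "bearing", "gear"], synonym_table)
```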
[0077] Subsequently, based on the technical field and word frequency distribution corresponding to the user needs, the word frequency saturation coefficient k1 and the document length normalization coefficient b of the BM25 algorithm are dynamically adjusted. User requirements differ in text length and technical field, so using fixed values for k1 and b would reduce recall accuracy (e.g., long patent texts are prone to word frequency redundancy, while short patent texts are prone to insufficient word frequency). This step therefore adjusts the values of k1 and b dynamically to address these differences. The default values are preset to k1=1.2 and b=0.75, and the specific adjustment rules are as follows. The b value is adjusted according to the technical field: for example, patent texts in the communications field are mostly long and dense with technical terms, so to strengthen the document length normalization and avoid the impact of redundant words in long texts on recall results, b is adjusted to 0.81. Patent texts in the machinery field are mostly short with clear technical features, so to weaken the document length normalization and avoid underestimating the weight of core words in short texts, b is adjusted to 0.55. For the medium-to-long patent texts of the chemical industry, b is adjusted to 0.78. In this way, patent texts of different fields and lengths receive reasonable normalization.
[0078] Adjust the k1 value according to the word frequency distribution of the user input text: After segmenting the user's requirements, count the proportion of high-frequency words (words that appear ≥3 times) in all word segments. If the proportion of high-frequency words exceeds 20%, it indicates that the core technical terms in the user's requirements have a high degree of repetition. In this case, adjust the k1 value to 1.5 to enhance the weight of high-frequency words (core technical terms) and highlight the influence of core technical features. If the proportion of high-frequency words is ≤20%, keep the default value of k1=1.2 to avoid low-frequency words being overly ignored.
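The adjustment rules for k1 and b can be sketched as follows, using the values stated above. The field keys are illustrative labels, and the proportion of high-frequency words is computed here over word occurrences (not over distinct words), which is an assumed reading of the rule.

```python
from collections import Counter
from typing import List, Tuple

DEFAULT_K1 = 1.2
DEFAULT_B = 0.75
# b values per technical field, as given in the description.
FIELD_B = {"communications": 0.81, "machinery": 0.55, "chemical": 0.78}


def adjust_bm25_params(field: str, words: List[str]) -> Tuple[float, float]:
    """Return (k1, b) for a user requirement in the given field.

    k1 is raised to 1.5 when occurrences of words appearing at least three
    times make up more than 20% of all segmented words.
    """
    b = FIELD_B.get(field, DEFAULT_B)
    if not words:
        return DEFAULT_K1, b
    counts = Counter(words)
    high_freq = sum(c for c in counts.values() if c >= 3)
    k1 = 1.5 if high_freq / len(words) > 0.20 else DEFAULT_K1
    return k1, b


repetitive = ["gear"] * 3 + list("abcdefg")                # 3 of 10 occurrences
varied = ["gear"] * 3 + [f"w{i}" for i in range(17)]       # 3 of 20 occurrences
params_machinery = adjust_bm25_params("machinery", repetitive)
params_comms = adjust_bm25_params("communications", varied)
```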
[0079] The formula for BM25 is as follows:
$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
[0080] where $\mathrm{score}(D, Q)$ is the relevance score between the feature word set $Q$ and the patent text $D$, $n$ is the number of feature words in the feature word set, $q_i$ is the i-th feature word, $\mathrm{IDF}(q_i)$ is the inverse document frequency of the i-th feature word, $f(q_i, D)$ is the frequency of the i-th feature word in all module fusion feature vectors of the patent text, $|D|$ is the length of the patent text, i.e. the sum of the lengths of all module fusion feature vectors of the patent text, and $\mathrm{avgdl}$ is the average length of all patent texts in the patent feature library to be matched.
[0081] When $q_i$ exists in the patent feature library to be matched, $\mathrm{IDF}(q_i)$ is calculated from $N$ and $df$, where $N$ represents the total number of patent texts in the patent feature library to be matched, and $df$ represents the number of patent texts containing the word $q_i$.
[0082] When $q_i$ does not exist in the patent feature library to be matched, $\mathrm{IDF}(q_i)$ is calculated using a different formula.
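The coarse-ranking step can be sketched with a standard BM25 scorer. Several points are assumptions and not taken from this embodiment: patent texts are represented as plain token lists, the inverse document frequency for known words uses the common smoothed form ln(1 + (N - df + 0.5) / (df + 0.5)), and words absent from the library receive a hypothetical fallback value `unseen_idf`.

```python
import math
from typing import Dict, List


def idf(word: str, doc_freq: Dict[str, int], n_docs: int, unseen_idf: float = 0.0) -> float:
    """Inverse document frequency, with a separate branch for unseen words."""
    df = doc_freq.get(word, 0)
    if df == 0:
        return unseen_idf  # assumed fallback for words absent from the library
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_score(query: List[str], doc: List[str], doc_freq: Dict[str, int],
               n_docs: int, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Sum the per-word BM25 contributions of the query words for one document."""
    norm = k1 * (1.0 - b + b * len(doc) / avgdl)
    score = 0.0
    for word in query:
        tf = doc.count(word)
        score += idf(word, doc_freq, n_docs) * tf * (k1 + 1.0) / (tf + norm)
    return score


def coarse_rank(query: List[str], docs: Dict[str, List[str]], top_n: int,
                k1: float = 1.2, b: float = 0.75) -> List[str]:
    """Return the ids of the top_n documents by BM25 score, highest first."""
    n_docs = len(docs)
    avgdl = sum(len(words) for words in docs.values()) / n_docs
    doc_freq: Dict[str, int] = {}
    for words in docs.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scored = [(bm25_score(query, words, doc_freq, n_docs, avgdl, k1, b), pid)
              for pid, words in docs.items()]
    scored.sort(key=lambda item: -item[0])
    return [pid for _, pid in scored[:top_n]]


docs = {
    "p1": ["gear", "shaft", "gear"],
    "p2": ["antenna", "signal"],
    "p3": ["gear", "bearing", "housing"],
}
top = coarse_rank(["gear", "cogwheel"], docs, top_n=2)
```

In this toy library "cogwheel" appears in no document, so its term frequency is zero everywhere and it contributes nothing to any score regardless of the fallback value.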
[0083] In step 103, based on the patent feature extraction model, a module fusion feature vector matrix corresponding to the user needs is generated. Based on the module fusion feature vector matrix corresponding to the user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
[0084] In this embodiment, based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain a fine-ranked patent text set. This includes: extracting the domain embedding vector of each patent text in the coarse-ranked patent text set; calculating the domain similarity between each domain embedding vector and the domain embedding vector in the module fusion feature vector matrix corresponding to user needs; removing patent texts with domain similarity lower than a similarity threshold from the coarse-ranked patent text set; performing term frequency-inverse document frequency (TF-IDF) score optimization on the module fusion feature vector matrix corresponding to user needs and on the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set to obtain the user-side TF-IDF feature matrix and the TF-IDF feature matrix of each patent text; calculating the fused token attention weight for each patent text based on the user-side TF-IDF feature matrix and the token attention weights in the module fusion feature vector matrix corresponding to user needs; and filtering patent texts from the coarse-ranked patent text set based on the user-side TF-IDF feature matrix, the TF-IDF feature matrix of each patent text, and the fused token attention weight of each patent text to obtain a fine-ranked patent text set.
[0085] In the above embodiments, the domain embedding vector is the core sub-vector representing the "technical field" in the module fusion feature vector matrix, corresponding to the specific technical field of the user's needs (such as machinery, communication, chemical industry, etc.). Its generation logic is completely consistent with the domain embedding vector generation rules of the patent feature extraction model mentioned above. Operationally, firstly, a domain embedding vector of a preset dimension is extracted from the module fusion feature vector matrix of each patent in the coarse-ranked set; then, using the domain embedding vector in the user's needs module fusion feature vector matrix as a benchmark, the domain similarity between the patent domain embedding vector and the user's needs domain embedding vector is calculated for each patent using the vector cosine similarity algorithm, quantitatively evaluating the degree of fit between the two technical fields. The core purpose of this step is to perform preliminary screening at the domain level, excluding patents that are cross-domain or have excessive domain deviation, reducing the invalid workload of subsequent screening, and laying the foundation for accurate matching.
[0086] The similarity threshold is a preset standard for determining domain suitability (which can be dynamically adjusted according to the degree of technical field segmentation, for example, set to 85%). The threshold setting needs to be combined with the domain priority of user needs (e.g., the threshold for single-domain needs can be higher, while the threshold for cross-domain needs can be appropriately lowered). Operationally, the domain similarity of each patent is compared with the preset threshold, and patents with similarity below the threshold are directly eliminated. These patents deviate too much from the technical field of the user's needs, and even if there is some overlap in the core technology, they are unlikely to meet the user's actual technical needs. The patents retained after elimination form a "domain suitability candidate subset", ensuring that subsequent screening revolves around patents in the same or highly suitable domain as the user's needs, improving screening efficiency and accuracy.
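The domain-level screening can be sketched with cosine similarity and the example threshold of 85%. The toy two-dimensional vectors and the patent identifiers are hypothetical; a zero vector is treated as having similarity 0, which is an assumption made only to keep the sketch well defined.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_domain(user_vec: List[float], patents: Dict[str, List[float]],
                     threshold: float = 0.85) -> List[str]:
    """Keep patents whose domain similarity to the user need reaches the threshold."""
    return [pid for pid, vec in patents.items()
            if cosine_similarity(user_vec, vec) >= threshold]


candidates = filter_by_domain(
    user_vec=[1.0, 0.0],
    patents={"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]},
)
```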
[0087] The core of term frequency-inverse document frequency (TF-IDF) score optimization is to quantify how distinctive and how central each technical term (token) is, so as to solve the problem of "non-core terms interfering with matching accuracy". The optimization logic is completely consistent with the TF-IDF optimization rules mentioned above. Operationally, the module fusion feature vector matrix for user needs and the module fusion feature vector matrix for each patent in the domain suitability candidate subset are processed separately. First, the word frequency information of all tokens in the matrix is extracted and, combined with the overall corpus of the coarse-ranked patent text set, the TF-IDF score of each token is calculated (term frequency reflects how often the token occurs in a single text or in the user needs, and inverse document frequency reflects how rare the token is in the entire patent set; the higher the score, the more central and distinctive the token). Then, the TF-IDF scores are organized in the format of "token × module" to form the user-side TF-IDF feature matrix and the TF-IDF feature matrix of each patent. The dimensions of both are completely consistent with the original module fusion feature vector matrix, so they can be used directly for the subsequent attention weight calculation and similarity matching.
[0088] The token attention weights (original weights) in the fusion feature vector matrix of the user requirement module are allocated through the dual-tower attention layer of the encoding layer during matrix generation, and are used to identify the core importance of each token in the user requirement. The user-side TF-IDF feature matrix is used to correct these original weights, strengthening the weight of core technical terms and weakening the interference of redundant terms. The calculation process is as follows: First, the original attention weights of each token are extracted from the fusion feature vector matrix of the user requirement module. Then, the TF-IDF score of the corresponding token is extracted from the user-side TF-IDF feature matrix. The weighted attention weight of the user requirement token is obtained by the formula "original weight × (1 + TF-IDF score)". Then, the matching between the token features of each patent and the user requirement token is combined, and the user-side weighted attention weights are fused with the relevant features of the patent-side tokens. Finally, the fused token attention weight (W_final) for each patent is calculated. The core function of this weighting is to identify which technical terms in each patent are highly relevant to the core needs of users, providing a key basis for accurate matching in the future.
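The correction formula "original weight × (1 + TF-IDF score)" can be sketched directly. The function name and the sample values are illustrative; the two lists are assumed to be aligned token by token.

```python
from typing import List


def weighted_attention(original: List[float], tfidf: List[float]) -> List[float]:
    """Apply original_weight * (1 + tfidf_score) to each token."""
    return [w * (1.0 + s) for w, s in zip(original, tfidf)]


# A core term (TF-IDF 1.0) has its weight doubled; a term with TF-IDF 0.0 is unchanged.
user_weights = weighted_attention([0.5, 0.25], [1.0, 0.0])
```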
[0089] Step 103 is the core of the fine-grained ranking and screening process, integrating all previous optimization results to achieve "precise matching and selection of the best among the best." Operationally, the attention weights of the fused tokens for each patent are first embedded into the user-side TF-IDF feature matrix and the patent's own TF-IDF feature matrix, respectively, to enhance core features (the TF-IDF scores corresponding to high-weight tokens are further enhanced, while low-weight redundant tokens are suppressed). Then, using the optimized BERT model, the semantic similarity and technical feature matching degree between the enhanced user-side TF-IDF feature matrix and the enhanced patent TF-IDF feature matrix are calculated for each patent. Combined with the core guidance of the fused token attention weights, the final matching score for each patent is output. Finally, all patents are sorted from highest to lowest according to their final matching scores, and a predetermined number (i.e., the second number, e.g., 100) of high-matching patents are selected, while patents with low scores and insufficient core technical fit are removed, ultimately obtaining the fine-ranked patent text set. The core advantage of this collection is that all patents have undergone a full-process screening process of domain adaptation → TF-IDF optimization → attention weight enhancement → precise matching. The technical features are highly aligned with user needs, which can directly provide users with high-quality patent candidates to support subsequent decisions on patent transactions and technology applications.
[0090] In this embodiment, the fused token attention weight for each patent text is calculated based on the token attention weights in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to the user's needs. This includes: calculating a user-side weighted attention weight matrix based on the token attention weights in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to the user's needs; calculating a weighted attention weight matrix for each patent text based on the token attention weights in the term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix of each patent text; and fusing the user-side weighted attention weight matrix with the weighted attention weight matrix of each patent text to obtain the fused token attention weight for each patent text.
[0091] In the above embodiments, the user-side term frequency-inverse document frequency feature matrix is used to characterize the importance and distinctiveness of each technical term (token) in the user's needs, and the token attention weights in the module fusion feature vector matrix corresponding to the user's needs represent the model's original attention to each technical term. By weighting both, the weights of high-distinctiveness and high-core technical terms can be further strengthened on the basis of the original attention, while weakening the influence of redundant and low-frequency terms, thereby obtaining a user-side weighted attention matrix that better reflects the user's actual technical needs.
[0092] For each patent text in the coarse-ranked set, the same calculation logic as the user demand side is adopted. The original token attention weights in the module fusion feature vector matrix are weighted and corrected using the patent's own term frequency-inverse document frequency feature matrix. This makes the patent side's attention weights more focused on its own core technical features, resulting in a weighted attention weight matrix for each patent text. This ensures that the weight calculation rules on the user side and the patent side are consistent and the features can be aligned.
[0093] The user-side weighted attention matrix is matched one-to-one with the corresponding patent text's weighted attention matrix, and then fused using a weighted approach. This process combines the core focus of user needs with the distribution of the patent's technical features to generate a fused token attention weight. This weight accurately reflects which technical terms in the patent are highly relevant to user needs, providing a crucial basis for further screening, similarity calculation, and the generation of a refined patent set.
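As a non-limiting illustration, the weighting and fusion described in paragraphs [0090] to [0093] may be sketched as follows. The sketch assumes that each token carries a single scalar TF-IDF score and a single scalar attention weight (the embodiment operates on matrices), that the weighted attention is normalized to sum to 1, and that the fusion is a convex combination controlled by a hypothetical coefficient `beta`. The function names are likewise assumptions introduced for this sketch.

```python
def weighted_attention(tfidf, attention):
    """Scale each token's original attention by its TF-IDF score, then normalize.

    Assumption: one scalar per token; tokens without a TF-IDF score get weight 0.
    """
    raw = {tok: attention[tok] * tfidf.get(tok, 0.0) for tok in attention}
    total = sum(raw.values())
    return {tok: (v / total if total > 0 else 0.0) for tok, v in raw.items()}


def fuse_attention(user_w, patent_w, beta=0.5):
    """Fuse user-side and patent-side weighted attention token by token.

    Assumption: convex combination with coefficient beta; a token missing on
    one side contributes 0 from that side.
    """
    tokens = set(user_w) | set(patent_w)
    return {
        tok: beta * user_w.get(tok, 0.0) + (1.0 - beta) * patent_w.get(tok, 0.0)
        for tok in tokens
    }


# Usage with made-up scores for one user need and one patent text.
user_w = weighted_attention({"gear": 2.0, "device": 0.5}, {"gear": 0.5, "device": 0.5})
patent_w = weighted_attention({"gear": 1.0, "shaft": 1.0}, {"gear": 0.5, "shaft": 0.5})
w_final = fuse_attention(user_w, patent_w, beta=0.5)
```

In this sketch a token that is distinctive on both sides ("gear") ends up with the largest fused weight, which matches the intent stated above.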
[0094] In this embodiment, the module fusion feature vector matrix corresponding to user needs and the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set are optimized by term frequency-inverse document frequency (TF-IDF) scoring to obtain the user-side TF-IDF feature matrix and the TF-IDF feature matrix of each patent text. This includes: generating a first token term frequency matrix based on the module fusion feature vector matrix corresponding to user needs; generating a second token term frequency matrix for each patent text based on the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set; weighting the first token term frequency matrix and the second token term frequency matrix based on a preset patent module weight matrix to obtain a third and a fourth token term frequency matrix; applying logarithmic smoothing to the third and fourth token term frequency matrices to obtain a fifth and a sixth token term frequency matrix; applying threshold limiting to the fifth and sixth token term frequency matrices to obtain a seventh and an eighth token term frequency matrix; and generating, based on the seventh and eighth token term frequency matrices, a user-side term frequency-inverse document frequency feature matrix and a term frequency-inverse document frequency feature matrix for each patent text.
[0095] In the above embodiments, for the module fusion feature vector matrix M_Q corresponding to user needs, the original term frequencies of all tokens (i.e., the number of times each technical term appears in the text of each module in M_Q) are extracted to generate a first token term frequency matrix tf_Q (rows = tokens, columns = the module texts of M_Q, cell value = the original term frequency of the token in the corresponding module text). For the module fusion feature vector matrix M_D of a single patent text in the coarse-ranked patent text set, the same logic as for M_Q is used: the original term frequencies of all tokens are extracted to generate a second token term frequency matrix tf_D. The original term frequency of each cell in tf_Q is multiplied by the weight coefficient of the corresponding module in the preset patent module weight matrix to obtain the third token term frequency matrix tf_Q1 (cell value = original term frequency × module weight). At the same time, the tokens in M_Q are aligned one-to-one with the tokens in tf_Q1 to ensure that the weighted term frequency of each token accurately corresponds to its module position in M_Q. Using the same weight coefficients as for tf_Q, the original term frequency of each cell in tf_D is multiplied by the corresponding module weight coefficient to obtain the fourth token term frequency matrix tf_D1, and the tokens in M_D are likewise aligned with those in tf_D1. Logarithmic smoothing, in which smoothed term frequency = log(1 + weighted term frequency), is then applied to tf_Q1 and tf_D1 respectively, to avoid feature redundancy caused by excessively high weighted term frequencies and to reduce interference from extremely high-frequency tokens. The cell value of tf_Q2 is calculated as log(1 + the corresponding cell value of tf_Q1), which generates the smoothed weighted term frequency matrix tf_Q2 on the user demand side, referred to as the fifth token term frequency matrix. The cell value of tf_D2 is calculated as log(1 + the corresponding cell value of tf_D1), which generates the smoothed weighted term frequency matrix tf_D2 on the single patent text side, referred to as the sixth token term frequency matrix.
[0096] Threshold restriction processing: Based on the preset TF value threshold for the technical field (e.g., threshold = 0.3 for the mechanical field, threshold = 0.4 for the communication field), tf_Q2 and tf_D2 are filtered as follows. For tf_Q2: tokens with cell values < the field threshold are removed (considered redundant low-frequency tokens), and tokens with cell values ≥ the field threshold are retained to obtain the filtered tf_Q2 (still denoted as tf_Q2), which is the seventh token term frequency matrix. For tf_D2: using the same field threshold as for tf_Q2, redundant low-frequency tokens are removed to obtain the filtered tf_D2 (still denoted as tf_D2), which is the eighth token term frequency matrix.
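As a non-limiting illustration, the module weighting, logarithmic smoothing, and threshold restriction of paragraphs [0095] and [0096] may be sketched as follows. The nested-dictionary layout (token, then module, then value), the module names, and the module weight values are assumptions made for this sketch. The logarithm is assumed to be the natural logarithm, and the threshold is assumed to act per cell, with a token dropped once none of its cells survive.

```python
import math


def weight_tf(tf, module_weights):
    """Third/fourth matrix: original term frequency x module weight."""
    return {
        tok: {m: count * module_weights[m] for m, count in cells.items()}
        for tok, cells in tf.items()
    }


def smooth_tf(tf_weighted):
    """Fifth/sixth matrix: log(1 + weighted term frequency), natural log assumed."""
    return {
        tok: {m: math.log(1 + v) for m, v in cells.items()}
        for tok, cells in tf_weighted.items()
    }


def threshold_tf(tf_smoothed, threshold):
    """Seventh/eighth matrix: drop cells below the field threshold.

    Assumption: a token is removed entirely when no cell reaches the threshold.
    """
    kept = {}
    for tok, cells in tf_smoothed.items():
        surviving = {m: v for m, v in cells.items() if v >= threshold}
        if surviving:
            kept[tok] = surviving
    return kept


# Usage with hypothetical counts and module weights, mechanical-field threshold 0.3.
tf_q = {"gear": {"claims": 3, "abstract": 1}, "the": {"abstract": 1}}
weights = {"claims": 1.5, "abstract": 0.2}
tf_q2 = threshold_tf(smooth_tf(weight_tf(tf_q, weights)), threshold=0.3)
```

Here the low-weight token "the" is removed, while "gear" is retained through its claims cell.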
[0097] IDF value calculation: For each token (denoted as q) in tf_Q2: count the number of patents containing token q (denoted as df_q, i.e. the number of patents containing q in M_D of D_set); For each token (denoted as d) in tf_D2: use the same logic as q to count the number of patents containing token d (denoted as df_d).
[0098] The IDF value calculation rules (adapted to vector matrix scenarios, consistent with the exception handling logic described above) are as follows: When df_q > 0 (token q exists in D_set): idf_q = log((N - df_q + 0.5) / (df_q + 0.5) + 1), where N = 5000; when df_q = 0 (token q is an out-of-vocabulary term and does not exist in D_set): idf_q = log(N + 0.85). The calculation of idf_d for token d follows the same rules as idf_q to ensure consistent calculation logic.
[0099] Domain-based secondary weighting (integrated into this step to ensure logical coherence): The calculated idf_q and idf_d are multiplied by the weighting coefficient α of the corresponding technology field (consistent with the previous steps, such as α=1.1 for the mechanical field and α=0.88 for the agricultural field) to obtain the domain-weighted idf values (idf_q'=idf_q×α, idf_d'=idf_d×α). The final TF-IDF feature matrix is generated as follows: on the user demand side, the smoothed weighted term frequency of each cell in tf_Q2 is multiplied by the corresponding token's idf_q' to obtain the optimized TF-IDF feature matrix tfidf_Q (rows = tokens, columns = modules of M_Q, cell value = smoothed weighted term frequency × idf_q'), which can be called the user-side term frequency-inverse document frequency feature matrix; on the single patent text side, the smoothed weighted term frequency of each cell in tf_D2 is multiplied by the corresponding token's idf_d' to obtain the optimized TF-IDF feature matrix tfidf_D (rows = tokens, columns = modules of M_D, cell value = smoothed weighted term frequency × idf_d'), which can be called the term frequency-inverse document frequency feature matrix for each patent text.
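As a non-limiting illustration, the IDF rules of paragraph [0098] and the domain weighting of paragraph [0099] may be sketched as follows. The formulas and the values N = 5000 and α = 1.1 are taken from the text above. The function names, the nested-dictionary layout, and the use of the natural logarithm are assumptions made for this sketch.

```python
import math


def idf(df, n_docs=5000):
    """IDF per paragraph [0098]; df is the number of patents containing the token."""
    if df > 0:
        return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Out-of-vocabulary token: it appears in no patent of the set.
    return math.log(n_docs + 0.85)


def tfidf_matrix(tf_filtered, doc_freq, alpha, n_docs=5000):
    """Cell value = smoothed weighted term frequency x (idf x alpha)."""
    result = {}
    for tok, cells in tf_filtered.items():
        idf_weighted = idf(doc_freq.get(tok, 0), n_docs) * alpha
        result[tok] = {m: v * idf_weighted for m, v in cells.items()}
    return result


# Usage: one token with a hypothetical document frequency, mechanical field (alpha = 1.1).
tfidf_q = tfidf_matrix({"gear": {"claims": 1.0}}, {"gear": 10}, alpha=1.1)
```

Rarer tokens receive a larger IDF under this rule, and the out-of-vocabulary branch yields a value slightly above that of a token found in a single patent.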
[0100] In this embodiment, based on the user-side term frequency-inverse document frequency feature matrix, the term frequency-inverse document frequency feature matrix of each patent text, and the fused token attention weight of each patent text, patent texts are filtered from the coarse-ranked patent text set to obtain a fine-ranked patent text set. This includes: embedding the fused token attention weight of each patent text into the module fusion feature vector matrix corresponding to the user requirement and the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set to obtain the enhanced module fusion feature vector matrix corresponding to the user requirement and the enhanced module fusion feature vector matrix of each patent text; calculating the similarity score between the enhanced module fusion feature vector matrix corresponding to the user requirement and the enhanced module fusion feature vector matrix of each patent text; and filtering a second number of patent texts from high to low according to the similarity score to obtain the fine-ranked patent text set.
[0101] In the above embodiment, the fused token attention weight (W_final) is the core parameter obtained through the attention-guided weight fusion described earlier. Essentially, it combines the TF-IDF scores from the user demand side and the patent side to assign an importance weight to each technical term (token). The higher the weight value, the more critical the technical feature corresponding to that token (such as core technical points in user demands or core protection points in patents). The embedding operation specifically involves element-wise multiplication of the W_final corresponding to each patent with the module fusion feature vector matrix (M_Q) of the user demand and with the module fusion feature vector matrix (M_D) of the patent itself; that is, the feature vector of each token in the matrix is weighted by its corresponding W_final. This operation highlights the vector representation of core technical features in user needs, strengthens key technical features in patents that match user needs, and suppresses interference from non-core, redundant tokens (low weight). The result is an enhanced user needs module fusion feature vector matrix (M_Q_enhanced) and an enhanced module fusion feature vector matrix for each patent (M_D_enhanced), laying the foundation for accurate similarity calculation and avoiding matching deviations caused by interference from non-core features.
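As a non-limiting illustration, the weight embedding of paragraph [0101] may be sketched as follows. The sketch assumes that each matrix maps a token to its feature vector and that W_final holds one scalar per token. Treating a token without a fused weight as having weight 0 is an assumption of this sketch, and the function name is hypothetical.

```python
def embed_weights(matrix, w_final):
    """Multiply every component of a token's feature vector by that token's W_final.

    Assumption: tokens missing from w_final receive weight 0 and are fully suppressed.
    """
    return {
        tok: [w_final.get(tok, 0.0) * x for x in vec]
        for tok, vec in matrix.items()
    }


# Usage with made-up two-dimensional feature vectors.
m_q = {"gear": [1.0, 2.0], "the": [4.0, 4.0]}
w_final = {"gear": 0.5}
m_q_enhanced = embed_weights(m_q, w_final)
```

The same function would be applied to M_D with the W_final of the corresponding patent to obtain M_D_enhanced.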
[0102] Similarity calculation is performed using the optimized BERT model described earlier. This model is specifically adapted for patent retrieval scenarios, incorporating a matrix feature attention mechanism and the modularity of patent text. It can accurately capture the semantic association and the degree of technical feature matching between two sets of enhanced feature matrices. The calculation process is as follows: the enhanced user demand feature matrix (M_Q_enhanced) is used as the baseline. For each patent in the coarse-ranked patent set, its enhanced feature matrix (M_D_enhanced) is input into the optimized BERT model together with the baseline. The model first aligns the module structure and token features of the two matrices (ensuring a one-to-one comparison between the user needs and the corresponding modules and technical terms in the patent). Then, using the vector space cosine similarity algorithm, it quantifies the degree of fit between the two matrices in terms of technical semantics, module adaptability, and core feature matching. Finally, it outputs a similarity score for each patent. The scores are uniformly normalized to the range 0-100. The higher the score, the better the fit between the patent's technical features and the core demands of the user, and the more accurate the technical match.
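As a non-limiting illustration, the cosine-similarity step of paragraph [0102] may be sketched as follows, without the optimized BERT model. Alignment is approximated here by iterating over the sorted union of tokens and padding missing tokens with zero vectors. The mapping to 0-100 (negative cosine values clamped to 0, then multiplied by 100) is an assumption, since the text above does not state the normalization rule. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_score(m_q_enhanced, m_d_enhanced, dim):
    """Score in [0, 100] between two token -> vector matrices of dimension dim."""
    tokens = sorted(set(m_q_enhanced) | set(m_d_enhanced))
    zero = [0.0] * dim
    # Flatten both matrices in the same token order so components line up.
    u = [x for t in tokens for x in m_q_enhanced.get(t, zero)]
    v = [x for t in tokens for x in m_d_enhanced.get(t, zero)]
    cos = min(1.0, max(0.0, cosine(u, v)))
    return 100.0 * cos


# Usage with made-up enhanced matrices.
score = similarity_score({"gear": [0.5, 1.0]}, {"gear": [0.5, 1.0], "shaft": [0.2, 0.0]}, dim=2)
```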
[0103] The similarity scores of all patents in the coarse-ranked patent text set are sorted, strictly following the rule of "score from high to low," ensuring that patents with the highest matching degree are selected first. The second number is a preset target number for fine-ranking selection (set according to the actual patent search scenario, such as 100 patents). This number can be dynamically adjusted according to actual search accuracy requirements and the complexity of the user's needs, but must be less than the number of patents in the coarse-ranked patent set. The selection process involves choosing the top second-number patents from the sorted list and removing patents that are ranked lower or have low similarity scores (below a preset threshold, which can be set as needed), ultimately forming the fine-ranked patent text set. The core feature of this set is that all patents have undergone core feature enhancement and accurate similarity calculation, with technical features highly aligned with user needs. Compared to the coarse-ranked set, the accuracy is significantly improved, directly providing users with high-quality patent candidates to support subsequent patent transactions and selection decisions.
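As a non-limiting illustration, the selection rule of paragraph [0103] may be sketched as follows. The patent identifiers, the function name, and the default minimum score are hypothetical. The sketch assumes that the minimum-score filter is applied after truncation to the second number.

```python
def select_fine_ranked(scores, second_number, min_score=0.0):
    """Return patent ids of the top `second_number` scores, highest first.

    Patents whose score is below `min_score` are removed even if they rank
    within the top `second_number`.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pid for pid, score in ranked[:second_number] if score >= min_score]


# Usage with made-up similarity scores on the 0-100 scale.
scores = {"A": 91.0, "B": 40.0, "C": 75.5, "D": 88.0}
fine_ranked = select_fine_ranked(scores, second_number=3, min_score=50.0)
```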
[0104] This invention also provides a patent matching device based on user needs, as described in the following embodiments. Since the principle by which this device solves the problem is similar to that of the patent matching method based on user needs, the implementation of this device can refer to the implementation of the patent matching method based on user needs, and repeated details will not be elaborated further.
[0105] Figure 4 is a structural diagram of a patent matching device based on user needs in an embodiment of the present invention. As shown in Figure 4, the device includes: the patent feature library construction module 401, which is used to input historical patent texts into the patent feature extraction model, generate multiple module fusion feature vector matrices for each historical patent text, vectorize all module fusion feature vector matrices, and store them in the patent feature library to be matched, wherein the patent feature extraction model is based on a bidirectional encoder representation model based on the Transformer architecture and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism; the coarse-ranking module 402, which is used to extract the feature word set of the user's needs after obtaining the user's needs, dynamically adjust the parameters of the BM25 algorithm according to the technical field and term frequency distribution corresponding to the user's needs, and use the BM25 algorithm with adjusted parameters to filter patent texts from the patent feature library to be matched according to the feature word set to obtain a coarse-ranked patent text set, wherein the coarse-ranked patent text set includes the module fusion feature vector of the selected patent texts; and the fine-ranking module 403, which is used to generate a module fusion feature vector matrix corresponding to user needs based on the patent feature extraction model, and to filter patent texts from the coarse-ranked patent text set according to the module fusion feature vector matrix corresponding to user needs, so as to obtain the fine-ranked patent text set.
[0106] In this embodiment, the patent feature extraction model includes: The input layer is used to generate an embedding vector matrix corresponding to each module text of the patent text based on the patent text, the patent terminology definition table, the technical field tags, and the preset patent structure tags. The encoding layer is used to fuse contextual semantics and module structure priority in the embedding vector matrix corresponding to each module text, and generate a module fusion feature vector matrix based on the domain fusion coefficient; The task output layer is used to perform mask position token prediction, sentence pair fine-grained distance prediction, and token module text prediction based on the module fusion feature vector matrix. It calculates the corresponding loss based on the predicted mask position token prediction result, sentence pair fine-grained distance, and token module text. All calculated losses are fused to obtain the fusion loss. The parameters of the patent feature extraction model are updated through backpropagation based on the fusion loss.
[0107] In this embodiment, the input layer is used to: split each patent text into a module text sequence according to a preset patent structure label, and form a patent structure embedding vector; segment each module text in the module text sequence into a token sequence, and map the token sequence into a fixed-dimensional token embedding vector; encode the position information of each token sequence in the module text sequence to form a position embedding vector; convert the technical field label of each patent text into a domain embedding vector; map the patent term definitions in the patent term definition table into an external knowledge embedding vector; extract three-dimensional technical features of function, behavior and structure from each module text, form triples, and form a three-dimensional technical feature embedding vector; encode the examination process and legal status of the patent text to form a life cycle embedding vector; for each module text, obtain a feature word importance feature vector based on the extracted word frequency-inverse document frequency features and the generated word frequency-inverse document frequency word weight vector; and concatenate the patent structure embedding vector, token embedding vector, position embedding vector, domain embedding vector, external knowledge embedding vector, three-dimensional technical feature embedding vector, life cycle embedding vector and feature word importance feature vector by dimension to form an embedding vector matrix corresponding to each module text.
[0108] In this embodiment, the encoding layer includes: a domain-adaptive masking layer, which is used to mask the preset core patent terms in each embedding vector matrix with a first probability and the preset general words with a second probability according to preset masking rules, wherein the first probability is greater than the second probability, and to perform semantic completion on the tokens masked under the first probability and under the second probability, to obtain the masked embedding vector matrix and the mask loss value of the module text; a dual-tower attention layer, which is used to perform self-attention operations on the masked embedding vector matrix and to introduce term frequency-inverse document frequency attention-guided weight calculation logic to generate a guided self-attention weight matrix and a guided structural attention weight matrix, wherein the guided self-attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module semantic feature vector that integrates contextual semantics, and the guided structural attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module structural feature vector that integrates module structure priority; and a cross-layer fusion gating layer, which is used to dynamically adjust the dual-tower fusion coefficients according to the domain fusion coefficient of the patent's technical field, to fuse the module semantic feature vector and the module structural feature vector token by token according to the dual-tower fusion coefficients to obtain a preliminary module fusion feature vector matrix, and to filter the preliminary module fusion feature vector matrix by the gating coefficient to obtain the module fusion feature vector matrix.
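As a non-limiting illustration, the two-probability masking performed by the domain-adaptive masking layer may be sketched at the token level as follows. The embodiment masks entries of the embedding vector matrix, whereas this sketch replaces token strings. The probability values 0.3 and 0.1, the `[MASK]` placeholder, and the function name are assumptions; only the requirement that the first probability exceed the second comes from the text above.

```python
import random


def domain_adaptive_mask(tokens, core_terms, p_core=0.3, p_general=0.1,
                         mask_token="[MASK]", rng=None):
    """Mask core patent terms with p_core and all other tokens with p_general.

    Returns the masked token list and the masked positions, which would serve
    as prediction targets for the mask-position token prediction task.
    """
    if not p_core > p_general:
        raise ValueError("first probability must be greater than the second")
    rng = rng or random.Random()
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        p = p_core if tok in core_terms else p_general
        if rng.random() < p:
            masked.append(mask_token)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


# Usage: a seeded generator makes the sketch reproducible.
masked, positions = domain_adaptive_mask(
    ["the", "planetary", "gear", "rotates"],
    core_terms={"planetary", "gear"},
    rng=random.Random(0),
)
```

Masking core terms more often forces the model to reconstruct domain terminology from context more frequently than general words.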
[0109] In this embodiment, the task output layer is used to: predict the token at each mask position under the first probability mask and the second probability mask based on the module fusion feature vector matrix; calculate the cross-entropy loss based on the token prediction result and the actual token at the mask position; predict the fine-grained distance of sentence pairs in each module text based on the module fusion feature vector matrix; calculate the mean squared error loss based on the predicted fine-grained distance and the fine-grained distance label of each sentence pair, wherein the fine-grained distance is adaptively divided according to the total number of sentences in the module text; predict the module text to which each token belongs based on the module fusion feature vector matrix; calculate the classification loss based on the predicted module text to which the token belongs and the patent structure embedding vector; and optimize the parameters of the patent feature extraction model through backpropagation based on the fusion of the cross-entropy loss, the mean squared error loss, and the classification loss.
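As a non-limiting illustration, the three losses and their fusion may be sketched as follows. The text above does not state how the losses are fused, so the weighted sum and its default weights are assumptions, as are the function names. Backpropagation itself is omitted because it depends on the training framework.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the true class given predicted probabilities."""
    return -math.log(probs[target_index])


def mean_squared_error(predicted, labels):
    """Mean squared error between predicted and labelled sentence-pair distances."""
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)


def fused_loss(mask_loss, distance_loss, module_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (the fusion rule is an assumption)."""
    w_mask, w_dist, w_module = weights
    return w_mask * mask_loss + w_dist * distance_loss + w_module * module_loss


# Usage with made-up predictions for one masked token, two sentence pairs,
# and one token-to-module classification.
loss = fused_loss(
    cross_entropy([0.7, 0.2, 0.1], target_index=0),
    mean_squared_error([1.0, 3.0], [1.0, 2.0]),
    cross_entropy([0.1, 0.9], target_index=1),
)
```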
[0110] In this embodiment, the coarse-ranking module is used to: segment the token sequence corresponding to the user's needs based on the patent terminology definition table, and use the n-gram algorithm to assist in the segmentation, adding all the segmented words to the feature word set; expand the technical terms in the segmented words with synonyms and near-synonyms based on the patent terminology definition table, and add them to the feature word set; dynamically adjust the word frequency saturation coefficient and document length normalization coefficient of the BM25 algorithm according to the technical field and word frequency distribution corresponding to the user's needs; use the parameter-adjusted BM25 algorithm to calculate the inverse document frequency of each feature word using different formulas based on whether each feature word in the feature word set exists in the patent feature library to be matched, and calculate the relevance score between the feature word set and each patent text in the patent feature library to be matched based on the inverse document frequency of each feature word; and filter the first number of patent texts from high to low according to the relevance score to obtain the coarse-ranked patent text set.
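As a non-limiting illustration, BM25 scoring with adjustable parameters may be sketched as follows. The scoring formula is the standard BM25 form, with k1 as the term frequency saturation coefficient and b as the document length normalization coefficient. The adjustment rule in `adjust_params` and every number in it are purely hypothetical placeholders, since this section does not state the actual rule. The two IDF formulas are borrowed from the fine-ranking description in paragraph [0098], on the assumption that coarse ranking uses the same ones.

```python
import math


def idf(df, n_docs):
    """Two-branch IDF (formulas of paragraph [0098], assumed to apply here too)."""
    if df > 0:
        return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return math.log(n_docs + 0.85)


def adjust_params(field, high_freq_ratio):
    """Hypothetical rule: per-field base values, k1 raised with the share of
    high-frequency terms in the feature word set. All values are placeholders."""
    base = {"mechanical": (1.2, 0.75), "communication": (1.5, 0.70)}
    k1, b = base.get(field, (1.2, 0.75))
    return k1 + 0.5 * high_freq_ratio, b


def bm25_score(feature_words, doc_tf, doc_len, avg_len, doc_freq, n_docs, k1, b):
    """Standard BM25 relevance of one patent text to the feature word set."""
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    score = 0.0
    for word in feature_words:
        tf = doc_tf.get(word, 0)
        if tf == 0:
            # A word absent from this text adds nothing. An out-of-library
            # word is absent from every text, so its IDF branch does not
            # change the score in this sketch.
            continue
        score += idf(doc_freq.get(word, 0), n_docs) * tf * (k1 + 1) / (tf + length_norm)
    return score


# Usage with made-up statistics for a single patent text.
k1, b = adjust_params("mechanical", high_freq_ratio=0.4)
score = bm25_score(["gear", "shaft"], {"gear": 2}, 100, 100, {"gear": 10}, 5000, k1, b)
```

Sorting all patent texts by this score and keeping the first number of them would then yield the coarse-ranked set.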
[0111] In this embodiment, the fine-ranking module is used to: extract the domain embedding vector of each patent text in the coarse-ranked patent text set; calculate the domain similarity between each domain embedding vector and the domain embedding vector in the module fusion feature vector matrix corresponding to the user requirement; remove patent texts with domain similarity lower than the similarity threshold from the coarse-ranked patent text set; optimize the term frequency-inverse document frequency score of the module fusion feature vector matrix corresponding to the user requirement and the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set to obtain the user-side term frequency-inverse document frequency feature matrix and the term frequency-inverse document frequency feature matrix of each patent text; calculate the fused token attention weight for each patent text based on the token attention weight in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to the user requirement; and filter patent texts from the coarse-ranked patent text set based on the user-side term frequency-inverse document frequency feature matrix, the term frequency-inverse document frequency feature matrix of each patent text, and the fused token attention weight of each patent text to obtain the fine-ranked patent text set.
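As a non-limiting illustration, the domain similarity filtering at the start of fine ranking may be sketched as follows. The use of cosine similarity as the domain similarity measure, the threshold value, and the function names are assumptions, since the text above refers only to a "domain similarity" and a "similarity threshold".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_domain(user_domain_vec, patent_domain_vecs, threshold=0.5):
    """Keep ids of patents whose domain embedding is close enough to the user's."""
    return [
        pid
        for pid, vec in patent_domain_vecs.items()
        if cosine(user_domain_vec, vec) >= threshold
    ]


# Usage with made-up two-dimensional domain embedding vectors.
kept = filter_by_domain(
    [1.0, 0.0],
    {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]},
    threshold=0.7,
)
```

Only the patents that pass this filter would proceed to the TF-IDF optimization and attention fusion steps.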
[0112] In this embodiment, the fine-ranking module is used to: calculate a user-side weighted attention weight matrix based on the token attention weights in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to user needs; calculate a weighted attention weight matrix for each patent text based on the term frequency-inverse document frequency feature matrix of each patent text and the token attention weights in the module fusion feature vector matrix of each patent text; and fuse the user-side weighted attention weight matrix with the weighted attention weight matrix of each patent text to obtain the fused token attention weight for each patent text.
[0113] In this embodiment, the fine-ranking module is used to: generate a first token frequency matrix based on the module fusion feature vector matrix corresponding to the user's needs; generate a second token frequency matrix for each patent text based on the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set; weight the first and second token frequency matrices according to the preset patent module weight matrix to obtain a third and a fourth token frequency matrix; perform logarithmic smoothing on the third and fourth token frequency matrices to obtain a fifth and a sixth token frequency matrix; perform threshold limiting on the fifth and sixth token frequency matrices to obtain a seventh and an eighth token frequency matrix; and generate a user-side term frequency-inverse document frequency feature matrix and a term frequency-inverse document frequency feature matrix for each patent text based on the seventh and eighth token frequency matrices.
[0114] In this embodiment, the fine-ranking module is used to: embed the fused token attention weight of each patent text into the module fusion feature vector matrix corresponding to the user demand and the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set, to obtain the enhanced module fusion feature vector matrix corresponding to the user demand and the enhanced module fusion feature vector matrix of each patent text; calculate the similarity score between the enhanced module fusion feature vector matrix corresponding to the user demand and the enhanced module fusion feature vector matrix of each patent text; and filter a second number of patent texts from high to low according to the similarity score to obtain the fine-ranked patent text set.
[0115] This invention also provides a computer device. Figure 5 is a schematic diagram of a computer device in an embodiment of the present invention. The computer device 500 includes a memory 510, a processor 520, and a computer program 530 stored in the memory 510 and executable on the processor 520. When the processor 520 executes the computer program 530, it implements the above-mentioned patent matching method based on user needs.
[0116] This invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned patent matching method based on user needs.
[0117] This invention also provides a computer program product, which includes a computer program that, when executed by a processor, implements the aforementioned patent matching method based on user needs.
[0118] Compared to existing technologies that decompose user needs into basic keywords and then directly perform full-text retrieval in patent databases, the method of this invention extracts patent features using a Bidirectional Encoder Representation Model (BERT) based on the Transformer architecture. This overcomes the limitations of traditional feature extraction models that only focus on textual semantics and ignore domain characteristics and text granularity differences. The term frequency-inverse document frequency mechanism weakens the interference of common redundant words, adapting to the characteristics of patent texts with dense technical terminology and high domain differentiation. Simultaneously, combined with a multi-granularity sentence distance supervision mechanism, it can accurately capture the semantic relationships and feature differences of different modules of the patent text (title, abstract, claims, etc.). The generated module fusion feature vector matrix can comprehensively and accurately represent the core technical features of each historical patent. This feature vector matrix is vectorized and stored in the patent feature library to be matched, ensuring that the patent features in the feature library have high recognizability and high completeness, providing reliable feature support for subsequent fast and accurate matching, and reducing the problems of missed detections and false detections caused by feature extraction bias. The coarse-ranking module overcomes the shortcomings of fixed parameters in existing BM25 algorithms by dynamically adjusting the core parameters of the BM25 algorithm based on the technical field and word frequency distribution corresponding to user needs. 
This allows for flexible parameter adjustment to adapt to different scenario requirements, considering the characteristics of patent texts in different technical fields (such as communications, machinery, and chemical engineering) and the differences in word frequency distribution of the user's demand feature word set (e.g., different proportions of high-frequency core technical terms). This avoids problems such as low recall accuracy and excessive redundant patents caused by fixed parameters. Using the parameter-adjusted BM25 algorithm, combined with the user's demand feature word set, patents are screened from the patent feature library to be matched. This quickly identifies a coarse-ranked set of patent texts that initially aligns with the user's demand technical field and core features. This ensures high recall (avoiding the omission of potential matching patents) while effectively controlling the size of the coarse-ranked set, eliminating a large number of patents irrelevant to user needs, significantly reducing the ineffective workload of subsequent fine-ranking, and improving the overall efficiency of patent matching. The fine-ranking module employs a patent feature extraction model consistent with historical patent feature extraction, generating a module-fused feature vector matrix corresponding to user needs. This ensures complete consistency in the extraction logic and format between user-demand features and historical patent features, avoiding matching deviations caused by inconsistencies in feature dimensions and extraction rules. Simultaneously, using the module-fused feature vector matrix corresponding to user needs as the core search condition, it precisely filters from the coarse-ranked patent text set. This focuses on the core technical features of user needs, achieving a deep technical match between user needs and patent texts. 
Compared to existing fine-ranking methods based solely on keyword matching, this significantly improves the accuracy of the fine-ranked patent text set. The final output of the fine-ranked patent text set accurately matches the user's personalized technical needs, providing high-quality candidate solutions for subsequent patent transactions and technical references, reducing the time cost for users to screen patents, and enhancing the user experience. Through a complete process design of "patent feature database construction - coarse ranking and screening - fine ranking and screening," a complete patent matching closed loop is formed: the optimized design of the patent feature extraction model enables it to adapt to patent texts in different technical fields; the dynamic adjustment of the parameters of the BM25 algorithm enables it to adapt to the word frequency distribution and domain characteristics of different user needs; and the dual-stage screening mode of coarse and fine ranking ensures both matching efficiency and accuracy. The entire technical solution requires no manual intervention in parameter adjustment and feature screening, has a high degree of automation, and can be stably applied to patent matching scenarios with multiple technical fields and diverse user needs. It solves the problems of weak adaptability, poor stability, and high manual costs of existing patent matching technologies, and has broad application prospects.
[0119] In this embodiment of the invention, the hierarchical structure of the patent feature extraction model is clearly defined. Through the collaborative work of the input layer, encoding layer, and task output layer, the semantic, structural, and domain features of the patent text are accurately integrated. At the same time, the model parameters are optimized by multi-task loss fusion, which further improves the extraction accuracy of the module fusion feature vector matrix and enhances the generalization ability of the model.
[0120] In this embodiment of the invention, the feature extraction logic of the input layer is further refined, and multiple types of embedding vectors such as patent structure, token, location, domain, and external knowledge are integrated to comprehensively capture the core technical features and additional information of the patent text, reduce feature loss, and make the generated module text embedding vector matrix more complete and professional, laying a better foundation for subsequent encoding layer processing.
[0121] In this embodiment of the invention, the accuracy and domain adaptability of the module fusion feature vector matrix are further improved by optimizing the coding layer structure, strengthening the feature expression of core patent terms through domain adaptive masking, accurately fusing contextual semantics and module structure priority with the help of dual-tower attention layers, and dynamically adapting to the characteristics of patents in different domains by combining cross-layer fusion gating layers.
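The token-by-token fusion and gating described above can be illustrated with a minimal sketch. The source does not give the fusion formula, so the sigmoid mapping of the domain fusion coefficient, the convex combination of the two tower outputs, and the names `fuse_towers`, `domain_coeff`, and `gate_logits` are all assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_towers(semantic, structural, domain_coeff, gate_logits):
    """Fuse two tower outputs token by token, then scale each token by a gate.

    semantic, structural: lists of per-token feature vectors (same shape).
    domain_coeff: scalar domain fusion coefficient (assumed to be a raw logit).
    gate_logits: one raw gating value per token (assumed representation).
    """
    alpha = sigmoid(domain_coeff)  # assumed mapping into (0, 1)
    fused = []
    for sem_row, str_row, logit in zip(semantic, structural, gate_logits):
        gate = sigmoid(logit)
        # Convex combination of the two towers, then gating (assumed form).
        fused.append([gate * (alpha * s + (1.0 - alpha) * t)
                      for s, t in zip(sem_row, str_row)])
    return fused
```

With `domain_coeff = 0.0` both towers contribute equally; larger values shift weight toward the semantic tower under this assumed parameterization.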
[0122] In this embodiment of the invention, multi-task collaborative training (mask token prediction, sentence pair fine-grained distance prediction, and token-to-module text prediction) is also used to integrate multi-type loss backpropagation optimization models, which effectively solves the feature extraction bias problem caused by single-task training and improves the model's ability to capture and extract patent text features.
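As an illustration of the multi-task loss fusion, the following sketch combines a cross-entropy loss (mask token prediction), a mean squared error loss (sentence pair distance prediction), and a second cross-entropy loss (token-to-module prediction) into one scalar. The weighted-sum form and the default weights are assumptions; the source states only that the losses are fused before backpropagation.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class for one prediction."""
    return -math.log(probs[target_index])


def mean_squared_error(predicted, labels):
    """Average squared difference between predicted and labelled distances."""
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)


def fused_loss(mask_loss, distance_loss, module_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (weights are hypothetical)."""
    w_mask, w_dist, w_mod = weights
    return w_mask * mask_loss + w_dist * distance_loss + w_mod * module_loss
```

In a real training loop the fused scalar would be the quantity passed to the optimizer; here it is computed on plain Python numbers purely to show the arithmetic.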
[0123] In this embodiment of the invention, the logic for extracting user demand feature words is optimized, and the integrity of feature words is ensured by word segmentation using a professional vocabulary list and expansion of synonyms. At the same time, the core parameters of the BM25 algorithm are dynamically adjusted to adapt to different fields and word frequency distributions, and the inverse document frequency calculation method is optimized to further improve the recall accuracy and efficiency of coarse ranking and reduce coarse ranking redundancy.
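The coarse ranking step can be sketched as follows. This is a minimal illustration using the commonly used BM25 scoring form, not the patent's own formulas: the per-domain defaults in `DOMAIN_DEFAULTS`, the rule in `adjust_params` that lowers the term frequency saturation coefficient `k1` when the query repeats terms, and the fixed `unseen_idf` value for feature words absent from the library are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-domain starting values for (k1, b); the source gives none.
DOMAIN_DEFAULTS = {
    "communications": (1.2, 0.75),
    "machinery": (1.5, 0.70),
    "chemical engineering": (1.8, 0.60),
}


def adjust_params(domain, query_terms):
    """Pick (k1, b) from the domain, then damp k1 when the query repeats terms."""
    k1, b = DOMAIN_DEFAULTS.get(domain, (1.2, 0.75))
    counts = Counter(query_terms)
    repeated = sum(c for c in counts.values() if c > 1)
    repeated_share = repeated / max(len(query_terms), 1)
    # Assumed rule: a query dominated by repeated terms saturates sooner.
    return k1 * (1.0 - 0.5 * repeated_share), b


def inverse_document_frequency(term, docs, unseen_idf=0.0):
    """Smoothed IDF for terms found in the library, a fixed value otherwise."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return unseen_idf  # assumed handling of terms absent from the library
    return math.log((len(docs) - containing + 0.5) / (containing + 0.5) + 1.0)


def coarse_rank(query_terms, docs, domain, top_n):
    """Return indices of the top_n documents by BM25 score, best first."""
    k1, b = adjust_params(domain, query_terms)
    avg_len = sum(len(doc) for doc in docs) / len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_terms):
            freq = tf[term]
            if freq == 0:
                continue
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += (inverse_document_frequency(term, docs)
                      * freq * (k1 + 1.0) / (freq + k1 * length_norm))
        scores.append(score)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]
```

Each document is represented here as a list of tokens; in the described system the feature word set would first be expanded with synonyms from the patent terminology definition table before scoring.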
[0124] In this embodiment of the invention, the fine-ranking and screening process is further refined. First, cross-domain patents are removed in a second elimination pass based on domain similarity. Then, after TF-IDF score optimization and fused token attention weight calculation, the process focuses on the core technical features to achieve precise fine-ranking and screening, which greatly improves the fit between the fine-ranked patent text set and user needs.
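The domain-based elimination step can be sketched as a similarity filter over domain embedding vectors. Cosine similarity and the threshold value of 0.5 are assumptions; the source specifies only that patents whose domain similarity falls below a similarity threshold are removed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_cross_domain(user_domain_vec, candidates, threshold=0.5):
    """Keep the ids of candidates whose domain vector is close to the user's.

    candidates: list of (patent_id, domain_embedding) pairs.
    threshold: hypothetical similarity cut-off.
    """
    return [pid for pid, vec in candidates
            if cosine_similarity(user_domain_vec, vec) >= threshold]
```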
[0125] In this embodiment of the invention, the calculation logic of the attention weight of the fused token is also clarified. By calculating the weighted attention weight matrices of the user side and the patent side separately and then fusing them, the rationality of the weight fusion is ensured, the correlation between the patent and the core features of user needs is accurately reflected, and more reliable weight support is provided for fine ranking and screening, thereby improving the accuracy of fine ranking and matching.
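One possible reading of this two-sided weight fusion is sketched below. The source does not define the fusion operator, so several choices here are assumptions: weights are stored as token-to-value dictionaries, each side's weighted attention is the normalized product of attention weight and TF-IDF score, and the two sides are fused by multiplying the weights of shared tokens and renormalizing.

```python
def weighted_attention(attention, tfidf):
    """Scale each token's attention weight by its TF-IDF score and normalize."""
    raw = {tok: w * tfidf.get(tok, 0.0) for tok, w in attention.items()}
    total = sum(raw.values())
    if total == 0.0:
        return {tok: 0.0 for tok in raw}
    return {tok: w / total for tok, w in raw.items()}


def fuse_attention(user_weights, patent_weights):
    """Fuse both sides over shared tokens (product rule is an assumption)."""
    shared = user_weights.keys() & patent_weights.keys()
    raw = {tok: user_weights[tok] * patent_weights[tok] for tok in shared}
    total = sum(raw.values())
    if total == 0.0:
        return {tok: 0.0 for tok in raw}
    return {tok: w / total for tok, w in raw.items()}
```

Under the product rule, a token receives a high fused weight only when it matters on both the user side and the patent side, which matches the stated goal of reflecting the correlation between the patent and the core features of the user need.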
[0126] In this embodiment of the invention, the TF-IDF score optimization process is further refined by using steps such as module weighting, logarithmic smoothing, and threshold limiting to weaken redundant low-frequency token interference, strengthen the feature expression of core technical terms, make the generated TF-IDF feature matrix more distinctive, and improve the accuracy of subsequent attention weight calculation and fine ranking.
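The three optimization steps (module weighting, logarithmic smoothing, threshold limiting) can be sketched on plain token counts. The module weights in `MODULE_WEIGHTS`, the use of `log(1 + x)` as the smoothing function, and the cap value are hypothetical; the source names the steps but not their parameters.

```python
import math

# Hypothetical patent module weights; the source does not list values.
MODULE_WEIGHTS = {"claims": 2.0, "abstract": 1.5, "description": 1.0}


def optimized_term_frequency(module_counts, cap=3.0):
    """Apply module weighting, log smoothing, and an upper threshold.

    module_counts: {module_name: {token: raw_count}}.
    cap: hypothetical upper limit applied after smoothing.
    """
    weighted = {}
    for module, counts in module_counts.items():
        weight = MODULE_WEIGHTS.get(module, 1.0)
        for token, count in counts.items():
            weighted[token] = weighted.get(token, 0.0) + weight * count
    smoothed = {token: math.log1p(value) for token, value in weighted.items()}
    return {token: min(value, cap) for token, value in smoothed.items()}


def tfidf_features(term_frequency, idf):
    """Multiply each optimized term frequency by its inverse document frequency."""
    return {token: tf * idf.get(token, 0.0) for token, tf in term_frequency.items()}
```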
[0127] In this embodiment of the invention, the final selection logic of the fine ranking is further defined: the fused attention weights are embedded to strengthen the core feature vector matrix, and the candidates are sorted by similarity score. This further enhances the accuracy and efficiency of the fine-ranking selection, ensuring that the final output set of fine-ranked patent texts consists of highly matched candidates that meet the user's personalized needs.
Those skilled in the art should understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0128] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0129] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0130] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0131] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A patent matching method based on user needs, characterized in that the method comprises: Historical patent texts are input into the patent feature extraction model to generate multiple module fusion feature vector matrices for each historical patent text. All module fusion feature vector matrices are vectorized and stored in the patent feature library to be matched. The patent feature extraction model is built on a bidirectional encoder representation model based on the Transformer architecture and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism. After obtaining user needs, the feature word set of user needs is extracted. Based on the technical field and word frequency distribution corresponding to user needs, the parameters of the BM25 algorithm are dynamically adjusted. Using the BM25 algorithm with adjusted parameters, patent texts are screened from the patent feature library to be matched based on the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vector matrices of the screened patent texts. Based on the patent feature extraction model, a module fusion feature vector matrix corresponding to user needs is generated. Based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
2. The method as described in claim 1, characterized in that the patent feature extraction model includes: The input layer is used to generate an embedding vector matrix corresponding to each module text of the patent text based on the patent text, the patent terminology definition table, the technical field tags, and the preset patent structure tags. The encoding layer is used to fuse contextual semantics and module structure priority in the embedding vector matrix corresponding to each module text, and generate a module fusion feature vector matrix based on the domain fusion coefficient; The task output layer is used to perform mask position token prediction, sentence pair fine-grained distance prediction, and token module text prediction based on the module fusion feature vector matrix. It calculates the corresponding losses based on the predicted mask position token, the predicted sentence pair fine-grained distance, and the predicted token module text. All calculated losses are fused to obtain the fusion loss. The parameters of the patent feature extraction model are updated through backpropagation based on the fusion loss.
3. The method as described in claim 2, characterized in that, The input layer is used for: Each patent text is split into a module text sequence according to a preset patent structure label, and a patent structure embedding vector is formed. Each module text in the module text sequence is segmented into a token sequence, and the token sequence is mapped into a fixed-dimensional token embedding vector; Encode the position information of each token sequence in the module text sequence to form a position embedding vector; Convert the technical field tag of each patent text into a domain embedding vector; Map the patent terminology definitions in the patent terminology definition table to external knowledge embedding vectors; Extract functional, behavioral, and structural three-dimensional technical features from the text of each module, form triples, and then form a three-dimensional technical feature embedding vector. The examination process and legal status of patent texts are encoded into a lifecycle embedding vector; For each module text, the feature word importance feature vector is obtained based on the extracted term frequency-inverse document frequency features and the generated term frequency-inverse document frequency word weight vector; The embedding vectors of patent structure, token, location, domain, external knowledge, three-dimensional technical features, life cycle, and feature word importance are concatenated according to their dimensions to form the embedding vector matrix corresponding to each module text.
4. The method as described in claim 2, characterized in that, The encoding layer includes: The domain-adaptive masking layer is used to mask the preset core patent terms in each embedding vector matrix with a first probability according to preset masking rules, and to mask the preset general words with a second probability, wherein the first probability is greater than the second probability; semantic completion is performed on each token masked with the first probability or the second probability to obtain the masked embedding vector matrix and the mask loss value of the module text; A dual-tower attention layer is used to perform self-attention operations on the masked embedding vector matrix. It introduces term frequency-inverse document frequency attention-guided weight calculation logic to generate a guided self-attention weight matrix and a guided structural attention weight matrix. The guided self-attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module semantic feature vector that integrates contextual semantics. The guided structural attention weight matrix is multiplied with the masked embedding vector matrix to obtain the module structural feature vector that integrates module structure priority. A cross-layer fusion gating layer is used to dynamically adjust the dual-tower fusion coefficients according to the domain fusion coefficient of the patent's technical field. According to the dual-tower fusion coefficients, the module semantic feature vector and the module structural feature vector are fused token by token to obtain a preliminary module fusion feature vector matrix. The preliminary module fusion feature vector matrix is filtered by the gating coefficient to obtain the module fusion feature vector matrix.
5. The method as described in claim 2, characterized in that, The task output layer is used for: Based on the module fusion feature vector matrix, predict the token at the mask position in the first probability mask and the second probability mask. Calculate the cross-entropy loss based on the token prediction result at the mask position and the actual token at the mask position. Based on the module fusion feature vector matrix, the fine-grained distance of sentence pairs in the content of each module is predicted. The mean squared error loss is calculated based on the predicted fine-grained distance and fine-grained distance label of each sentence pair. The fine-grained distance is adaptively divided according to the total number of sentences in the module text. Based on the module fusion feature vector matrix, predict the module text to which each token belongs in each module text. Calculate the classification loss based on the predicted module text to which the token belongs and the patent structure embedding vector. Based on the fusion of cross-entropy loss, mean squared error loss, and classification loss, the parameters of the patent feature extraction model are optimized through backpropagation.
6. The method as described in claim 1, characterized in that, Extract the feature word set of user needs. Based on the technical field and word frequency distribution corresponding to the user needs, dynamically adjust the parameters of the BM25 algorithm. Using the parameter-adjusted BM25 algorithm, filter patent texts from the patent feature library to be matched based on the feature word set to obtain a coarsely ranked patent text set, including: The token sequence corresponding to the user's needs is segmented based on the patent terminology definition table, and the n-gram algorithm is used to assist in the segmentation. All segmented words are added to the feature word set. Based on the patent terminology definition table, the technical terms in the word segmentation are expanded with synonyms and near-synonyms, and then added to the feature word set; Based on the technical field and word frequency distribution corresponding to user needs, dynamically adjust the word frequency saturation coefficient and document length normalization coefficient of the BM25 algorithm; The BM25 algorithm with adjusted parameters is used. The inverse document frequency of each feature word is calculated using different formulas based on whether each feature word in the feature word set exists in the patent feature library to be matched. The relevance score between the feature word set and each patent text in the patent feature library to be matched is calculated based on the inverse document frequency of each feature word. The patent texts are selected from the highest to the lowest relevance scores to obtain a coarsely ranked set of patent texts.
7. The method as described in claim 1, characterized in that, Based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set, including: Extract the domain embedding vector of each patent text in the coarse-ranked patent text set, and calculate the domain similarity between each domain embedding vector and the domain embedding vector in the module fusion feature vector matrix corresponding to the user needs; Patent texts with a domain similarity below the similarity threshold are removed from the coarse-ranked patent text set; The term frequency-inverse document frequency (TF-IDF) scores of the module fusion feature vector matrix corresponding to user needs and of the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set are optimized to obtain the user-side TF-IDF feature matrix and the TF-IDF feature matrix of each patent text. Based on the user-side term frequency-inverse document frequency feature matrix and the token attention weights in the module fusion feature vector matrix corresponding to user needs, the fused token attention weights for each patent text are calculated. Based on the user-side term frequency-inverse document frequency feature matrix, the term frequency-inverse document frequency feature matrix of each patent text, and the fused token attention weight of each patent text, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
8. The method as described in claim 7, characterized in that, Based on the token attention weights in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to user needs, the fused token attention weights for each patent text are calculated, including: Calculate the user-side weighted attention weight matrix based on the token attention weights in the user-side term frequency-inverse document frequency feature matrix and the module fusion feature vector matrix corresponding to user needs; Calculate the weighted attention weight matrix for each patent text based on the term frequency-inverse document frequency feature matrix and the token attention weight in the module fusion feature vector matrix of each patent text; The user-side weighted attention weight matrix is fused with the weighted attention weight matrix of each patent text to obtain the fused token attention weight for each patent text.
9. The method as described in claim 7, characterized in that, The term frequency-inverse document frequency (TF-IDF) scores of the module fusion feature vector matrix corresponding to user needs and of the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set are optimized to obtain the user-side TF-IDF feature matrix and the TF-IDF feature matrix of each patent text, including: Based on the module fusion feature vector matrix corresponding to the user's needs, the first token frequency matrix is generated. Based on the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set, a second token frequency matrix is generated for each patent text. Based on the preset patent module weight matrix, the first token frequency matrix and the second token frequency matrix are weighted respectively to obtain the third token frequency matrix and the fourth token frequency matrix. Log smoothing is performed on the third and fourth token frequency matrices respectively to obtain the fifth and sixth token frequency matrices. Threshold limiting is applied to the fifth and sixth token frequency matrices respectively to obtain the seventh and eighth token frequency matrices; Based on the seventh token frequency matrix and the eighth token frequency matrix, generate the user-side term frequency-inverse document frequency feature matrix and the term frequency-inverse document frequency feature matrix of each patent text.
10. The method as described in claim 7, characterized in that, Based on the user-side term frequency-inverse document frequency feature matrix, the term frequency-inverse document frequency feature matrix of each patent text, and the fused token attention weight of each patent text, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set, including: Embed the fused token attention weight of each patent text into the module fusion feature vector matrix corresponding to the user needs and the module fusion feature vector matrix of each patent text in the coarse-ranked patent text set to obtain the enhanced module fusion feature vector matrix corresponding to the user needs and the enhanced module fusion feature vector matrix of each patent text. Calculate the similarity score between the enhanced module fusion feature vector matrix corresponding to the user needs and the enhanced module fusion feature vector matrix of each patent text. A second number of patent texts are selected in descending order of similarity score to obtain the fine-ranked patent text set.
11. A patent matching device based on user needs, characterized in that the device comprises: The patent feature library construction module is used to input historical patent texts into the patent feature extraction model, generate multiple module fusion feature vector matrices for each historical patent text, vectorize all module fusion feature vector matrices, and store them in the patent feature library to be matched. The patent feature extraction model is built on a bidirectional encoder representation model based on the Transformer architecture, and integrates a term frequency-inverse document frequency mechanism and a multi-granularity sentence distance supervision mechanism. The coarse ranking module is used to extract the feature word set of user needs after obtaining user needs, dynamically adjust the parameters of the BM25 algorithm according to the technical field and word frequency distribution corresponding to the user needs, and use the parameter-adjusted BM25 algorithm to filter patent texts from the patent feature library to be matched according to the feature word set to obtain a coarse-ranked patent text set. The coarse-ranked patent text set includes the module fusion feature vector matrices of the selected patent texts. The fine-ranking module is used to generate a module fusion feature vector matrix corresponding to user needs based on the patent feature extraction model. Based on the module fusion feature vector matrix corresponding to user needs, patent texts are filtered from the coarse-ranked patent text set to obtain the fine-ranked patent text set.
12. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method of any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method of any one of claims 1 to 10.
14. A computer program product, characterized in that, The computer program product includes a computer program that, when executed by a processor, implements the method of any one of claims 1 to 10.