An enhanced retrieval-augmented generation system and method for the legal domain

By constructing a multi-source heterogeneous legal knowledge base, refining legal text processing, and enhancing retrieval with a graph structure, the RAG system addresses the lack of personalization, limited retrieval accuracy, and missing systematic understanding in the legal field, thereby improving the personalization and interpretability of legal services.

CN122240780A · Pending · Publication Date: 2026-06-19 · TONGFANG KNOWLEDGE DIGITAL PUBLISHING TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TONGFANG KNOWLEDGE DIGITAL PUBLISHING TECH CO LTD
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing RAG systems suffer from problems in the legal field, such as a single and unpersonalized knowledge base, crude legal text processing, insufficient retrieval relevance, and a lack of systematic legal understanding, resulting in insufficient accuracy and interpretability of legal services.

Method used

The invention constructs a multi-source heterogeneous legal knowledge base and employs refined legal text processing, a multi-stage retrieval and reordering mechanism, and a graph structure enhancement module. Combined with models and knowledge graphs optimized for the legal field, these enhance personalized service capabilities and the retrieval accuracy of legal texts.

Benefits of technology

It has achieved personalized enhancements in legal services, improved search accuracy and relevance, enhanced the depth and systematic understanding of legal reasoning, and provided more interpretable answers.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of artificial intelligence application technology and specifically relates to an enhanced retrieval and generation system and method for the legal domain. The system includes: a multi-source heterogeneous legal knowledge base construction and management module, used to dynamically adjust the retrieval weights of public and personal knowledge bases based on task type; a legal text refinement module, used for structure recognition, semantic-aware segmentation, and structured metadata annotation of legal text; a multi-stage retrieval and reordering module, used to perform multi-stage retrieval and reordering using a text embedding model and a cross-encoder model fine-tuned for the legal domain; a graph structure enhancement module, used to expand the context of retrieval results or recalibrate their relevance using a legal knowledge graph; and a modular process execution engine, used to schedule the above modules and execute an end-to-end process from legal question to answer generation. The method uses the system to perform retrieval-enhanced generation for legal questions. This invention significantly improves the factual consistency, reasoning depth, and personalized service capabilities of large language models in complex legal scenarios.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence application technology, specifically to an enhanced retrieval and generation system and method for the legal field, in order to improve the accuracy, interpretability, and personalization capabilities of large language models in tasks such as legal question answering, legal text generation, compliance analysis, and case retrieval.

Background Technology

[0002] Retrieval-augmented generation (RAG) technology, by combining external knowledge bases with the generative capabilities of Large Language Models (LLMs), effectively mitigates the "hallucination" problem that LLMs may exhibit and improves the accuracy and traceability of generated content. However, directly applying general-purpose RAG systems to the legal field faces numerous challenges:

1. Limited and unpersonalized knowledge base: Existing systems typically use a single, static public knowledge base, which cannot effectively integrate the private knowledge of lawyers or law firms, such as internal case notes and client contract templates, making it difficult to provide personalized legal services and decision support based on historical experience.

[0003] 2. Crude Legal Text Processing: Legal texts are characterized by rigorous structure, high semantic density, and numerous technical terms. Common fixed-length text segmentation methods easily disrupt the inherent logical structure of legal texts, such as severing the argumentative relationship between "the plaintiff's claims" and "the court holds," resulting in incomplete context and decreased relevance of retrieved text fragments.

[0004] 3. Insufficient Relevance: Relying solely on vector models trained on general corpora for semantic similarity retrieval makes it difficult to accurately match legal issues with complex and specialized legal grounds. The lack of a deep re-ranking mechanism for legal semantics results in search results that may contain superficially relevant but essentially invalid fragments.

[0005] 4. Lack of systematic legal understanding: Existing RAG systems usually provide legal provisions or cases as isolated knowledge points to the model. The model lacks an understanding of the systematic relationships between legal entities such as provisions, concepts, and cases, including their inherent reference, application, and conflicts, which affects the depth and accuracy of complex legal reasoning.

[0006] Therefore, there is an urgent need for a RAG system architecture that is comprehensively enhanced to address the characteristics of the legal field, in order to solve the above problems.

Summary of the Invention

[0007] One of the objectives of this invention is to address the problem of a single and unpersonalized knowledge base when general RAG systems are directly applied to the legal field. This invention provides an enhanced retrieval and generation system and method for the legal field, which constructs and integrates a multi-source heterogeneous legal knowledge base that distinguishes between public and personal knowledge bases, and supports dynamic priority configuration to improve the personalization and practicality of legal services.

[0008] The second objective of this invention is to address the problem of crude legal text processing encountered when general-purpose RAG systems are directly applied to the legal field. It provides an enhanced retrieval and generation system and method for the legal field, which avoids contextual breaks by adopting a refined semantic perception segmentation strategy that combines legal text structure, thereby improving the relevance and completeness of retrieval fragments.

[0009] The third objective of this invention is to address the problem of insufficient retrieval relevance when general-purpose RAG systems are directly applied to the legal field, and to provide an enhanced retrieval and generation system and method for the legal field. This system improves the ability to accurately locate high-value evidence in massive amounts of legal knowledge by introducing a multi-stage retrieval and reordering mechanism based on models fine-tuned for the legal field.

[0010] The fourth objective of this invention is to address the problem of a lack of systematic legal understanding when general-purpose RAG systems are directly applied to the legal field. It provides an enhanced retrieval and generation system and method for the legal field, which, by integrating GraphRAG capabilities, constructs and utilizes the relationships between legal entities to enhance the systematic understanding and reasoning ability of large language models of legal knowledge. This comprehensively improves the factual consistency, reasoning depth, and interpretability of tasks such as legal question answering, legal text generation, compliance analysis, and case retrieval.

[0011] To achieve the aforementioned objectives, in one aspect, the present invention provides an enhanced retrieval and generation system for the legal field, comprising:
a multi-source heterogeneous legal knowledge base construction and management module, used to construct and differentiate between public and personal knowledge bases, and to configure a knowledge source weight mechanism that dynamically adjusts the retrieval weights of the two types of knowledge bases based on task type;
a legal text refinement module, used to perform structural recognition and semantic-aware segmentation on legal texts, and to annotate the segmented text blocks with structured metadata;
a multi-stage retrieval and reordering module, which first uses a text embedding model optimized for the legal domain to perform initial vector similarity screening, and then uses a cross-encoder model fine-tuned for the legal domain to reorder the initial screening results by relevance;
a graph structure enhancement module, used to construct a legal knowledge graph based on entities and relationships extracted from legal texts, and to use this graph after retrieval to expand the context of the retrieval results or recalibrate their relevance;
a modular workflow execution engine, used to schedule the above modules and execute the end-to-end workflow from document ingestion, processing, and indexing to query response and answer generation.

[0012] As one feasible approach, the knowledge source weighting mechanism is configurable, allowing users or the system to dynamically adjust the retrieval weights of the two types of knowledge bases based on task type. Specifically, this includes: determining the task type of the task submitted by the user; presetting the public knowledge base recall quota ratio α according to the determined task type, with the corresponding personal knowledge base recall quota ratio being 1-α, where α∈[0,1]; and, based on the recall quota ratios of the public knowledge base and the personal knowledge base, performing retrieval for the user's task in the public knowledge base and / or the personal knowledge base.

[0013] As one possible approach, task type determination can be achieved through any of the following methods: Method 1: Automatically classify task types using a large language model; Method 2: Fine-tune BERT to create a small classification model for automatic task classification; Method 3: The user manually selects the option on the front-end interface.

[0014] As one feasible approach, the legal text refinement module includes: The legal text structure recognition unit has a built-in legal text structure parsing rule library, which is used to accurately identify standard logical paragraphs in legal texts by combining keyword matching, typesetting features, contextual semantic coherence and paragraph length distribution multi-dimensional clues. Semantic-aware segmentation units are used to segment legal texts using standard logical paragraphs as basic units, and to perform secondary aggregation sub-segmentation on ultra-long paragraphs based on the semantic similarity between sentences. Legal element protection units are used to enforce a non-segmentation strategy on units in legal texts that have independent legal significance, thereby ensuring their integrity. Multi-granularity text blocks and metadata annotation units are used to generate multi-granularity text blocks from the same legal text and to annotate the text blocks with structured metadata.

[0015] As one possible approach, the multi-stage retrieval and reordering module incorporates a multi-stage retrieval and reordering mechanism, performing the process of finding and filtering the most relevant text blocks from the knowledge base, including two sequential stages:

Phase 1: Initial screening using vector retrieval based on vector similarity
All segmented text blocks in the knowledge base are converted in advance into document vectors using a text embedding model optimized for the legal domain, and then stored in a vector database. The user query is transformed into a query vector using the same text embedding model. In the vector database, the cosine similarity between the query vector and all document vectors is calculated to quickly recall the top N candidate text blocks with the highest similarity.

Phase 2: Fine-grained reordering based on cross-encoders
The fine-tuned cross-encoder model is used to reorder the N candidate text blocks recalled in the first stage, specifically including the following steps: The user query is concatenated with each candidate text block to form an input sequence, which is then fed into the fine-tuned cross-encoder model for joint encoding, resulting in a re-scoring score representing the relevance of each candidate text block. All candidate blocks are re-ranked based on the re-scoring scores, and the top K text blocks are selected as high-value contexts and fed into the large language model to generate the final answer.

[0016] As one possible approach, the optimized text embedding model in the legal domain is obtained by further pre-training and fine-tuning the general text embedding model on a large-scale legal corpus. The goal of fine-tuning is to make legal expressions with similar semantics closer in the vector space, while legal expressions that are superficially similar but semantically different are farther apart in the vector space. The text embedding model optimized for the legal domain employs a contrastive learning strategy during training, where positive sample pairs consist of legal issues and related text blocks, while negative sample pairs consist of legal issues and unrelated text blocks.

[0017] As one possible approach, the fine-tuned cross-encoder model is obtained by fine-tuning the general cross-encoder model on labeled data in the legal field. The legal domain labeled data consists of the following three parts: positive samples: "legal issues - relevant text blocks" pairs; relevant text blocks are text blocks in the knowledge base that are related to legal issues; random negative samples: "legal issues - irrelevant text blocks" pairs; irrelevant text blocks are text blocks in the knowledge base that are not related to legal issues; hard negative samples: "legal issues - superficially relevant but not substantially supported text blocks" pairs; superficially relevant but not substantially supported text blocks are text blocks that were recalled through the first-stage vector retrieval but were judged to be irrelevant by humans; all samples have been reviewed and labeled by legal professionals. The training process for fine-tuning the general cross-encoder model includes: combining and concatenating each legal question and its corresponding text block into a single input sequence; the general cross-encoder model outputs a scalar score; The training objective uses a pairwise ranking loss function, where positive samples score higher than negative samples for the same problem. Training is performed on a dedicated legal GPU cluster, using tens of thousands to hundreds of thousands of labeled data pairs, and undergoes multiple iterations until the ranking metric on the validation set converges.
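
The pairwise ranking objective described above can be illustrated with a small sketch. The text does not specify the exact loss, so the margin-based (hinge) form, the function name `pairwise_hinge_loss`, and the margin value are assumptions made only for illustration; the scores stand in for the scalar outputs of the cross-encoder.

```python
from typing import Sequence


def pairwise_hinge_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 0.5,  # assumed value, not taken from the text
) -> float:
    """Mean hinge loss over (positive, negative) score pairs for the same question.

    Each pair contributes max(0, margin - (s_pos - s_neg)), which is zero once
    the positive block outscores the negative block by at least `margin`.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need equally sized, non-empty score lists")
    losses = [
        max(0.0, margin - (s_pos - s_neg))
        for s_pos, s_neg in zip(pos_scores, neg_scores)
    ]
    return sum(losses) / len(losses)
```

In this form, hard negative samples (blocks that scored well in first-stage retrieval but were judged irrelevant) contribute the largest losses until the model learns to separate them from the positive samples.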

[0018] As one possible implementation method, the graph structure enhancement module includes: The legal knowledge graph construction unit is used to extract legal entities from text blocks using a finely tuned named entity recognition model in the legal domain; identify semantic relationships between legal entities by combining rules and models; and construct a legal knowledge graph with legal entities as nodes and semantic relationships as edges. The graph query and context enhancement unit is used to extract legal entities from the reordered text blocks, obtain semantic paths between legal entities by querying the legal knowledge graph to expand the context, and / or use graph neural networks to generate graph enhancement features based on entity nodes and their neighbors to perform secondary recalibration of the relevance of the text blocks.

[0019] To achieve the above objectives, in a second aspect, the present invention also provides an enhanced retrieval and generation method for the legal field, which, using the aforementioned enhanced retrieval and generation system for the legal field, includes the following steps:
S1: Document Processing and Indexing: Obtain legal documents and distinguish their public or personal attributes; perform structural recognition and semantic-aware segmentation to generate text blocks with metadata; vectorize the text blocks using a legal domain-optimized text embedding model and store them in a vector database index according to knowledge base type;
S2: Query Processing: Receive user queries, determine their task type, and determine the proportion of retrieval from public and personal knowledge base indexes based on preset weights;
S3: Retrieval and Reordering: First, recall candidate text blocks from a specified index using vector similarity; then, use a legal domain-fine-tuned cross-encoder model to re-score and sort the candidate blocks based on relevance, selecting the K most relevant text blocks;
S4: Graph Augmentation: Extract legal entities from the K text blocks, query the legal knowledge graph, and obtain legal entity association information to enhance the generation context;
S5: Answer Generation: Combine the K text blocks with the legal entity association information obtained from the legal knowledge graph to form an enhanced context, input it into a large language model to generate the final answer, and attach citation traceability information for the text blocks used.

[0020] As one possible approach, in S2, the task types include at least: legal and regulatory provisions search, similar case retrieval, legal text drafting, and compliance risk assessment, with each type of task corresponding to a different public knowledge base recall quota ratio α.

[0021] As one possible approach, in S4, obtaining legal entity association information includes: finding the shortest semantic path between legal entity nodes in the legal knowledge graph and converting the path information into natural language text.
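
The path lookup in S4 can be sketched as a breadth-first search over a small in-memory graph. This is only a minimal illustration: the adjacency-list representation, the function names `shortest_path` and `verbalize`, and the entity names in the usage note are hypothetical, and edges are followed in one direction only.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# node -> list of (relation, neighbour) edges; a stand-in for the knowledge graph
Graph = Dict[str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]


def shortest_path(graph: Graph, start: str, goal: str) -> Optional[List[Triple]]:
    """Breadth-first search; returns the path as (head, relation, tail) triples."""
    if start == goal:
        return []
    parent: Dict[str, Optional[Tuple[str, str]]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for relation, neighbour in graph.get(node, []):
            if neighbour in parent:
                continue  # already reached by a path that is at least as short
            parent[neighbour] = (node, relation)
            if neighbour == goal:
                path: List[Triple] = []
                current = goal
                while parent[current] is not None:
                    previous, rel = parent[current]
                    path.append((previous, rel, current))
                    current = previous
                return path[::-1]
            queue.append(neighbour)
    return None  # the two entities are not connected


def verbalize(path: List[Triple]) -> str:
    """Turn a path into plain text that can be appended to the LLM context."""
    if not path:
        return ""
    return "; ".join(f"{head} {rel} {tail}" for head, rel, tail in path) + "."
```

For a hypothetical graph `{"Case A": [("cites", "Provision X")], "Provision X": [("defines", "Concept Y")]}`, calling `verbalize(shortest_path(graph, "Case A", "Concept Y"))` yields the sentence "Case A cites Provision X; Provision X defines Concept Y."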

[0022] Compared with the prior art, the present invention has the following beneficial technical effects: 1. Significantly enhanced personalized service capabilities: By distinguishing and integrating public and personal knowledge bases and supporting dynamic configuration based on task type, the system can provide authoritative public legal information while deeply integrating users' historical experience and private data, achieving truly personalized legal assistance.

[0023] 2. Significantly Improved Retrieval Accuracy and Relevance: First, the refined segmentation strategy based on the structure of legal documents ensures the semantic integrity and logical consistency of the retrieved fragments. Second, the two-stage retrieval and reordering, based on a text embedding model and a cross-encoder model fine-tuned for the legal domain, greatly enhances the ability to accurately locate the core legal basis in professional texts and reduces interference from irrelevant or low-value information.

[0024] 3. Enhanced Depth and Systematic Understanding of Legal Reasoning: By integrating GraphRAG capabilities, the system not only provides isolated legal provisions or cases but also reveals the inherent connections between legal knowledge. This enables the large language model to perform deeper logical reasoning, understand the applicable context of legal provisions, conflict resolution, and the evolutionary relationships between cases, generating answers that are more legally rigorous and persuasive.

[0025] 4. Improved system interpretability and controllability: The entire process is modular and traceable. Answers include clear source citations, allowing users to trace the retrieval process. Knowledge base weights, task types, and other parameters are configurable, making system behavior more transparent and controllable.

Attached Figure Description

[0026] Figure 1 is a schematic diagram of an embodiment of the enhanced retrieval and generation system for the legal field according to the present invention; Figure 2 is a flowchart illustrating an embodiment of the enhanced retrieval and generation method for the legal field according to the present invention.

Detailed Implementation

[0027] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein in the specification of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

[0028] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0029] The technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings and specific embodiments.

[0030] This invention provides an embodiment of an enhanced retrieval and generation system for the legal field, which significantly improves the factual consistency, reasoning depth, and personalized service capabilities of large language models in complex legal scenarios while ensuring legal rigor.

[0031] See Figure 1. The enhanced retrieval and generation system for the legal field described in this embodiment includes:
a multi-source heterogeneous legal knowledge base construction and management module, used to construct and differentiate between public and personal knowledge bases, and to configure a knowledge source weight mechanism that dynamically adjusts the retrieval weights of the two types of knowledge bases based on task type;
a legal text refinement module, used to perform structural recognition and semantic-aware segmentation on legal texts, and to annotate the segmented text blocks with structured metadata;
a multi-stage retrieval and reordering module, which first uses a text embedding model optimized for the legal domain to perform initial vector similarity screening, and then uses a cross-encoder model fine-tuned for the legal domain to reorder the initial screening results by relevance;
a graph structure enhancement module, used to construct a legal knowledge graph based on entities and relationships extracted from legal texts, and to use this graph after retrieval to expand the context of the retrieval results or recalibrate their relevance;
a modular workflow execution engine, used to schedule the above modules and execute the end-to-end workflow from document ingestion, processing, and indexing to query response and answer generation.

[0032] In this embodiment, as one possible approach, the public knowledge base stores publicly available legal data, including but not limited to the full text of laws and regulations, guiding cases of the Supreme People's Court, publicly available judgments and judicial interpretations. The full text of laws and regulations includes but is not limited to the full text of the Civil Code. The data sources of the public knowledge base include but are not limited to official databases, publicly available legal APIs from CNKI, and open-source legal corpora. Official databases include but are not limited to China Judgments Online. The public knowledge base is accessible to all users. The personal knowledge base stores users' private legal data, including but not limited to lawyers' private case notes, client information, internal memos, internal precedent notes, and unpublished agency opinions. Client information includes but is not limited to client contract templates. The data in the personal knowledge base comes from user uploads or synchronization with the law firm's internal system. Access to the personal knowledge base is limited to the user or their authorized team. The two types of knowledge bases can be physically or logically isolated and managed by configuring different storage database tables or storage partitions, thereby differentiating them for retrieval.

[0033] In this embodiment, as one possible implementation method, the knowledge source weight mechanism is configurable, allowing users or the system to dynamically adjust the retrieval weights of the two types of knowledge bases according to the task type, specifically including: determining the task type of the task submitted by the user; presetting the public knowledge base recall quota ratio α according to the determined task type, with the corresponding personal knowledge base recall quota ratio being 1-α, where α∈[0,1]; and, based on the recall quota ratios of the public knowledge base and the personal knowledge base, performing retrieval for the user's task in the public knowledge base and / or the personal knowledge base.

[0034] In this embodiment, as one possible approach, the task type determination can be achieved through any of the following methods: Method 1: Automatically classify task types using a large language model; Method 2: Fine-tune BERT to create a small classification model for automatic task classification; Method 3: The user manually selects the option on the front-end interface.

[0035] In this embodiment, as one possible approach, the task types include, but are not limited to, legal and regulatory provisions search, similar case retrieval, legal text drafting, and compliance risk assessment. For searches of legal provisions, such as "What is the content of Article 584 of the Civil Code?", this type of task does not require personalization and is purely public knowledge. Therefore, α=1.0 is set, and the search will be 100% conducted from the public knowledge base. For similar case searches, such as "judgments on similar equity transfer disputes in Beijing", this type of task mainly relies on public precedents, supplemented by personal experience. With α=0.7, 70% of the searches are conducted from public knowledge bases and 30% from personal knowledge bases. For legal text drafting, such as "drafting a confidentiality agreement", this type of task prioritizes using lawyers' own templates, setting α=0.3, with 70% of the data retrieved from personal knowledge bases and 30% from public knowledge bases; For compliance risk assessment, this type of task needs to integrate industry general rules with the client's past violation records, setting α=0.4, with 40% of the data retrieved from public knowledge bases and 60% from personal knowledge bases.
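
As a minimal sketch of how the recall quota ratio α could be turned into per-knowledge-base recall counts, the example below uses the α values given in this embodiment. The task-type keys, the function name `split_recall_quota`, and the rounding rule are illustrative assumptions rather than part of the described system.

```python
from typing import Dict, Tuple

# Public knowledge base recall quota ratio (alpha) per task type.
# Values follow the examples in this embodiment; the key names are hypothetical.
ALPHA_BY_TASK: Dict[str, float] = {
    "provision_search": 1.0,
    "similar_case_retrieval": 0.7,
    "legal_text_drafting": 0.3,
    "compliance_risk_assessment": 0.4,
}


def split_recall_quota(task_type: str, total: int) -> Tuple[int, int]:
    """Return (public_quota, personal_quota) for a recall budget of `total` blocks."""
    alpha = ALPHA_BY_TASK[task_type]
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    public_quota = round(alpha * total)  # assumed rounding rule
    return public_quota, total - public_quota
```

With a recall budget of 100 candidate blocks, `split_recall_quota("similar_case_retrieval", 100)` assigns 70 blocks to the public knowledge base and 30 to the personal knowledge base, matching the α=0.7 example above.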

[0036] In this embodiment, as one possible approach, the legal text refinement processing module includes: a legal text structure recognition unit, a semantic-aware segmentation unit, a legal element protection unit, and a multi-granularity text block generation and metadata annotation unit.

[0037] In this embodiment, as one possible approach, the legal text structure recognition unit has a built-in legal text structure parsing rule base. The rule base is built on structural analysis of a large number of publicly available legal texts and on structural annotations by legal professionals. It records the title, format, layout features, contextual semantics and length distribution of standard logical paragraphs in various types of legal texts such as judgments, rulings, mediation agreements, and complaints. The layout features include font, indentation and paragraph spacing, and the format includes indentation, line breaks and unnumbered lists. Based on this rule base, the legal text structure recognition unit combines multiple clues such as keyword matching, typesetting features, contextual semantic coherence, and paragraph length distribution to accurately identify standard logical paragraphs in legal texts. For example, when the phrase "plaintiff's claim" is detected in a paragraph, the unit also considers the format of the paragraph that follows it and marks the paragraph as a "plaintiff's claim paragraph"; when a phrase such as "This Court holds" appears in a paragraph, the unit marks the paragraph as a "legal reasoning paragraph"; when the phrase "the judgment is as follows" is detected in a paragraph, the unit identifies the content following that phrase as the "main text of the judgment". Because the unit does not rely on keyword matching alone but also uses multi-dimensional clues such as typesetting features, contextual semantic coherence, and paragraph length distribution, structure recognition is more robust, and segmentation remains accurate even for legal texts with slightly different formats.
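
The keyword-matching clue alone can be sketched as follows. This is a deliberately reduced illustration: the rule list, the label names, and the choice to let a paragraph without a cue phrase inherit the previous label are assumptions, and the layout, semantic-coherence, and length clues described above are omitted.

```python
import re
from typing import List, Tuple

# Hypothetical keyword rules based on the cue phrases quoted in this paragraph.
# A real rule base would also weigh layout, coherence and length features.
RULES: List[Tuple[str, re.Pattern]] = [
    ("plaintiff_claim", re.compile(r"plaintiff's claim", re.IGNORECASE)),
    ("legal_reasoning", re.compile(r"this court holds", re.IGNORECASE)),
    ("judgment_main_text", re.compile(r"the judgment is as follows", re.IGNORECASE)),
]


def label_paragraphs(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Attach a structural label to each paragraph.

    A paragraph containing a cue phrase starts a new section; a paragraph
    without one inherits the label of the preceding section (an assumption).
    """
    labeled: List[Tuple[str, str]] = []
    current = "unlabeled"
    for paragraph in paragraphs:
        for label, pattern in RULES:
            if pattern.search(paragraph):
                current = label
                break
        labeled.append((current, paragraph))
    return labeled
```

Keyword rules of this kind are brittle on their own, which is why the unit combines them with the other clues before committing to a label.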

[0038] In this embodiment, as one possible approach, considering the characteristics of legal texts being highly specialized and having high semantic density, the semantic-aware segmentation unit adopts a refined document segmentation strategy. Combined with the structure of the legal text, it performs semantic-aware segmentation of the legal text, avoiding the context breakage problem caused by traditional fixed-length segmentation, thereby improving retrieval relevance. Specifically, the semantic perception segmentation unit uses the standard logical paragraphs identified by the legal text structure recognition unit as the basic unit to segment the legal text. For example, a "court ascertained" paragraph, regardless of its length, will be retained in the same text block. Building on this foundation, for extremely long paragraphs, such as complex fact-finding paragraphs exceeding 2,000 words, the semantic-aware segmentation unit will further employ a secondary segmentation strategy based on semantic coherence. First, it will segment the paragraphs according to natural sentences. Then, it will use a semantic vector model trained on a legal corpus to calculate the semantic similarity between adjacent sentences. Finally, it will aggregate semantically related consecutive sentences into a sub-block, ensuring that each sub-block revolves around the same factual focus or legal dispute. At the same time, it will retain a small amount of overlapping content at the boundaries of the sub-blocks, such as the last 1-2 sentences repeating at the beginning of the next sub-block, to maintain contextual coherence.
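
The secondary segmentation of an ultra-long paragraph can be sketched as below. The bag-of-words `embed` function is only a toy stand-in for the semantic vector model trained on a legal corpus, the sentence splitter handles English punctuation only, and the `threshold` and `overlap` defaults are assumed values.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector; a placeholder for a legal-domain semantic model."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sub_segment(paragraph: str, threshold: float = 0.2, overlap: int = 1) -> List[List[str]]:
    """Group consecutive sentences into sub-blocks.

    A new sub-block starts where the similarity between adjacent sentences
    falls below `threshold`; it begins with the last `overlap` sentences of
    the previous sub-block to keep some shared context.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    if not sentences:
        return []
    blocks: List[List[str]] = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) >= threshold:
            blocks[-1].append(current)
        else:
            carried = blocks[-1][-overlap:] if overlap > 0 else []
            blocks.append(carried + [current])
    return blocks
```

In practice this step would only be applied to paragraphs above the length limit (for example the 2,000-word fact-finding paragraphs mentioned above), while shorter standard logical paragraphs stay in a single block.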

[0039] In this embodiment, as one possible approach, the legal element protection unit implements a mandatory non-segmentation strategy for units in the legal text that have independent legal significance, such as complete legal citations, paragraphs containing party information, lists of evidence, and key points of judgments, to ensure their integrity.

[0040] For example, regarding legal citations, "Article 584 of the Civil Code of the People's Republic of China" and its immediately following explanatory text are retained as a whole; The information paragraphs for the parties involved, such as the paragraphs describing the identities of the plaintiff, defendant, and third party, should be kept intact. The evidence list, such as "Evidence 1: XXXX; Evidence 2: XXXX", is treated as a single logical unit. In guiding cases, the "Key Points of Judgment" section is presented in a separate block for easy retrieval.

[0041] In this embodiment, as one possible approach, the multi-granularity text block and metadata annotation unit generates text blocks of multiple granularities from the same legal text in order to accommodate different retrieval needs. For example, coarse-grained blocks, such as the entire "Court Reasoning" section, suit queries that require an overall understanding of the judgment logic, while fine-grained blocks, such as the analysis paragraph for a single legal dispute, suit queries that need to locate a specific legal viewpoint precisely. The unit annotates each text block with structured metadata, including block type, cause of action, list of cited legal provisions, legal level, and region. Block types include "plaintiff's claim", "applicable law", and "fact finding"; causes of action include "sales contract dispute" and "labor dispute". This metadata is used for subsequent retrieval filtering and reordering, and also provides node attributes for graph structure construction.
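
As a non-limiting sketch, the structured metadata listed above might be carried on each text block as shown below. The class name `TextBlock`, its field names, the `granularity` field, and the `filter_blocks` helper are assumptions made for illustration and are not defined in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextBlock:
    """A text block plus the structured metadata named in the description."""
    text: str
    block_type: str                 # e.g. "fact finding", "applicable law"
    cause_of_action: str            # e.g. "sales contract dispute"
    cited_provisions: List[str] = field(default_factory=list)
    legal_level: str = ""           # level of the cited law or issuing court
    region: str = ""
    granularity: str = "fine"       # "coarse" or "fine" (assumed labels)


def filter_blocks(
    blocks: List[TextBlock],
    block_type: Optional[str] = None,
    cause_of_action: Optional[str] = None,
    cited_provision: Optional[str] = None,
) -> List[TextBlock]:
    """Keep only blocks whose metadata matches every filter that is set."""
    result = []
    for block in blocks:
        if block_type is not None and block.block_type != block_type:
            continue
        if cause_of_action is not None and block.cause_of_action != cause_of_action:
            continue
        if cited_provision is not None and cited_provision not in block.cited_provisions:
            continue
        result.append(block)
    return result
```

A metadata filter of this kind could be applied before or after vector recall; the specification states only that the metadata is used for retrieval filtering and reordering.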

[0042] In this embodiment, as one possible implementation, the multi-stage retrieval and reordering module incorporates a multi-stage retrieval and reordering mechanism that searches the knowledge base for the most relevant text blocks and filters them, in two sequential stages:
Phase 1: Initial screening by vector retrieval based on vector similarity. All segmented text blocks in the knowledge base are converted in advance into document vectors using a text embedding model optimized for the legal domain and stored in a vector database. The user query is converted into a query vector using the text embedding model optimized for the legal domain. In the vector database, the cosine similarity between the query vector and all document vectors is calculated to quickly recall the top N=100 candidate text blocks with the highest similarity.
Phase 2: Fine-grained reordering based on a cross-encoder. The fine-tuned cross-encoder model is used to re-rank the N=100 candidate text blocks recalled in the first stage. The specific steps are as follows. The user query and each candidate text block are concatenated into an input sequence, which is fed into the fine-tuned cross-encoder model for joint encoding. This yields a re-scoring score for each candidate text block that represents its relevance. The re-scoring score lies between 0 and 1 and reflects whether the candidate text block truly answers the user query, contains a valid legal basis, and matches logically. All candidate blocks are re-ranked by their re-scoring scores, and the top K text blocks (K = 5 to 10) are selected as high-value context and input into the large language model to generate the final answer.
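
The two stages above may be sketched, purely for illustration, as follows. The function names are assumptions, and `toy_pair_score` is a token-overlap stand-in for the fine-tuned cross-encoder, which in the embodiment would score the concatenated query and text block jointly.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_top_n(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    n: int,
) -> List[str]:
    """Phase 1: rank block ids by cosine similarity and keep the top n."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:n]


def rerank_top_k(
    query: str,
    candidates: List[str],
    texts: Dict[str, str],
    score_pair: Callable[[str, str], float],
    k: int,
) -> List[str]:
    """Phase 2: re-score each (query, block) pair and keep the top k."""
    ranked = sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)
    return ranked[:k]


def toy_pair_score(query: str, block: str) -> float:
    """Stand-in scorer in [0, 1]: Jaccard overlap of lowercase tokens."""
    q, b = set(query.lower().split()), set(block.lower().split())
    return len(q & b) / len(q | b) if (q | b) else 0.0
```

In use, `recall_top_n` would be called with N=100 and `rerank_top_k` with K between 5 and 10, as in the embodiment; a production system would replace the exhaustive cosine scan with the vector database's own search.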

[0043] In this embodiment, as one possible approach, the text embedding model optimized for the legal domain is obtained by further pre-training and fine-tuning a general text embedding model on a large-scale legal corpus. The goal of fine-tuning is to bring semantically similar legal expressions, such as "excessive liquidated damages" and "request for reduction of liquidated damages," closer together in the vector space, while pushing superficially similar but semantically different legal expressions, such as "contract invalid" and "contract terminated," farther apart. The legal corpus includes court judgments, laws and regulations, judicial interpretations, and legal commentaries. The general text embedding model can be bge-m3, but is not limited to this. The text embedding model optimized for the legal domain is trained with a contrastive learning strategy, in which positive sample pairs consist of legal questions and related text blocks, and negative sample pairs consist of legal questions and unrelated text blocks.
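
The specification does not name a particular contrastive objective. As one commonly used possibility, an InfoNCE-style loss over one positive pair and several negative pairs is sketched below; the choice of this loss, the function name, and the temperature of 0.05 are all assumptions for illustration.

```python
import math
from typing import Sequence


def info_nce_loss(
    sim_pos: float,
    sim_negs: Sequence[float],
    temperature: float = 0.05,  # assumed value
) -> float:
    """Negative log-probability of the positive among all candidates.

    sim_pos is the similarity between a legal question and its related
    text block; sim_negs are similarities to unrelated text blocks.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the positive pair is already much more similar than every negative pair and grows when a negative outranks the positive, which matches the stated goal of pulling related expressions together and pushing unrelated ones apart.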

[0044] In this embodiment, as one possible approach, the fine-tuned cross-encoder model is obtained by fine-tuning a general cross-encoder model on labeled data from the legal domain. The general cross-encoder model is a publicly available, pre-trained cross-encoder model, such as bge-reranker-v2-m3. Such models have basic semantic matching capabilities but lack an understanding of legal terms, reasoning logic, and legal citations, so they require domain-adaptive fine-tuning. The labeled legal-domain data consists of the following three parts:
Positive samples: real-world "legal question - related text block" pairs, such as legal questions extracted from lawyer Q&A platforms or legal consultation records, paired with the supporting paragraphs actually cited by judges in the corresponding judgment documents;
Random negative samples: "legal question - irrelevant text block" pairs, where the irrelevant text block is a text block in the knowledge base that is unrelated to the topic of the legal question;
Hard negative samples: "legal question - text block that is superficially relevant but not substantially supportive" pairs. These are text blocks that were recalled by the first-stage vector retrieval but judged irrelevant by human reviewers. For example, the legal question concerns "compensation for mental distress", and the recalled text block contains the word "compensation" but discusses property loss. Samples of this type are crucial for improving the model's discriminative power.
All samples are reviewed and labeled by legal professionals to ensure label accuracy. The training process for fine-tuning the general cross-encoder model includes the following. Each legal question and its corresponding text block are concatenated into a single input sequence in the format "[Question] User Question Content [Separator] Text Block Content". The general cross-encoder model outputs a scalar score, and the training objective is to maximize the score of positive samples and minimize the score of negative samples. The training objective uses a pairwise ranking loss function, meaning that for the same question the model learns to score positive samples higher than negative samples, rather than merely predicting absolute scores. Training is performed on a GPU cluster dedicated to the legal domain, using tens of thousands to hundreds of thousands of labeled data pairs, and runs for multiple rounds of iteration until the ranking metrics on the validation set, such as MRR, converge.
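
For illustration only, a pairwise ranking loss and the MRR validation metric mentioned above may be sketched as follows. The logistic form of the pairwise loss is an assumption, since the specification does not fix a particular formula; the function names are likewise hypothetical.

```python
import math
from typing import Dict, List, Set


def pairwise_logistic_loss(score_pos: float, score_neg: float) -> float:
    """Penalizes a negative block scoring close to or above the positive one."""
    return math.log1p(math.exp(-(score_pos - score_neg)))


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """MRR: average over questions of 1 / rank of the first relevant block."""
    if not rankings:
        return 0.0
    total = 0.0
    for question_id, ranked_blocks in rankings.items():
        for rank, block_id in enumerate(ranked_blocks, start=1):
            if block_id in relevant.get(question_id, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```

The loss depends only on the score difference within a question, which reflects the stated aim of learning relative order rather than absolute scores.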

[0045] In this embodiment, as one possible approach, the graph structure enhancement module adopts a graph-structure-enhanced context fusion strategy that integrates knowledge graph capabilities into the RAG process in order to strengthen the systematic understanding of the law. The module includes a legal knowledge graph construction unit and a graph query and context enhancement unit.

[0046] In this embodiment, as one possible implementation method, the legal knowledge graph construction unit is used to: extract legal entities, such as legal norms, legal concepts, judicial subjects, and case elements, from text blocks using a named entity recognition model fine-tuned for the legal field; identify semantic relationships between legal entities, such as "reference", "belonging", and "claim", by combining rules and models; and store the legal entities and semantic relationships in a graph database to form a legal domain knowledge graph that supports incremental updates.

[0047] In this embodiment, as one possible implementation method, the graph query and context enhancement unit is used to: after the top K text blocks are obtained from reordering, extract the legal entities in these text blocks as anchor points; search the legal domain knowledge graph for the shortest semantic path between anchor entities and convert the path information into natural language descriptions as supplementary context; and, starting from the anchor entity nodes, sample their multi-hop neighbors, use a graph neural network to generate graph augmentation vectors for the text blocks, and fuse these vectors with the original text vectors to perform a secondary relevance reordering, so as to discover important evidence that is only indirectly associated.
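
As a non-limiting sketch of the shortest-path step, the knowledge graph can be represented in memory as (head, relation, tail) triples and searched breadth-first. The function names and the treatment of edges as directed are assumptions for this illustration; the embodiment stores the graph in a graph database.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def shortest_path(triples: List[Triple], source: str, target: str) -> Optional[List[Triple]]:
    """Breadth-first search for the fewest-hop directed path from source to target."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    parent: Dict[str, Optional[Tuple[str, str]]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = (node, relation)
                queue.append(neighbor)
    if target not in parent:
        return None  # the two anchor entities are not connected
    path: List[Triple] = []
    node = target
    while parent[node] is not None:
        previous, relation = parent[node]
        path.append((previous, relation, node))
        node = previous
    path.reverse()
    return path


def describe_path(path: List[Triple]) -> str:
    """Render each hop as a short sentence to append to the LLM context."""
    return " ".join(f"{head} {relation} {tail}." for head, relation, tail in path)
```

The rendered sentences correspond to the "natural language descriptions" that serve as supplementary context; a real system would likely use relation-specific templates instead of plain concatenation.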

[0048] The graph structure enhancement module provides GraphRAG capabilities, that is, a technology that combines graph-structured knowledge with traditional RAG. It does not simply call an existing graph database. Instead, it constructs a legal knowledge graph around the relationships between legal entities and maps the retrieval results onto that graph structure. Graph neural networks or path reasoning are then used to enhance contextual relevance, so that the large language model does not merely cite isolated legal provisions but also understands the citation, conflict, or applicability relationships between them.

[0049] In this embodiment, as one possible implementation method, the graph structure enhancement module performs the following steps:
1. Construction of a legal knowledge graph
1.1 Using a named entity recognition model fine-tuned on a legal corpus, extract the following four core types of legal entities from all text blocks: legal norms, such as "Article 234 of the Criminal Law"; legal concepts, such as "intentional injury," "liquidated damages," and "bona fide acquisition"; judicial subjects, such as "Beijing Third Intermediate People's Court" and "Plaintiff Zhang San"; and case elements, such as "invalid contract" and "compensation for emotional distress".
1.2 Identify semantic relationships between legal entities through a combination of rules and models, for example: Article 584 of the Civil Code was cited in judgment No. (2023) Hu 01 Min Zhong 1234; the crime of intentional injury is punishable by imprisonment of up to three years; the plaintiff, Zhang San, claimed compensation for mental distress; the invalidity of the contract resulted in the return of property.
1.3 Storage and updating of the legal knowledge graph. All legal entities are treated as nodes and the semantic relationships between them as edges, forming a large-scale legal knowledge graph that is stored in a graph database such as Neo4j. The legal knowledge graph supports incremental updates: when new cases are added to the database, the newly added entities and relationships are automatically extracted and integrated into the existing graph. The legal knowledge graph is a domain-specific knowledge graph. Although it builds on existing entity recognition and relation extraction technologies, its entity definitions, relations, and extraction rules are all customized for the Chinese legal system, demonstrating significant domain innovation.

[0050] 2. Mapping the retrieval results to the graph structure. After the re-ranking stage of the RAG process, the system obtains not only several highly relevant text blocks but also the list of legal entities contained in these text blocks. The system uses these entities as anchor points to locate the corresponding nodes in the knowledge graph. For example, if two retrieved text blocks mention "Article 584 of the Civil Code" and "loss of expected profits" respectively, the system finds these two nodes in the graph.
3. Enhancing contextual relevance using graph neural networks or path reasoning. By systematically integrating GNNs and path reasoning into the context construction stage of legal RAG, the problem of "isolated legal citations" is addressed, giving the generated answers a stronger awareness of the legal system. This is achieved through two complementary approaches:
Method 1: Path-based context expansion. The shortest semantic path between two or more retrieved entities is searched for in the knowledge graph. For example, from "Article 584 of the Civil Code" to "loss of expected profits", the graph may reveal the path: Article 584 → stipulates → "scope of compensation for breach of contract" → includes → "loss of expected profits". The intermediate nodes and relationship descriptions along the path, such as "stipulates" and "includes", are converted into natural language prompts and appended to the original retrieved text as enhanced context for the large language model. This enables the large language model not only to see isolated legal provisions but also to understand their legal connotations and boundaries of application.
Method 2: Relevance recalibration driven by graph neural networks. The entity nodes corresponding to the Top-K text blocks selected by the multi-stage retrieval and reordering module are used as seeds for multi-hop neighbor sampling in the knowledge graph, for example all related legal provisions, cases, and concepts within 2 hops. A lightweight graph neural network, such as GraphSAGE, aggregates embeddings over these nodes to generate a graph augmentation vector for each text block. The graph augmentation vector is fused with the original text vector, and the relevance score against the user query is recalculated, achieving a secondary re-ranking based on the graph structure. This method is particularly useful for discovering indirect but important legal grounds, such as a case that does not directly cite a legal provision but is highly relevant through an intermediate concept.
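
Method 2 may be sketched, under strong simplifying assumptions, as follows. The sketch replaces a trained GraphSAGE model with an unweighted mean over the embeddings of all nodes within the hop limit, and it fuses the graph vector with the text vector by a fixed convex combination with weight `beta`; neither the mean aggregator without learned weights nor this fusion rule is specified in the source, and all names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nodes_within(adjacency: Dict[str, List[str]], seeds: Iterable[str], hops: int) -> Set[str]:
    """Seeds plus every node reachable from them in at most `hops` steps."""
    seen = set(seeds)
    frontier = set(seen)
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return seen


def graph_augmented_score(
    query_vec: Sequence[float],
    text_vec: Sequence[float],
    seed_entities: Iterable[str],
    adjacency: Dict[str, List[str]],
    node_vecs: Dict[str, Sequence[float]],
    hops: int = 2,
    beta: float = 0.3,  # assumed fusion weight for the graph vector
) -> float:
    """Recompute relevance after fusing a mean-aggregated graph vector."""
    nodes = [n for n in nodes_within(adjacency, seed_entities, hops) if n in node_vecs]
    if not nodes:
        return cosine(query_vec, text_vec)  # no graph signal: keep text score
    dim = len(text_vec)
    graph_vec = [sum(node_vecs[n][i] for n in nodes) / len(nodes) for i in range(dim)]
    fused = [(1 - beta) * t + beta * g for t, g in zip(text_vec, graph_vec)]
    return cosine(query_vec, fused)
```

In this toy form, a text block whose own vector is unrelated to the query can still gain relevance when its entities lie within two hops of nodes that are close to the query, which is the indirect-association effect the method aims for.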

[0051] In this embodiment, as one possible approach, the modular process execution engine is responsible for scheduling the above modules and executing the end-to-end processing flow. Its logical steps include: document ingestion → document parsing and structuring → semantic segmentation and metadata injection → vectorization and index construction → receiving user queries → task type judgment and knowledge base selection → multi-stage retrieval and reordering → graph structure context enhancement → large language model answer generation and tracing. Each step has clear responsibilities, is pluggable, monitorable, and traceable.
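
A minimal sketch of such a pluggable, traceable engine is given below for illustration. The class name `PipelineEngine`, its methods, and the convention that each step maps a state dictionary to a new state dictionary are assumptions, not part of the specification.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


class PipelineEngine:
    """Runs named steps in order and records which steps were executed."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Step]] = []
        self.trace: List[str] = []

    def register(self, name: str, step: Step) -> None:
        """Append a step to the end of the pipeline."""
        self.steps.append((name, step))

    def replace(self, name: str, step: Step) -> None:
        """Swap the implementation of an existing step (pluggability)."""
        for index, (existing, _) in enumerate(self.steps):
            if existing == name:
                self.steps[index] = (name, step)
                return
        raise KeyError(name)

    def run(self, state: State) -> State:
        """Execute every step in order, logging step names for tracing."""
        for name, step in self.steps:
            state = step(state)
            self.trace.append(name)
        return state
```

For example, steps named "multi-stage retrieval and reordering" and "graph structure context enhancement" could be registered in the order listed above, and either could later be swapped through `replace` without touching the others.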

[0052] Referring to Figure 2, the present invention also provides an embodiment of an enhanced retrieval and generation method for the legal field, which uses the above-described enhanced retrieval and generation system for the legal field and includes the following steps:
S1: Document Processing and Indexing: Obtain legal documents and distinguish their public or personal attributes; perform structural recognition and semantic-aware segmentation to generate text blocks with metadata; vectorize the text blocks using a text embedding model optimized for the legal domain and store them in a vector database index according to knowledge base type;
S2: Query Processing: Receive a user query, determine its task type, and determine the proportion of retrieval from the public and personal knowledge base indexes based on preset weights;
S3: Retrieval and Reordering: First, recall candidate text blocks from the specified indexes using vector similarity; then, use a cross-encoder model fine-tuned for the legal domain to re-score and sort the candidate blocks by relevance, selecting the K most relevant text blocks;
S4: Graph Augmentation: Extract legal entities from the K text blocks, query the legal knowledge graph, and obtain legal entity association information to enhance the generation context;
S5: Answer Generation: Combine the K text blocks with the legal entity association information obtained from the legal knowledge graph to form an enhanced context, input it into a large language model to generate the final answer, and attach citation traceability information for the text blocks used.

[0053] In this embodiment, as one possible approach, in S2, each task type (legal and regulatory provision query, similar case retrieval, legal text drafting, and compliance risk assessment) corresponds to a different public knowledge base recall quota ratio α.
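
For illustration, the recall quota ratio α can be turned into concrete recall counts for the two knowledge bases as sketched below. The α values in `ALPHA_BY_TASK` are hypothetical placeholders; the specification states only that each task type has its own preset α in [0, 1], with the personal knowledge base receiving 1-α.

```python
from typing import Dict, Tuple

# Hypothetical presets for illustration only; the source gives no numeric values.
ALPHA_BY_TASK: Dict[str, float] = {
    "legal and regulatory provision query": 0.9,
    "similar case retrieval": 0.8,
    "legal text drafting": 0.4,
    "compliance risk assessment": 0.6,
}


def split_recall_quota(total: int, alpha: float) -> Tuple[int, int]:
    """Split a recall budget into (public, personal) counts using ratio alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    public = round(total * alpha)
    return public, total - public


def quota_for_task(task_type: str, total: int) -> Tuple[int, int]:
    """Look up the preset alpha for a task type and split the budget."""
    return split_recall_quota(total, ALPHA_BY_TASK[task_type])
```

With a recall budget of N=100, a task with α=0.8 would recall 80 candidates from the public knowledge base and 20 from the personal knowledge base before the merged candidates are passed to re-ranking.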

[0054] In this embodiment, as one possible approach, in S4, obtaining legal entity association information includes: finding the shortest semantic path between legal entity nodes in the legal knowledge graph and converting the path information into natural language text.

[0055] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent should be determined by the appended claims.

Claims

1. An enhanced retrieval and generation system for the legal field, characterized in that it includes: a multi-source heterogeneous legal knowledge base construction and management module, used to construct and differentiate between a public knowledge base and a personal knowledge base, and to configure a knowledge source weight mechanism that dynamically adjusts the retrieval weights of the two types of knowledge base based on task type; a legal text refinement module, used to perform structural recognition and semantic-aware segmentation on legal texts, and to annotate the segmented text blocks with structured metadata; a multi-stage retrieval and reordering module, used to first perform vector similarity screening using a text embedding model optimized for the legal domain, and then use a cross-encoder model fine-tuned for the legal domain to reorder the screening results by relevance; a graph structure enhancement module, used to construct a legal knowledge graph based on entities and relationships extracted from legal texts, and to use this graph after retrieval to expand the retrieval results or recalibrate their contextual relevance; and a modular process execution engine, used to schedule the above modules and execute the end-to-end process from document ingestion, processing, and indexing to query response and answer generation.

2. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the knowledge source weight mechanism includes the following steps: determining the task type of the task submitted by the user; presetting a public knowledge base recall quota ratio α according to the determined task type, the corresponding personal knowledge base recall quota ratio being 1-α, where α∈[0,1]; and, based on the recall quota ratios of the public knowledge base and the personal knowledge base, performing retrieval for the task submitted by the user in the public knowledge base and / or the personal knowledge base.

3. The enhanced retrieval and generation system for the legal field according to claim 2, characterized in that task type determination is achieved in any of the following ways: Method 1: automatically classifying the task type using a large language model; Method 2: fine-tuning BERT to obtain a small classification model for automatic task classification; Method 3: manual selection by the user on the front-end interface.

4. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the legal text refinement module includes: a legal text structure recognition unit, having a built-in legal text structure parsing rule base and used to identify standard logical paragraphs in legal texts by combining multi-dimensional clues, namely keyword matching, typesetting features, contextual semantic coherence, and paragraph length distribution; a semantic-aware segmentation unit, used to segment legal texts with standard logical paragraphs as basic units, and to perform secondary aggregation-based sub-segmentation on ultra-long paragraphs according to the semantic similarity between sentences; a legal element protection unit, used to enforce a mandatory non-segmentation strategy for units in legal texts that have independent legal significance; and a multi-granularity text block and metadata annotation unit, used to generate text blocks of multiple granularities from the same legal text and to annotate the text blocks with structured metadata.

5. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the multi-stage retrieval and reordering module incorporates a multi-stage retrieval and reordering mechanism that searches the knowledge base for the most relevant text blocks and filters them, in two sequential stages: a first stage of vector retrieval screening based on vector similarity, in which all segmented text blocks in the knowledge base are converted in advance into document vectors using a text embedding model optimized for the legal domain and stored in a vector database; the user query is converted into a query vector using the text embedding model optimized for the legal domain; and, in the vector database, the cosine similarity between the query vector and all document vectors is calculated and the top N candidate text blocks with the highest similarity are recalled; and a second stage of fine-grained re-ranking based on a cross-encoder, in which the fine-tuned cross-encoder model is used to re-rank the N candidate text blocks recalled in the first stage, specifically including the following steps: the user query is concatenated with each candidate text block to form an input sequence, which is fed into the fine-tuned cross-encoder model for joint encoding, yielding a re-scoring score that represents the relevance of each candidate text block; and all candidate blocks are re-ranked by their re-scoring scores, and the top K text blocks are selected as high-value context and fed into the large language model to generate the final answer.

6. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the text embedding model optimized for the legal domain is obtained by further pre-training and fine-tuning a general text embedding model on a large-scale legal corpus, and the text embedding model optimized for the legal domain adopts a contrastive learning strategy during training.

7. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the fine-tuned cross-encoder model is obtained by fine-tuning a general cross-encoder model on labeled data from the legal field; the labeled legal-domain data consists of the following three parts: positive samples, being "legal question - relevant text block" pairs, where a relevant text block is a text block in the knowledge base that is related to the legal question; random negative samples, being "legal question - irrelevant text block" pairs, where an irrelevant text block is a text block in the knowledge base that is not related to the legal question; and hard negative samples, being "legal question - text block that is superficially relevant but not substantially supportive" pairs, where such a text block is one that was recalled by the first-stage vector retrieval but judged irrelevant on manual review; all samples are reviewed and labeled by legal professionals; the training process for fine-tuning the general cross-encoder model includes: concatenating each legal question and its corresponding text block into a single input sequence, the general cross-encoder model outputting a scalar score; the training objective uses a pairwise ranking loss function under which, for the same question, positive samples score higher than negative samples; and training is performed on a GPU cluster dedicated to the legal domain, using tens of thousands to hundreds of thousands of labeled data pairs, and runs for multiple iterations until the ranking metric on the validation set converges.

8. The enhanced retrieval and generation system for the legal field according to claim 1, characterized in that the graph structure enhancement module includes: a legal knowledge graph construction unit, used to extract legal entities from text blocks using a named entity recognition model fine-tuned for the legal domain, to identify semantic relationships between legal entities by combining rules and models, and to construct a legal knowledge graph with legal entities as nodes and semantic relationships as edges; and a graph query and context enhancement unit, used to extract legal entities from the reordered text blocks, to obtain semantic paths between legal entities by querying the legal knowledge graph so as to expand the context, and / or to use graph neural networks to generate graph enhancement features based on entity nodes and their neighbors so as to perform a secondary recalibration of the relevance of the text blocks.

9. An enhanced retrieval and generation method for the legal field, characterized in that the method uses the enhanced retrieval and generation system for the legal field according to any one of claims 1-8 and includes the following steps: S1: Document Processing and Indexing: obtain legal documents and distinguish their public or personal attributes; perform structural recognition and semantic-aware segmentation to generate text blocks with metadata; vectorize the text blocks using a text embedding model optimized for the legal domain and store them in a vector database index according to knowledge base type; S2: Query Processing: receive a user query, determine its task type, and determine the proportion of retrieval from the public and personal knowledge base indexes based on preset weights; S3: Retrieval and Reordering: first, recall candidate text blocks from the specified indexes using vector similarity; then, use a cross-encoder model fine-tuned for the legal domain to re-score and sort the candidate blocks by relevance, selecting the K most relevant text blocks; S4: Graph Augmentation: extract legal entities from the K text blocks, query the legal knowledge graph, and obtain legal entity association information to enhance the generation context; S5: Answer Generation: combine the K text blocks with the legal entity association information obtained from the legal knowledge graph to form an enhanced context, input it into a large language model to generate the final answer, and attach citation traceability information for the text blocks used.

10. The enhanced retrieval and generation method for the legal field according to claim 9, characterized in that, in S2, the task types include at least: legal and regulatory provision query, similar case retrieval, legal text drafting, and compliance risk assessment, each task type corresponding to a different public knowledge base recall quota ratio α; and, in S4, obtaining legal entity association information includes: finding the shortest semantic path between legal entity nodes in the legal knowledge graph and converting the path information into natural language text.