File analysis method and device, electronic equipment and storage medium
By constructing a traceability knowledge base with line-level positioning identifiers, the invention addresses the lack of precise sources for results that large models generate in judicial case file analysis. It makes the analysis results transparent and precisely traceable, improving their verifiability and usability in judicial practice.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI IFLYTEK INTELLIGENT SYST
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, the results that large models generate in judicial case file analysis lack a precise mapping to the original source. Manual verification is difficult and the positioning granularity is coarse, which falls short of the evidence-chain accuracy that judicial trials require.
A traceability knowledge base containing line-level positioning identifiers is constructed. By decomposing the analysis results and reverse-retrieving the traceability knowledge base, a precise mapping between the generated content and the original case files is established, achieving line-level traceability.
This ensures that every analytical conclusion generated is verifiable and traceable, improving the efficiency and accuracy of judicial personnel in verifying evidence and reviewing case files, and enhancing the interpretability and credibility of case file analysis results.
Figure CN122196124A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a case file analysis method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the development of large language model technology, its application in intelligent analysis of judicial case files has become increasingly widespread, and it has gradually become a core research direction of smart justice.
[0003] Currently, intelligent analysis solutions for electronic case files mainly include end-to-end large-scale model-based direct generation solutions, which automatically generate analysis results by inputting the entire case file into the model; and rule-based or template-based generation solutions.
[0004] However, both of these approaches have significant drawbacks in practical applications. The generation process of a large model is a black box, so the generated results are insufficiently transparent. In actual judicial applications, their accuracy still has to be verified manually, which requires searching and comparing against a massive volume of case files. This process is not only cumbersome and inefficient, but also fails to meet the accuracy requirements of the evidence chain, so the credibility of the generated results cannot fully meet practical needs.
Summary of the Invention
[0005] This invention provides a dossier analysis method, apparatus, electronic device, and storage medium to solve the problems of difficulty in verifying large model analysis results and lack of accurate sources in the prior art.
[0006] This invention provides a method for analyzing case files, comprising: obtaining an analysis instruction for a target case file, and generating analysis results for the target case file based on the analysis instruction; retrieving a source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results, where the source tracing knowledge base contains multiple text blocks of the target case file together with the text content and line-level positioning identifier of each text block; and constructing a mapping relationship between each source tracing statement and its corresponding original text block, and, based on the mapping relationship and the corresponding line-level positioning identifiers, generating a case file analysis result with line-level source tracing capability for the target case file.
[0007] According to a dossier analysis method provided by the present invention, the source knowledge base further includes a text vector corresponding to the text content of each text block; Based on the analysis results, the source tracing knowledge base is retrieved to obtain the original text block corresponding to each source tracing statement in the analysis results, including: Each source tracing statement is segmented into words to obtain a source tracing word segmentation set, and the source tracing knowledge base is matched based on the source tracing word segmentation set to obtain a first text block set; Each source tracing statement is sparsely encoded to obtain a source tracing sparse vector, and the source tracing sparse vector is sparsely matched with the text vector of each text block in the source tracing knowledge base to obtain a second set of text blocks; Each source statement is densely encoded to obtain a source dense vector. The source dense vector is then densely matched with the text vectors of each text block in the first text block set and the text vectors of each text block in the second text block set to obtain the original text block corresponding to each source statement.
[0008] According to a dossier analysis method provided by the present invention, the step of densely matching the source tracing dense vector with the text vector of each text block in the first text block set and the text vector of each text block in the second text block set to obtain the original text block corresponding to each source tracing statement includes: The first text block set and the second text block set are merged and deduplicated, and the source tracing dense vector is densely matched with the text vector of each text block in the merged and deduplicated text block set to obtain the candidate text block set corresponding to each source statement; Based on at least one of the case type corresponding to the target case file, the analysis type to which the corresponding source statement belongs, and the keyword overlap between the corresponding source statement and each candidate text block in the candidate text block set, the candidate text blocks are filtered to obtain the target text block set corresponding to each source statement. Based on the target text block set, the original text block corresponding to each source statement is determined.
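The merge, deduplication, and keyword-overlap filtering described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the block IDs, and the `min_overlap` threshold are all assumptions introduced for illustration, and the dense-matching step that sits between merging and filtering is left out. Filtering by case type or analysis type is likewise not modeled.

```python
from typing import Dict, List, Set


def merge_and_deduplicate(first: List[str], second: List[str]) -> List[str]:
    """Merge two recalled lists of block IDs, keeping first-seen order."""
    return list(dict.fromkeys(first + second))


def keyword_overlap(statement_terms: Set[str], block_terms: Set[str]) -> float:
    """Fraction of the statement's terms that also appear in the block."""
    if not statement_terms:
        return 0.0
    return len(statement_terms & block_terms) / len(statement_terms)


def filter_by_overlap(
    statement_terms: Set[str],
    candidate_terms: Dict[str, Set[str]],
    min_overlap: float = 0.5,  # hypothetical threshold, not taken from the source
) -> List[str]:
    """Keep the candidates whose keyword overlap reaches the threshold."""
    return [
        block_id
        for block_id, terms in candidate_terms.items()
        if keyword_overlap(statement_terms, terms) >= min_overlap
    ]


# Illustrative data only: block IDs and term sets are made up.
merged = merge_and_deduplicate(["b1", "b2", "b3"], ["b3", "b4"])
statement = {"defendant", "breach", "contract"}
block_terms = {
    "b1": {"defendant", "breach", "contract", "site"},
    "b2": {"plaintiff", "payment"},
    "b3": {"defendant", "contract"},
    "b4": {"contract"},
}
kept = filter_by_overlap(statement, {b: block_terms[b] for b in merged})
```

Using `dict.fromkeys` keeps the order in which blocks were first recalled, so a later ranking step sees a stable candidate list.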
[0009] According to a dossier analysis method provided by the present invention, determining the original text block corresponding to each source statement based on the target text block set includes: Based on the matching scores of each text block in the target text block set in text matching, sparse matching and dense matching, determine the text matching score set, sparse matching score set and dense matching score set corresponding to each source statement; The scores of the text matching score set, sparse matching score set, and dense matching score set are normalized respectively. Based on the case type, the score weights of the normalized text matching score set, sparse matching score set, and dense matching score set are determined. Based on the score weights, the normalized text matching score set, sparse matching score set, and dense matching score set are fused to obtain the comprehensive score set corresponding to each source statement; based on the comprehensive score set corresponding to each source statement and the target text block set, the original text block corresponding to each source statement is determined.
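As a rough illustration of the normalization and weighted fusion just described, the sketch below min-max normalizes the three score sets and combines them with weights chosen per case type. The source does not specify the normalization formula or any weight values; min-max scaling, the `WEIGHTS_BY_CASE_TYPE` table, and the sample scores are assumptions.

```python
from typing import Dict

# Hypothetical weights for (text, sparse, dense) scores per case type.
WEIGHTS_BY_CASE_TYPE = {
    "contract_dispute": (0.3, 0.3, 0.4),
    "default": (1 / 3, 1 / 3, 1 / 3),
}


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {block_id: 1.0 for block_id in scores}
    return {block_id: (s - lo) / (hi - lo) for block_id, s in scores.items()}


def fuse_scores(
    text: Dict[str, float],
    sparse: Dict[str, float],
    dense: Dict[str, float],
    case_type: str,
) -> Dict[str, float]:
    """Weighted sum of the three normalized score sets per text block."""
    weights = WEIGHTS_BY_CASE_TYPE.get(case_type, WEIGHTS_BY_CASE_TYPE["default"])
    normalized = [min_max_normalize(s) for s in (text, sparse, dense)]
    block_ids = set().union(*normalized)
    # A block missing from one score set contributes 0 for that set.
    return {
        block_id: sum(w * n.get(block_id, 0.0) for w, n in zip(weights, normalized))
        for block_id in block_ids
    }


# Illustrative scores only.
fused = fuse_scores(
    text={"b1": 10.0, "b2": 5.0, "b3": 0.0},
    sparse={"b1": 0.2, "b2": 0.6, "b3": 1.0},
    dense={"b1": 0.5, "b2": 0.9, "b3": 0.7},
    case_type="contract_dispute",
)
best_block = max(fused, key=fused.get)
```

Normalizing before fusing matters because BM25 scores, sparse dot products, and cosine similarities live on different scales; without it the largest-valued score set would dominate the sum.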
[0010] According to a case file analysis method provided by the present invention, the mapping relationship includes a first-level mapping relationship or a multi-level mapping relationship; the step of constructing the mapping relationship between each source tracing statement and its corresponding original text block includes: determining the generation source of each source tracing statement, the generation source being either direct generation from the target case file or indirect generation from an intermediate conclusion; if a source tracing statement is generated directly from the target case file, constructing the first-level mapping relationship, which is a key-value storage structure whose key is the line-and-sentence identifier of that statement in the analysis results and whose value is the text content and line-level positioning identifier of the corresponding original text block; if a source tracing statement is generated indirectly from an intermediate conclusion, constructing the multi-level mapping relationship, which is a tracing link that starts at the line-and-sentence identifier of that statement in the analysis results and ends at the line-and-sentence identifier of the intermediate conclusion from which it was generated.
[0011] According to a file analysis method provided by the present invention, the step of generating a file analysis result with line-level tracing capability for the target file based on the mapping relationship and the corresponding line-level location identifier includes: Responding to a source tracing operation for any source tracing statement in the analysis results; If the mapping relationship of any of the source tracing statements is the multi-level mapping relationship, then the source tracing link is searched level by level until the target source tracing statement with the first-level mapping relationship is traced. Based on the line-level positioning identifier of the original text block corresponding to the target source tracing statement, a file analysis result with line-level tracing capability is generated for the target file. If the mapping relationship is the first-level mapping relationship, then based on the line-level positioning identifier of the original text block corresponding to any of the tracing statements, a file analysis result with line-level tracing capability is generated for the target file.
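One way to picture the level-by-level lookup is the sketch below. It assumes that each tracing link points to a single parent statement and that statement IDs such as `R3-S1` are plain strings; both are simplifications for illustration, and the sample mapping data is made up.

```python
from typing import Dict

# First-level mapping: statement ID -> original text and its locator.
first_level: Dict[str, dict] = {
    "M1-S1": {
        "text": "The defendant withdrew from the site before the project was completed.",
        "locator": {"File_ID": "F001", "Page_No": 12, "Line_No": 7, "Sentence_No": 2},
    }
}

# Multi-level mapping: statement ID -> ID of the conclusion it was derived from.
tracing_links: Dict[str, str] = {
    "R3-S1": "R2-S1",  # final conclusion derived from an intermediate one
    "R2-S1": "M1-S1",  # intermediate conclusion derived from a sourced statement
}


def resolve_locator(
    statement_id: str,
    first_level: Dict[str, dict],
    tracing_links: Dict[str, str],
) -> dict:
    """Follow the tracing link until a statement with a first-level mapping is found."""
    visited = set()
    current = statement_id
    while current not in first_level:
        if current in visited:
            raise ValueError(f"cycle in tracing link at {current!r}")
        visited.add(current)
        if current not in tracing_links:
            raise KeyError(f"no mapping recorded for {current!r}")
        current = tracing_links[current]
    return first_level[current]["locator"]


locator = resolve_locator("R3-S1", first_level, tracing_links)
```

A statement that already has a first-level mapping resolves immediately, so the same function serves both branches of the step.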
[0012] According to a case file analysis method provided by the present invention, the step of generating analysis results of the target case file based on the analysis instructions includes: Based on the analysis instructions, a structured query task is generated, and the structured query task is decomposed to obtain multiple sub-tasks with logical dependencies. Based on the logical dependencies between the multiple subtasks, the multiple subtasks are assembled and ordered for execution to obtain a task execution sequence; Based on the task execution sequence and the task complexity of each subtask, the tasks are executed sequentially to obtain the analysis results of the target dossier.
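Ordering subtasks by their logical dependencies amounts to a topological sort. The sketch below uses Python's standard `graphlib` module for that; the subtask names are hypothetical, and the handling of task complexity mentioned above is not modeled.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical subtasks: each key maps to the subtasks it depends on.
subtasks: Dict[str, Set[str]] = {
    "extract_claims": set(),
    "extract_defenses": set(),
    "identify_disputes": {"extract_claims", "extract_defenses"},
    "summarize": {"identify_disputes"},
}


def plan_execution(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an execution order in which every subtask follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = plan_execution(subtasks)
```

Subtasks with no dependency between them, such as the two extraction tasks here, may appear in either order and could in principle run in parallel.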
[0013] The present invention also provides a file analysis device, comprising: An analysis unit is used to obtain analysis instructions for a target file and generate analysis results for the target file based on the analysis instructions. The retrieval unit is used to retrieve the source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results; the source tracing knowledge base contains multiple text blocks in the target file, as well as the text content and line-level positioning identifier of each text block; The tracing unit is used to construct a mapping relationship between each tracing statement and its corresponding original text block, and based on the mapping relationship and the corresponding line-level positioning identifier, generate a file analysis result with line-level tracing capability for the target file.
[0014] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the file analysis method as described above.
[0015] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the dossier analysis method as described above.
[0016] The case file analysis method, apparatus, electronic device, and storage medium provided by this invention construct a source knowledge base containing line-level positioning identifiers and then decompose the analysis results into source sentences for reverse retrieval of the knowledge base. This establishes a mapping relationship between the generated content and the original case file that is accurate to the line and sentence level, and even to the coordinate level. This effectively solves the black box problem that exists in the current application of large models in the judicial field, making every generated analysis conclusion verifiable and traceable. The line-level positioning identifiers allow users to directly access the original source by simply clicking on the analysis result without having to manually search through massive amounts of case files. This greatly improves the efficiency and accuracy of judicial personnel in verifying evidence and reviewing case files, and enhances the interpretability and credibility of the case file analysis results.
Description of the Drawings
[0017] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0018] Figure 1 is a flowchart of the case file analysis method provided by the present invention; Figure 2 is a flowchart of the knowledge base retrieval process provided by the present invention; Figure 3 is a flowchart of the task planning process provided by the present invention; Figure 4 is a schematic structural diagram of the case file analysis device provided by the present invention; Figure 5 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0020] With the rapid development of Large Language Model (LLM) technology, its application in intelligent analysis of judicial case files is becoming increasingly widespread, and it has become a core research direction in the field of smart justice.
[0021] Current intelligent analysis solutions for electronic case files can be mainly divided into two categories: one is a direct generation solution based on an end-to-end large model, which directly inputs the entire case file into the model, and the model automatically generates analysis results such as summaries, points of contention, and fact-findings based on semantic understanding, but this process lacks explicit recording of intermediate steps; the other is a generation solution based on rules or templates, which integrates traditional natural language processing technology and extracts fragments from the case file as the basis for generation through keyword matching, named entity recognition, etc.
[0022] However, both of these approaches have significant drawbacks in practical applications, hindering their in-depth application in judicial practice. Specifically, on the one hand, the generation process of LLM is typically a black box, and its output analysis conclusions often lack a direct mapping relationship with the original case files, resulting in untraceable results and insufficient transparency. On the other hand, even if some existing approaches can provide source identification, their granularity is mostly limited to the document or chapter level, which is very coarse and cannot accurately locate specific sentences or lines of text. This means that after obtaining the analysis results, judicial staff still need to manually search and compare them in massive amounts of electronic case files to verify their accuracy. This process is not only cumbersome and inefficient, but also fails to meet the stringent requirements of judicial trials for the accuracy of the evidence chain, making it difficult for the credibility of the analysis results to fully meet practical needs.
[0023] In response, this invention provides a case file analysis method that aims to solve the problem that the analysis results generated by current large models lack accurate original source mapping, leading to difficulties in manual verification and coarse granularity of location. This method achieves transparency and accuracy of the analysis results, as well as efficient row-level reverse location tracing, thereby improving the verifiability of the analysis results and their usability in judicial practice.
[0024] Figure 1 is a flowchart of the case file analysis method provided by the present invention. As shown in Figure 1, the method is applied to case file analysis systems (hereinafter referred to as "the system"), such as smart court auxiliary case handling systems, intelligent review systems, and law firm case file management platforms. The method includes:
Step 110: Obtain the analysis instruction for the target case file and generate the analysis results of the target case file based on the analysis instruction.
Step 120: Based on the analysis results, retrieve the source tracing knowledge base to obtain the original text block corresponding to each source tracing statement in the analysis results; the source tracing knowledge base contains multiple text blocks of the target case file, as well as the text content and line-level positioning identifier of each text block.
Step 130: Construct a mapping relationship between each source tracing statement and its corresponding original text block, and, based on the mapping relationship and the corresponding line-level positioning identifiers, generate a case file analysis result with line-level source tracing capability for the target case file.
[0025] Specifically, during judicial trials or case reviews, the system first needs to identify the case file to be analyzed, i.e., the target case file. The target case file refers to the collection of all electronic documents related to a specific case, which includes various types of legal documents such as indictments, answers, lists of evidence, court transcripts, and scanned copies of contracts. The file formats may include JPG (Joint Photographic Experts Group) images, PDF (Portable Document Format), Word documents, etc.
[0026] After identifying the target case file, the system can also obtain analysis instructions for the target case file. These analysis instructions can be natural language instructions input by the user, which carry the user's specific business needs. For example, the user may input natural language instructions such as "Please summarize the plaintiff's claims and basis in this case" or "Extract the focus of the dispute regarding jurisdiction in this case." Alternatively, they can be standardized instructions automatically preset by the system based on the case type corresponding to the target case file. This embodiment of the invention does not specifically limit the specifics of these instructions.
[0027] After receiving the analysis instruction for the target case file, the system enters the generation phase. Using a pre-built artificial intelligence model, such as a large language model (LLM) with long-text understanding capabilities, it performs deep semantic analysis on the content of the target case file. Specifically, based on the intent of the analysis instruction, the model extracts key information from the complex materials in the target case file and generates one or more summaries or abstracts. These texts constitute the analysis results of the target case file. For example, in a contract dispute case, the generated analysis result might be the statement: "The defendant withdrew from the site before the project was completed, constituting a de facto breach of contract."
[0028] It is important to note that although the analysis results generated at this point respond to the business / user requirements corresponding to the analysis instructions in terms of content, they are essentially still text summaries and have not yet established a precise mapping relationship with the specific page numbers, line numbers, etc. in the target file. Users cannot directly verify their accuracy.
[0029] Therefore, to make the above analysis results verifiable, this embodiment of the invention also requires finding supporting evidence in a pre-built source tracing knowledge base. This source tracing knowledge base is a structured index specifically built by the system for the target file. Specifically, the system first standardizes the target file, assigns it a number, and assigns a unique identifier to each file, thus obtaining the file number and file identifier (File_ID) of the target file. Next, it uses OCR (Optical Character Recognition) and layout analysis techniques to extract the full text content and finely divides it into multiple text blocks. These text blocks constitute the basic unit of the source tracing knowledge base, and each text block stores extremely rich metadata, including not only the text content of the text block but, more importantly, the line-level positioning identifier of the text block.
[0030] Here, the line-level positioning identifier records the precise location of the corresponding text block in the target case file. It includes the identifier of the file the block belongs to (File_ID), the number of the page within that file (Page_No), the number of the line within that page (Line_No), and the number of the contained statement (Sentence_No). It can even include the coordinates of each character of the text block in the image of the corresponding page, such as the horizontal and vertical coordinates (X, Y) of the character's upper-left corner and the character's width and height (W, H).
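The fields listed above map naturally onto a small record type. The sketch below is one possible shape, not the patent's storage format: the class names, the `bounding_box` helper, and the sample values are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x, y, w, h): upper-left corner plus width and height of one character box.
CharBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class LineLocator:
    """Hypothetical line-level positioning identifier of a text block."""
    file_id: str       # File_ID
    page_no: int       # Page_No
    line_no: int       # Line_No
    sentence_no: int   # Sentence_No
    char_boxes: Tuple[CharBox, ...] = ()

    def bounding_box(self) -> Optional[CharBox]:
        """Smallest box covering all characters, e.g. for highlighting."""
        if not self.char_boxes:
            return None
        left = min(x for x, _, _, _ in self.char_boxes)
        top = min(y for _, y, _, _ in self.char_boxes)
        right = max(x + w for x, _, w, _ in self.char_boxes)
        bottom = max(y + h for _, y, _, h in self.char_boxes)
        return (left, top, right - left, bottom - top)


@dataclass(frozen=True)
class TextBlock:
    """One entry of the source tracing knowledge base."""
    text: str
    locator: LineLocator


# Illustrative values only.
block = TextBlock(
    text="The defendant withdrew from the site before the project was completed.",
    locator=LineLocator(
        file_id="F001",
        page_no=12,
        line_no=7,
        sentence_no=2,
        char_boxes=((100.0, 200.0, 10.0, 12.0), (110.0, 200.0, 10.0, 12.0)),
    ),
)
highlight = block.locator.bounding_box()
```

Keeping per-character boxes, rather than a single box per block, is what allows a viewer to highlight a sentence that starts or ends mid-line.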
[0031] Specifically, the system first breaks down the analysis results into sentence-level granularities, decomposing them into several independent source statements. For example, a complex case summary, such as "Dispute: The plaintiff claims the defendant should pay liquidated damages for overdue payment, while the defendant claims the contract is invalid and no payment is required," can be broken down into source statements based on semantics or punctuation. Here, source statements refer to the smallest granular inductive statements in the analysis results. By breaking down the analysis results into the smallest granular inductive statements before tracing their origins, the ambiguity in overall source tracing can be effectively avoided.
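A punctuation-based split of this kind could look like the sketch below. It is a deliberately naive assumption: it breaks on sentence-ending punctuation and semicolons (Western and CJK forms), and it would mis-split abbreviations such as "e.g." or decimal numbers, so a real system would need semantic splitting on top of it.

```python
import re
from typing import List

# Split right after sentence-ending punctuation or a semicolon,
# swallowing any whitespace that follows it.
_BOUNDARY = re.compile(r"(?<=[.!?;。！？；])\s*")


def split_into_source_statements(analysis_result: str) -> List[str]:
    """Break an analysis result into sentence-level source statements."""
    return [part.strip() for part in _BOUNDARY.split(analysis_result) if part.strip()]


statements = split_into_source_statements(
    "The plaintiff claims liquidated damages for overdue payment. "
    "The defendant claims the contract is invalid; no payment is required."
)
```

Each resulting statement can then be traced on its own, which avoids attributing a whole paragraph to a single source passage.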
[0032] Next, the system uses the resulting source statements as query conditions for high-precision retrieval and matching in the source tracing knowledge base. By calculating the semantic similarity, keyword overlap, and vector similarity between each source statement and the text blocks in the knowledge base, the system can pick out, from the large number of segmented text blocks, the original text block that best matches each analysis conclusion (each source statement) and best serves as its factual basis. For example, if the analysis result mentions "the plaintiff has paid the down payment," the system can retrieve the text block of a bank transfer receipt in the knowledge base, which records the specific transfer amount and time.
[0033] After identifying the source of each analytical conclusion, the system needs to solidify this correspondence between each conclusion and its source to form a mapping relationship. This mapping relationship acts like invisible bonds, binding the source statements generated by the model to the retrieved original text blocks one by one. This binding can be one-to-one, meaning one analytical conclusion corresponds to one original text block, or it can be one-to-many, meaning one analytical conclusion corresponds to multiple original text blocks with multiple sources; this embodiment of the invention does not specifically limit this.
[0034] Finally, the system injects these mapping relationships, along with the line-level location identifiers of the original text blocks extracted from the source knowledge base, into the analysis results. This data augmentation and encapsulation process generates the final dossier analysis result with line-level source tracing capabilities. This result is no longer just dry text, but a data object containing rich interactive information.
[0035] When the results are presented on the front-end interface, each analysis conclusion implicitly links to the specific location of the corresponding original text block in the target file. Users simply click on an analysis conclusion, and the system automatically retrieves and opens the corresponding file in the target file based on the associated line-level location identifiers, such as File_ID, Page_No, Line_No, Sentence_No, and horizontal and vertical coordinates. It then jumps to the specified page number and precisely marks the corresponding original text block (accurate to the line and sentence level) on that page using highlighting, selection, or other methods, thus achieving WYSIWYG line-level precise source tracing.
[0036] The case file analysis method provided by this invention constructs a source knowledge base containing line-level positioning identifiers and then decomposes the analysis results into source sentences and performs a reverse retrieval of the knowledge base. This establishes a mapping relationship between the generated content and the original case file that is accurate to the line and sentence level, and even at the coordinate level. This effectively solves the black box problem that exists in the current application of large models in the judicial field, making every generated analysis conclusion verifiable and traceable. The line-level positioning identifiers allow users to directly access the original source by simply clicking on the analysis result without having to manually search through massive amounts of case files. This greatly improves the efficiency and accuracy of judicial personnel in verifying evidence and reviewing case files, and enhances the interpretability and credibility of the case file analysis results.
[0037] Based on the above embodiments, the source tracing knowledge base also includes text vectors corresponding to the text content of each text block; step 120 includes: Each source statement is segmented into words to obtain a source segmentation set, and the source knowledge base is matched based on the source segmentation set to obtain the first text block set; Each source tracing statement is sparsely encoded to obtain a source tracing sparse vector, and the source tracing sparse vector is sparsely matched with the text vector of each text block in the source tracing knowledge base to obtain a second set of text blocks. Each source statement is densely encoded to obtain a source dense vector. The source dense vector is then densely matched with the text vectors of each text block in the first text block set and the text vectors of each text block in the second text block set to obtain the original text block corresponding to each source statement.
[0038] Specifically, in order to further improve the recall and precision of source tracing, the data structure of the source tracing knowledge base is enhanced in this embodiment of the invention. That is, it not only stores the text content and line-level positioning identifier of each text block, but also stores the text vector corresponding to each text block obtained in advance through vectorization. The text vector may include dense vectors representing semantics and sparse vectors representing keyword weights.
[0039] Figure 2 is a flowchart of the knowledge base retrieval process provided by the present invention. As shown in Figure 2, retrieving the source tracing knowledge base based on the analysis results may specifically include the following. First, the system performs word segmentation on each source statement obtained from the splitting step, dividing it into semantically independent word units. These word units constitute the source segmentation set of that statement. For example, the source statement "the defendant failed to pay the payment on time" is segmented into "defendant", "failed to pay on time", and "payment".
[0040] Subsequently, based on the source segmentation set of each source statement, the system performs text matching in the source tracing knowledge base using inverted indexes or keyword frequency, for example with the BM25 (Best Matching 25) algorithm, to quickly select the text blocks whose wording overlaps most with the source statement, such as the top-K1 text blocks (e.g., K1 = 20). These text blocks constitute the first text block set S_bm25 of that source statement. The text blocks in this set typically contain the core keywords of the source statement, which gives the retrieval its exact-match capability.
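To make the keyword-based recall concrete, here is a compact BM25 scorer over pre-tokenized text blocks. It is a sketch under assumptions: it uses the non-negative IDF variant popularized by Lucene, the parameters `k1 = 1.5` and `b = 0.75` are common defaults rather than values from the source, and the toy blocks are invented.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_terms: Sequence[str],
    blocks: Sequence[Sequence[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized text block against the query terms with BM25."""
    n = len(blocks)
    avg_len = sum(len(block) for block in blocks) / n
    # Document frequency: in how many blocks each term occurs.
    doc_freq = Counter(term for block in blocks for term in set(block))
    scores = []
    for block in blocks:
        term_freq = Counter(block)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(block) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring blocks, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy knowledge base: three already-segmented text blocks.
blocks = [
    ["defendant", "failed", "pay", "payment", "on", "time"],
    ["plaintiff", "paid", "down", "payment"],
    ["court", "hearing", "scheduled"],
]
scores = bm25_scores(["defendant", "failed", "pay", "payment"], blocks)
first_set = top_k(scores, k=2)
```

In practice an inverted index would avoid scoring every block, but the ranking it produces is the same as this exhaustive loop.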
[0041] Meanwhile, because keyword matching cannot handle synonyms or variants of legal terms, the system also performs sparse coding on each source statement. Unlike word segmentation, sparse coding, for example with the SPLADE (Sparse Lexical and Expansion Model for First Stage Ranking) model, considers not only the words that appear in the statement but also predicts related latent words from their context, and assigns a different weight to each word. For example, the legal term "breach of contract" receives a higher weight than the common word "of". The result is a high-dimensional sparse vector for each source statement, that is, the source sparse vector.
[0042] Next, the system performs sparse matching on the source tracing knowledge base using the source sparse vector of each source statement. That is, it matches the source sparse vector of each source statement against the sparse vectors among the text vectors of all text blocks in the knowledge base, in order to recall text blocks that are not literally identical to the source statement but have a highly similar distribution of keyword weights. For example, the top-K2 text blocks (e.g., K2 = 20) are recalled, and these text blocks constitute the second text block set S_sparse of the corresponding source statement.
[0043] Here, it is worth noting that the method for determining the traceability sparse vector is the same as that for the sparse vector in the text vector of the text block, that is, the same sparse coding method is used to process the traceability statement and the text block. For example, the SPLADE model can be used for sparse coding to obtain the traceability sparse vector of each traceability statement and the sparse vector of each text block.
[0044] Moreover, the model used for sparse coding here, such as the SPLADE model, can also be fine-tuned on legal corpora. A model fine-tuned for the judicial domain can assign higher weights to legal terms than to common words, and thus handles variants of legal terms better.
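Sparse matching as described in the preceding paragraphs can be reduced to a dot product over the terms two sparse vectors share. The sketch below represents each sparse vector as a term-to-weight dictionary; the weights are hand-written stand-ins, not the output of SPLADE or any other real encoder, and the block IDs are hypothetical.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]


def sparse_dot(query: SparseVector, block: SparseVector) -> float:
    """Dot product over the terms that both sparse vectors contain."""
    if len(query) > len(block):
        query, block = block, query  # iterate over the smaller vector
    return sum(weight * block[term] for term, weight in query.items() if term in block)


def sparse_recall(
    query: SparseVector, blocks: Dict[str, SparseVector], k: int
) -> List[Tuple[str, float]]:
    """Top-k text blocks by sparse dot product, best first."""
    scored = [(block_id, sparse_dot(query, vec)) for block_id, vec in blocks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written weights: the query mentions "breach", and term expansion has
# added the related word "default", which is what block b1 actually contains.
query_vec = {"breach": 1.8, "contract": 1.2, "default": 0.6}
block_vecs = {
    "b1": {"default": 1.5, "contract": 1.0, "obligation": 0.7},
    "b2": {"hearing": 1.0, "court": 0.9},
}
second_set = sparse_recall(query_vec, block_vecs, k=1)
```

Block `b1` is recalled even though it never contains the word "breach", which is the behavior term expansion is meant to provide.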
[0045] Finally, after obtaining the candidate first text block set and second text block set through the above two-way recall, in order to select the original text block with the closest meaning to the true value from these candidates, the system performs dense coding on each traceability statement. For example, using pre-trained models such as BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach) to compress each traceability statement into a low-dimensional real-valued vector, that is, the traceability dense vector of each traceability statement, which can capture the deep semantics and context logic of the corresponding traceability statement.
[0046] Subsequently, instead of searching the entire database, the system utilizes the source dense vector of each source statement to perform targeted dense matching on text blocks in the first and second text block sets. Specifically, it matches the source dense vector of each source statement with the dense vectors of all text blocks in the first and second text block sets, such as calculating cosine similarity, to determine the original text block corresponding to each source statement. This is essentially a reordering process, using deep semantic understanding to eliminate text blocks (noisy data) in the first and second text block sets that are merely literal or keyword matches but have irrelevant meanings, ultimately locking in one or more text blocks with the highest semantic similarity as the original text block corresponding to each source statement.
[0047] In this embodiment of the invention, coarse screening and recall are first performed through text matching and sparse matching to ensure that the retrieval scope covers all possibilities of literal matching and term expansion matching. Then, semantic fine ranking is performed through dense matching to ensure the uniqueness and accuracy of the final tracing results in judicial semantics. This layer-by-layer progressive retrieval architecture greatly improves the robustness of hierarchical tracing in complex judicial case file scenarios and avoids the problem of missed detection or false detection caused by a single retrieval method.
[0048] Based on the above embodiments, dense matching is performed between the source tracing dense vector and the text vector of each text block in the first text block set and the text vector of each text block in the second text block set to obtain the original text block corresponding to each source tracing statement, including: The first and second text block sets are merged and deduplicated, and the source dense vector is densely matched with the text vector of each text block in the merged and deduplicated text block set to obtain the candidate text block set corresponding to each source statement; Based on at least one of the following factors: the case type corresponding to the target file, the analysis type to which the corresponding source statement belongs, and the keyword overlap between the corresponding source statement and each candidate text block in the candidate text block set, each candidate text block is filtered to obtain the target text block set corresponding to each source statement. Based on the target text block set, determine the original text block corresponding to each source statement.
[0049] Specifically, the process of performing dense matching on the first text block set and the second text block set based on the source dense vector can include: First, the system can merge the first text block set S_bm25 and the second text block set S_sparse to obtain their union. Next, this union can be deduplicated, because the same high-quality text block is likely to be hit by both keyword retrieval and sparse vector retrieval. Deduplication, such as a uniqueness check based on File_ID and Line_No, avoids wasting computational resources by repeatedly scoring the same text block. After deduplication, the set of candidate text blocks S_candidate corresponding to each source statement is obtained.
[0050] Subsequently, the system can use the source dense vector of each source statement to perform dense matching on all text blocks in the set S_candidate. That is, the semantic similarity, such as cosine similarity, between the source dense vector of each source statement and the dense vector of each text block in the set is calculated, and the text blocks are sorted in descending order of semantic similarity. The top-ranked text blocks, such as the Top-K3 (e.g., K3 = 10), are selected to form the candidate text block set S_ranked.
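The merge, deduplication, and dense re-ranking steps described above can be sketched as follows. This is only an illustration under stated assumptions: each text block is a dictionary with the File_ID and Line_No fields named in the text plus a hypothetical "dense" field holding its dense vector, and the two-dimensional vectors are toy values rather than real BERT or RoBERTa embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_dedup(s_bm25, s_sparse):
    """Union of both recall sets, keeping one block per (File_ID, Line_No)."""
    seen = {}
    for block in [*s_bm25, *s_sparse]:
        seen.setdefault((block["File_ID"], block["Line_No"]), block)
    return list(seen.values())


def dense_rerank(query_vec, candidates, k3=10):
    """Sort candidates by cosine similarity, descending, and keep the top k3."""
    scored = [(cosine(query_vec, b["dense"]), b) for b in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k3]


# Toy recall sets; block F1/12 was hit by both retrieval paths.
s_bm25 = [
    {"File_ID": "F1", "Line_No": 12, "dense": [1.0, 0.0]},
    {"File_ID": "F1", "Line_No": 40, "dense": [0.0, 1.0]},
]
s_sparse = [
    {"File_ID": "F1", "Line_No": 12, "dense": [1.0, 0.0]},
    {"File_ID": "F2", "Line_No": 3, "dense": [0.6, 0.8]},
]
s_candidate = merge_dedup(s_bm25, s_sparse)
s_ranked = dense_rerank([1.0, 0.0], s_candidate, k3=2)
```

Deduplicating before scoring means the block hit by both retrieval paths is scored only once, which is the resource saving the paragraph describes.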
[0051] However, relying solely on semantic similarity between vectors is sometimes insufficient in judicial scenarios, as legal documents often contain numerous paragraphs with similar expressions but entirely different legal effects. Therefore, the system further filters the candidate text block set S_ranked based on at least one of the following: the case type corresponding to the target file, the analysis type of the corresponding source statement, and keyword overlap, to obtain a more precise target text block set S_filtered.
[0052] The case-type-based screening process is as follows: different types of cases focus on different aspects of evidence. For example, if the target case file is a contract dispute, the system will prioritize retaining text blocks involving contract terms, performance records, transfer vouchers, etc., while reducing the weight of text blocks about tort liability descriptions, tort law provisions, etc., or removing them directly.
[0053] The filtering process based on analysis type is as follows: the system identifies the analysis type of the corresponding source statement. For example, if the source statement is an analysis about "the focus of the dispute - the authenticity of the evidence", the system retains only those text blocks in the candidate text block set S_ranked that belong to cross-examination opinions, court transcripts, etc., and excludes text blocks that belong to the basic information of the parties, litigation claims, etc., because the latter usually do not contain the logic of evidence cross-examination.
[0054] The keyword overlap-based filtering process is as follows: the system calculates the keyword overlap, particularly the overlap of legal terms, between the corresponding source statement and each candidate text block in the candidate text block set S_ranked, and removes text blocks whose overlap is below a set threshold. Specifically, while dense vectors can capture semantics, if a text block considered semantically similar does not overlap with the corresponding source statement on key legal terms such as breach of contract and jurisdiction, it is likely a false recall caused by semantic drift. In this case, the system sets an overlap threshold, such as 0.1, and removes text blocks below this threshold to ensure that the retained text blocks of the target case file do contain substantive vocabulary sufficient to support the analysis conclusions.
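A minimal sketch of the keyword overlap filter follows. The text does not define how overlap is computed, so this sketch assumes it is the fraction of the source statement's legal terms that also occur in the text block. The small LEGAL_TERMS lexicon and the sample sentences are hypothetical. The 0.1 threshold is the example value from the text.

```python
# Hypothetical legal-term lexicon; a real system would use a curated dictionary.
LEGAL_TERMS = {"breach of contract", "jurisdiction", "liquidated damages"}


def legal_terms_in(text, lexicon=LEGAL_TERMS):
    """Legal terms from the lexicon that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return {term for term in lexicon if term in lowered}


def overlap_ratio(statement, block_text):
    """Share of the statement's legal terms that also appear in the block."""
    stmt_terms = legal_terms_in(statement)
    if not stmt_terms:
        return 1.0  # assumption: with no legal terms to check, keep the block
    return len(stmt_terms & legal_terms_in(block_text)) / len(stmt_terms)


def filter_by_overlap(statement, blocks, threshold=0.1):
    """Drop blocks whose legal-term overlap with the statement is below threshold."""
    return [b for b in blocks if overlap_ratio(statement, b) >= threshold]


statement = "The defendant is liable for breach of contract and owes liquidated damages."
s_ranked = [
    "Clause 8: liquidated damages accrue daily after a breach of contract.",
    "The weather on the delivery date was rainy.",
]
s_filtered = filter_by_overlap(statement, s_ranked)
```

The second block shares no legal term with the statement, so it is removed as a likely false recall even if a dense model had scored it as similar.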
[0055] Finally, after the multi-dimensional screening process described above, the system obtains a high-purity set of text blocks, namely the target text block set S_filtered. The system determines the original text block corresponding to each tracing statement from this set by setting a score threshold or selecting the top-ranked block. These original text blocks are the most credible tracing evidence identified by the system.
[0056] In this embodiment of the invention, a judicial business logic filtering mechanism is incorporated into the vector retrieval. By filtering based on dimensions such as case type, analysis type, and keyword overlap, the problem of semantic relevance but logical errors that may arise from pure vector retrieval, such as misjudging the plaintiff's claim as the court's finding, is effectively solved. This filtering strategy, which combines judicial knowledge, can significantly improve the legal professionalism and logical accuracy of the tracing results, ensuring that the evidence presented to the user is truly relevant and effective.
[0057] Based on the above embodiments, the original text block corresponding to each tracing statement is determined based on the target text block set, including: Based on the matching scores of each text block in the target text block set during text matching, sparse matching, and dense matching, determine the text matching score set, sparse matching score set, and dense matching score set corresponding to each source statement; The scores of the text matching score set, sparse matching score set and dense matching score set are normalized respectively. Based on the case type, the score weights of the normalized text matching score set, sparse matching score set and dense matching score set are determined. Based on the score weights, the normalized text matching score set, sparse matching score set, and dense matching score set are fused to obtain the comprehensive score set corresponding to each source statement; based on the comprehensive score set corresponding to each source statement and the target text block set, the original text block corresponding to each source statement is determined.
[0058] Specifically, the process of determining the original text block corresponding to each source statement based on the target text block set may include: First, for each text block in the target text block set S_filtered, the system reviews its performance in each previous retrieval stage and extracts three key scores: the score for text matching (e.g., BM25 score), the score for sparse matching (e.g., SPLADE inner product), and the score for dense matching (e.g., cosine similarity). Then, for each score dimension, the system aggregates the scores of all text blocks in the target text block set S_filtered to obtain the score set for that dimension, namely the text matching score set, the sparse matching score set, and the dense matching score set.
[0059] Subsequently, the scores obtained by different retrieval / matching methods differ greatly in scale and value range. For example, BM25 scores have no fixed upper bound, while cosine similarity is usually between 0 and 1, so directly adding them would let the dimension with the larger scale dominate the result. Therefore, the system first normalizes each of the three score sets separately, for example using Min-Max normalization, x' = (x - min) / (max - min), to map all scores within the same score set to the standard interval [0, 1], thus eliminating the difference in scale.
[0060] Next, the system introduces a dynamic weighting mechanism to adapt to judicial scenarios. That is, the system determines the scoring weights for the three dimensions based on the case type corresponding to the target case file. For example, in contract disputes, the focus of the trial is often on details such as specific amounts, dates, and breach of contract clauses. This information requires extremely high literal accuracy. Therefore, the system automatically increases the scoring weight for the text matching dimension, such as assigning a weight of 0.5 to the text matching score set. Conversely, in tort disputes or cases involving factual statements, the focus is on the semantic description of the facts, and the literal wording may vary. In this case, the system automatically increases the scoring weight for the dense matching dimension, such as assigning a weight of 0.6 to the dense matching score set to capture semantic equivalence. In this way, the system determines the proportion that each of the three normalized score sets should occupy under the current case type, i.e., the scoring weight.
[0061] Finally, the system performs a weighted summation of the three normalized score sets based on their respective score weights to determine the overall (comprehensive) score of each text block in the target text block set S_filtered, and the overall scores of all text blocks constitute the comprehensive score set. Specifically, for the same text block, its scores in the three normalized score sets are weighted and summed to obtain its overall score. This overall score is the final evaluation value of each text block after incorporating literal accuracy, extended relevance, and deep semantic similarity, and has been optimized for judicial scenarios.
[0062] The system can then sort the text blocks according to their comprehensive scores in the comprehensive score set and select the one or more text blocks with the highest comprehensive score (for example, the Top-1 block) as the original text blocks corresponding to the source statement.
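The normalization and case-type-dependent fusion described above can be sketched as follows. The weights 0.5 (text matching, contract disputes) and 0.6 (dense matching, tort disputes) are the example values from the text. The remaining weights, the case-type keys, the block identifiers, and the raw scores are assumptions chosen only so that each weight set sums to 1 and the example runs.

```python
def min_max(scores):
    """Map scores to [0, 1]; a constant score set maps to all zeros (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Weights per case type. Only 0.5 (contract/text) and 0.6 (tort/dense) come
# from the text; the other values are illustrative assumptions.
CASE_WEIGHTS = {
    "contract": {"text": 0.5, "sparse": 0.2, "dense": 0.3},
    "tort": {"text": 0.2, "sparse": 0.2, "dense": 0.6},
}


def fuse(block_ids, text, sparse, dense, case_type):
    """Return (block_id, comprehensive score) pairs, best first."""
    weights = CASE_WEIGHTS[case_type]
    norm = {"text": min_max(text), "sparse": min_max(sparse), "dense": min_max(dense)}
    fused = {
        bid: sum(weights[dim] * norm[dim][i] for dim in weights)
        for i, bid in enumerate(block_ids)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


block_ids = ["F1:12", "F1:40", "F2:03"]
bm25_scores = [12.0, 3.0, 6.0]      # unbounded scale
splade_scores = [3.0, 1.0, 2.0]     # inner products
cosine_scores = [0.70, 0.40, 0.90]  # bounded scale

top_contract = fuse(block_ids, bm25_scores, splade_scores, cosine_scores, "contract")[0][0]
top_tort = fuse(block_ids, bm25_scores, splade_scores, cosine_scores, "tort")[0][0]
```

With these toy scores, the literally closest block F1:12 wins under the contract weights, while the semantically closest block F2:03 wins under the tort weights, which shows how the same candidates can be ranked differently depending on case type.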
[0063] In this embodiment of the invention, normalization processing solves the problem of inconsistent scoring dimensions among different retrieval algorithms, making hybrid calculation possible. More importantly, when integrating multiple dimensions, the rigid mode of fixed weights is abandoned, and instead, the scoring weights of each dimension are dynamically adjusted according to the case type. This allows the system to not only focus on the words but also understand the semantics, thereby significantly improving the adaptability and accuracy of the source tracing results in different judicial business scenarios and ensuring that the original text block found is the evidence that best fits the current case trial logic.
[0064] Based on the above embodiments, the mapping relationship includes a single-level mapping relationship or a multi-level mapping relationship; In step 130, the mapping relationship between each source statement and its corresponding original text block is constructed, including: Determine the source of each source statement, including direct generation from the target dossier and indirect generation from intermediate conclusions; If any source statement is generated directly from the target dossier, a first-level mapping relationship is constructed. The first-level mapping relationship is a key-value storage structure with the line and sentence identifier of the analysis result corresponding to the source statement as the key and the text content and line-level positioning identifier of the corresponding original text block as the value. If any source statement is generated indirectly from an intermediate conclusion, a multi-level mapping relationship is constructed. The multi-level mapping relationship is a source tracing link that starts with the line marker of the analysis result corresponding to the source statement and ends with the line marker of the intermediate conclusion that generated the source statement.
[0065] Specifically, in actual judicial case file analysis, the analysis results of the model are often hierarchical. Some conclusions are directly derived from the original text of the target case file, while others are further inductively derived from previous inferences. To fully record this logical chain, this embodiment of the invention proposes a hierarchical mapping mechanism, as follows: First, the system needs to determine the source of each source statement. This process requires combining metadata from the generation phase or the thought chain information of the LLM. Specifically, the system divides the generation sources into two categories: direct generation from the target file and indirect generation from intermediate conclusions. Direct generation from the target file means that the corresponding source statement is a direct extraction, summary, or translation of a specific text within the target file. For example, extracting the signing date as January 1, 2023 from a contract. Indirect generation from intermediate conclusions means that the corresponding source statement does not directly correspond to the original text of the target file, but is derived from further reasoning based on one or more intermediate conclusions. For example, first, the intermediate conclusion of "overdue behavior" is derived, and then the analytical conclusion that "the defendant should pay liquidated damages" is derived based on this intermediate conclusion.
[0066] Furthermore, if any source statement is directly generated from the target case file, it indicates that its source of evidence is singular and direct. In this case, the system can construct a first-level mapping relationship between the source statement and its corresponding original text block. This mapping relationship is a flat, direct index structure.
[0067] Specifically, the system employs a highly efficient key-value storage structure. Among these: Key: The line and sentence identifier of the source statement in the analysis results; Value: The text content of the original text block corresponding to the source statement, the line-level positioning identifier, such as File_ID, Page_No, Line_No, Sentence_No, etc., and the overall score.
[0068] With this key-value storage structure, when a user clicks on a sentence of a generated analysis conclusion, the system does not need to perform graph traversal. Instead, it reads the value directly through the key, thereby immediately locating the specific file, page number, line number, and sentence in the target case file.
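A first-level mapping of this kind could be held in an ordinary dictionary, as in the sketch below. The field names follow the identifiers listed in the text (File_ID, Page_No, Line_No, Sentence_No, overall score). The class name, the helper functions, the "R3-S1" key format, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceLocation:
    """Value of the first-level mapping: original text plus line-level locator."""
    text: str
    file_id: str
    page_no: int
    line_no: int
    sentence_no: int
    score: float


# Key: line and sentence identifier of the source statement in the analysis result.
first_level: dict[str, SourceLocation] = {}


def register_direct(statement_id: str, location: SourceLocation) -> None:
    """Store the mapping for a statement generated directly from the case file."""
    first_level[statement_id] = location


def locate(statement_id: str) -> Optional[SourceLocation]:
    """Single dictionary lookup; no graph traversal is needed."""
    return first_level.get(statement_id)


register_direct(
    "R3-S1",  # hypothetical identifier: row 3, sentence 1 of the analysis result
    SourceLocation(
        text="This contract is signed on January 1, 2023.",
        file_id="F1", page_no=2, line_no=5, sentence_no=1, score=0.91,
    ),
)
hit = locate("R3-S1")
```

Because the lookup is a single hash access, the response time does not depend on how many statements the analysis result contains.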
[0069] Conversely, if any source statement is indirectly generated from an intermediate conclusion, it indicates the existence of a reasoning chain behind it. A simple direct mapping cannot explain its logical origin. Therefore, the system will construct a multi-level mapping relationship between the source statement and its corresponding original text block. This relationship is essentially a directed graph or linked list structure.
[0070] Specifically, the system records the source tracing link, including the start and end points. That is, for analytical conclusions involving multi-level reasoning, such as second-level reasoning, the second-level mapping relationship (source tracing link) includes: Starting point: The line and sentence identifier of the source statement in the analysis results; Endpoint: The line and sentence identifier of the intermediate conclusion that generated the source statement in the analysis results.
[0071] For example, it could be "the line identifier of the source statement - File_ID (the file identifier of the file to which the original text block corresponding to the intermediate conclusion belongs) + the line identifier of the intermediate conclusion".
[0072] Furthermore, its first-level mapping relationship can be a key-value storage structure. That is: Key: The line and sentence identifier of the intermediate conclusion in the analysis results; Values: The text content of the original text block corresponding to the intermediate conclusion, line-level positioning identifiers such as File_ID, Page_No, Line_No, Sentence_No, etc., and the overall score.
[0073] For example, when dealing with the tracing statement "the defendant should pay liquidated damages", the system does not directly store the corresponding original case file text. Instead, it first stores the tracing chain link that points to the intermediate conclusion "there is overdue behavior". The intermediate conclusion "there is overdue behavior" then points to the original evidence such as the specific delivery note date and payment note date through a first-level mapping relationship.
[0074] In this way, the system completely preserves the logical path from the final conclusion to the intermediate conclusion and then to the original evidence (original text block).
[0075] In this embodiment of the invention, the problem of tracing the source in complex judicial reasoning tasks is solved by distinguishing between first-level mapping and multi-level mapping. For simple extraction tasks, first-level mapping ensures response speed and efficient storage; while for complex inductive reasoning tasks, multi-level mapping fully records the thinking path of the model. This hierarchical design not only allows users to see where the evidence is, but also to understand how the conclusion is derived, thereby greatly enhancing the logical interpretability of the analysis results and filling the gap in the existing technology for tracing the source of multi-step reasoning.
[0076] Based on the above embodiments, in step 130, based on the mapping relationship and the corresponding row-level location identifier, a file analysis result with row-level tracing capability is generated for the target file, including: Responds to the tracing operation for any tracing statement in the analysis results; If the mapping relationship of the source statement is a multi-level mapping relationship, then search along the source link level by level until the target source statement with a first-level mapping relationship is found. Based on the line-level positioning identifier of the original text block corresponding to the target source statement, a file analysis result with line-level tracing capability is generated for the target file. If the mapping relationship is a first-level mapping relationship, then based on the line-level positioning identifier of the original text block corresponding to the tracing statement, a file analysis result with line-level tracing capability for the target file is generated.
[0077] Specifically, the process of generating a file analysis result with row-level tracing capability for the target file based on the mapping relationship and the corresponding row-level location identifier may include: First, when the system displays the analysis results on the front-end interface, each analysis conclusion is rendered as an interactive element, such as underlined hyperlink text or text segments with a source tracing icon on the side. The system is in real-time monitoring mode to respond to source tracing operations on the analysis conclusions, such as clicks, checkboxes / swipes, etc. For example, when a judge is reading the case summary, if they click on the analysis conclusion "The defendant unilaterally stopped work in May 2023," the system will immediately capture the click event and extract the unique identifier of that analysis conclusion, namely the sentence identifier.
[0078] The system will then first query the mapping relationship behind the analysis conclusion. If the mapping relationship of the analysis conclusion (source statement) is a multi-level mapping relationship, it means that this is a conclusion obtained through multi-level reasoning. The direct evidence may not be directly attached to the original text of the case file, but rather to the intermediate conclusion upstream.
[0079] At this point, the system initiates a chain-based tracing mechanism, which reads the source chain of the stored analysis conclusion and searches step by step along the source chain from the analysis conclusion. First, it finds the intermediate conclusion that the analysis conclusion depends on, such as "the defendant has breached the contract". Then, it checks whether the intermediate conclusion has a direct original text mapping, i.e., a first-level mapping relationship. If not, it continues to trace upwards until it reaches the intermediate conclusion with a first-level mapping relationship, i.e., the target source statement. This means that the source of the logical chain has been found, i.e., the node that directly references the original text of the case file.
[0080] Once the source node, i.e., the target tracing statement, is found, the system extracts the line-level location identifier of its corresponding original text block and generates the final tracing result based on this, which is a complete link containing the entire tracing path (each node searched level by level). Further, the system injects this tracing result into the analysis result, performs data augmentation and encapsulation, and generates the final dossier analysis result with line-level tracing capabilities. In terms of interface presentation, this might manifest as follows: when the user clicks on analysis conclusion A, the system automatically displays intermediate conclusion B supporting analysis conclusion A, and highlights the original text block directly upon which intermediate conclusion B is based in the target dossier, such as highlighting its text content according to its line-level location identifier.
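The level-by-level search along the source tracing link can be sketched as follows. This is an illustration under simplifying assumptions: each indirectly generated statement links to exactly one upstream intermediate conclusion, identifiers such as "R7-S1" are hypothetical, and the cycle check is a defensive addition not mentioned in the text.

```python
def resolve(statement_id, first_level, links):
    """Follow source tracing links until a statement with a first-level mapping
    is reached. Returns (path of visited statement ids, line-level locator)."""
    path = [statement_id]
    seen = {statement_id}
    current = statement_id
    while current not in first_level:
        upstream = links.get(current)
        if upstream is None:
            raise KeyError(f"no mapping recorded for {current!r}")
        if upstream in seen:
            raise ValueError(f"cycle in tracing links at {upstream!r}")
        seen.add(upstream)
        path.append(upstream)
        current = upstream
    return path, first_level[current]


# First-level mapping: intermediate conclusion -> original text block locator.
first_level = {
    "R2-S1": {"File_ID": "F1", "Page_No": 4, "Line_No": 17, "Sentence_No": 2},
}
# Multi-level mapping: start point -> end point (the upstream conclusion).
links = {
    "R7-S1": "R5-S1",  # e.g. "should pay liquidated damages" -> "has breached the contract"
    "R5-S1": "R2-S1",  # e.g. "has breached the contract" -> "there is overdue behavior"
}

path, locator = resolve("R7-S1", first_level, links)
direct_path, _ = resolve("R2-S1", first_level, links)
```

The returned path holds every node visited, which is what lets the interface show intermediate conclusions alongside the highlighted original text. A statement with a first-level mapping resolves immediately with a path of length one.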
[0081] Conversely, if the mapping relationship of the analysis conclusion (source statement) is a first-level mapping relationship, the processing logic is more direct and efficient. In this case, the system does not need to perform multi-level jumps, but directly reads the key-value storage structure of the analysis conclusion to obtain its Value, that is, the line-level positioning identifier of the original text block corresponding to the analysis conclusion. The system then uses this line-level positioning identifier, such as File_ID, Page_No, Line_No, Sentence_No, etc., to generate source tracing results, such as visual control commands, and inserts them into the analysis results. This generates the final file analysis results with line-level source tracing capabilities. Thus, when the user clicks on the analysis conclusion, the corresponding file can be directly loaded based on the control command and the user can jump to the specified page number. At the corresponding position, such as the line number or the contained statement, a highlight box is drawn to highlight the text content.
[0082] In this embodiment of the invention, the complex data structure in the background is transformed into a smooth user experience in the front end through differentiated source tracing response logic. For first-level mapping, millisecond-level pinpointing is achieved, greatly satisfying the high-frequency verification needs. For multi-level mapping, an automated tracing mechanism allows users to cross logical levels and reach the lowest level of evidence source. This not only ensures the accuracy of source tracing but also endows the system with the dynamic interactive capability to handle complex judicial logic, enabling seamless integration of logical reasoning and evidence display on the user interface.
[0083] Based on the above embodiments, step 110, generating the analysis results of the target file based on the analysis instructions, includes: Based on the analysis instructions, a structured query task is generated, and the structured query task is decomposed into multiple sub-tasks with logical dependencies. Based on the logical dependencies between multiple subtasks, the multiple subtasks are assembled and their execution is ordered to obtain a task execution sequence; Based on the task execution sequence and the task complexity of each subtask, the tasks are executed sequentially to obtain the analysis results of the target dossier.
[0084] Specifically, for complex legal scenarios, such as cross-document analysis and multi-step reasoning, simple single-question-and-answer sessions are often ineffective. Therefore, this embodiment of the invention introduces a dynamic task planning mechanism. Figure 3 is a flowchart illustrating the task planning process provided by the present invention. As shown in Figure 3, the process of analyzing a target file based on analysis instructions to obtain analysis results can be planned as the following tasks: First, after receiving an analysis instruction for the target case file, such as "Please extract the plea analysis of this case," the system first interprets the instruction. This may involve semantic parsing using an LLM, rule engines, or similar methods, and then retrieving a pre-built judicial business knowledge base using RAG technology to transform the instruction into a structured query task that the computer can understand. For example, "Please extract the plea analysis of this case" is transformed into a structured query task for "Plea Analysis."
[0085] Subsequently, the system will break down the structured query task according to judicial business logic (such as the trial logic chain), because a grand judicial issue is often composed of multiple basic factual issues, so it will break it down into multiple sub-tasks with logical dependencies.
[0086] Specifically, the system can first search the judicial business knowledge base to obtain business processing templates or sub-task rules that match the structured query task. These often contain business logic for the task. Therefore, the task can be decomposed based on this to obtain multiple sub-tasks with logical dependencies.
[0087] For example, for the structured query task of "analyzing liability for breach of contract" (the analysis instruction is "Please analyze the liability for breach of contract regarding the delivery of the house in this case"), the system will break it down into the following sub-tasks: Subtask A: Extract the clauses in the contract regarding the delivery time of the house.
[0088] Subtask B: Extract evidence records of the actual delivery date.
[0089] Subtask C: Compare the agreed time with the actual time to determine if it is overdue.
[0090] Subtask D: Based on the fact that delivery was overdue, find the calculation standard for liquidated damages in the contract.
[0091] There are clear dependencies between these subtasks; for example, A and B must be done before C can be done, and D can only be done after C is completed.
[0092] Subsequently, since the resulting subtasks are often scattered, the system needs to assemble and sort them based on the logical dependencies between these subtasks to obtain a task execution sequence. Specifically, for each subtask, the system first searches the source knowledge base to obtain text blocks strongly associated with the subtask, and then assembles these text blocks with business processing templates or subtask rules obtained from the judicial business knowledge base to obtain complete task rules for each subtask. Simultaneously, the system constructs a directed acyclic graph or task list based on logical dependencies and judicial business logic to clarify the execution order of each subtask, such as arranging the above subtasks into an ABCD task execution sequence. This ensures that the model's thought process conforms to the logical rigor of legal deduction and avoids the logical fallacy of drawing conclusions without ascertaining the facts.
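Ordering subtasks by their logical dependencies amounts to a topological sort of a directed acyclic graph. A minimal sketch using Python's standard graphlib module is shown below. The dependency table encodes the A, B, C, D example from the text, and the function name plan is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def plan(dependencies):
    """Return a task execution sequence in which every subtask comes after
    all subtasks it depends on. Raises CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


# Each subtask maps to the set of subtasks that must finish first.
dependencies = {
    "A": set(),        # extract agreed delivery time
    "B": set(),        # extract actual delivery date
    "C": {"A", "B"},   # compare the two dates
    "D": {"C"},        # look up liquidated damages standard
}
order = plan(dependencies)

# A circular dependency cannot be scheduled and is reported as an error.
try:
    plan({"X": {"Y"}, "Y": {"X"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Subtasks A and B have no dependency on each other, so either may come first; the sort only guarantees that C follows both and that D follows C.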
[0093] Once the task execution sequence is determined, the system begins executing the tasks sequentially. During this process, the system incorporates an intelligent resource scheduling strategy, considering the complexity of each subtask. For simple subtasks, such as "extracting the name of the person in question," the system may call a model with fewer parameters or directly use a rule-matching algorithm to complete the task quickly, thus saving computational resources.
[0094] For complex subtasks, such as "determining whether there is a force majeure event", the system will call models with larger parameters and stronger reasoning capabilities, such as LLM, or even call external legal and regulatory databases for assistance to ensure the depth of analysis.
[0095] Specifically, during task execution, for each subtask, task instructions are first generated based on its corresponding complete task rules to obtain standardized task instructions that the model can understand. Then, these instructions are fed to the model so that it can perform analysis and output the corresponding analysis results. It is worth noting here that when executing intermediate tasks, the analysis results of subtasks with logical dependencies can also be used as input to the current subtask. For example, when executing subtask C, the analysis results of subtasks A and B can be used as input to the model executing subtask C.
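The sequential execution with complexity-based model selection and upstream results as input can be sketched as follows. The two model functions are stand-ins that only echo their prompt, since the text does not name concrete models. The task table, the "simple"/"complex" labels, and the prompt layout are likewise assumptions made for illustration.

```python
def small_model(prompt):
    """Stand-in for a lightweight model or rule-matching algorithm."""
    return f"[small] {prompt}"


def large_model(prompt):
    """Stand-in for a larger model with stronger reasoning capability."""
    return f"[large] {prompt}"


def pick_model(complexity):
    """Route complex subtasks to the large model, everything else to the small one."""
    return large_model if complexity == "complex" else small_model


def run(order, tasks, dependencies):
    """Execute subtasks in order, feeding each one the results it depends on."""
    results = {}
    for task_id in order:
        task = tasks[task_id]
        upstream = {dep: results[dep] for dep in sorted(dependencies.get(task_id, ()))}
        prompt = task["rule"]
        if upstream:
            prompt += " | inputs: " + "; ".join(f"{k}={v}" for k, v in upstream.items())
        results[task_id] = pick_model(task["complexity"])(prompt)
    return results


tasks = {
    "A": {"rule": "Extract the agreed delivery time", "complexity": "simple"},
    "B": {"rule": "Extract the actual delivery date", "complexity": "simple"},
    "C": {"rule": "Decide whether delivery was overdue", "complexity": "complex"},
}
dependencies = {"A": set(), "B": set(), "C": {"A", "B"}}
results = run(["A", "B", "C"], tasks, dependencies)
```

Subtask C receives the outputs of A and B in its prompt and is routed to the larger model, while the two extraction subtasks are handled by the cheaper path.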
[0096] Finally, the system summarizes the analysis results of each subtask and synthesizes them to generate a complete and logically consistent analysis result for the target dossier.
[0097] In this embodiment of the invention, by breaking down ambiguous analysis instructions into logically rigorous sub-task chains, not only is the problem of model attention dispersion in long text and multi-document scenarios solved, but the analysis process is also ensured to conform to judicial business logic. At the same time, the resource scheduling strategy based on complexity effectively balances the system response speed and analysis depth, enabling the system to maintain high efficiency and accuracy when handling complex and difficult cases.
[0098] The file analysis device provided by the present invention is described below. The file analysis device described below and the file analysis method described above may be read with reference to each other.
[0099] Figure 4 is a schematic structural diagram of the file analysis device provided by the present invention. As shown in Figure 4, the device includes: an analysis unit 410, configured to obtain analysis instructions for the target file and generate analysis results for the target file based on the analysis instructions; a retrieval unit 420, configured to retrieve the source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results, where the source tracing knowledge base contains multiple text blocks in the target file, as well as the text content and line-level positioning identifier of each text block; and a tracing unit 430, configured to construct a mapping relationship between each source tracing statement and its corresponding original text block and, based on the mapping relationship and the corresponding line-level positioning identifier, generate a file analysis result with line-level tracing capability for the target file.
[0100] The file analysis device provided by this invention constructs a source tracing knowledge base containing line-level positioning identifiers and decomposes the analysis results into source tracing statements, which are used to retrieve the knowledge base in reverse. This establishes a mapping relationship between the generated content and the original case file that is accurate to the line and sentence level, or even the coordinate level. It thereby addresses the black-box problem in current applications of large models in the judicial field, making every generated analysis conclusion verifiable and traceable. With the line-level positioning identifiers, users can jump directly to the original source by clicking on an analysis result, without manually searching through large volumes of case files. This improves the efficiency and accuracy of judicial personnel in verifying evidence and reviewing case files, and enhances the interpretability and credibility of the analysis results.
[0101] Based on the above embodiments, the source tracing knowledge base also includes the text vector corresponding to the text content of each text block, and the retrieval unit 420 is configured to: segment each source tracing statement into words to obtain a source tracing word segmentation set, and match the source tracing knowledge base based on that set to obtain a first text block set; sparsely encode each source tracing statement to obtain a source tracing sparse vector, and sparsely match that vector against the text vector of each text block in the source tracing knowledge base to obtain a second text block set; and densely encode each source tracing statement to obtain a source tracing dense vector, and densely match that vector against the text vectors of the text blocks in the first text block set and in the second text block set to obtain the original text block corresponding to each source tracing statement.
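The three-stage retrieval can be sketched with toy encoders. Everything here is an assumption for illustration: whitespace tokenization stands in for word segmentation, term-frequency vectors stand in for the sparse encoder, and a hashed character-bigram vector stands in for a learned dense encoder. Vectors are also computed on the fly, whereas the embodiment stores them in the knowledge base.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()  # stand-in for a real word segmenter

def sparse_encode(text):
    return Counter(tokenize(text))  # toy sparse vector: term frequencies

def dense_encode(text, dim=64):
    """Toy dense vector: hashed character-bigram counts, L2-normalized."""
    vec = [0.0] * dim
    text = text.lower()
    for i in range(len(text) - 1):
        vec[(ord(text[i]) * 31 + ord(text[i + 1])) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sparse_score(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dense_score(a, b):
    return sum(x * y for x, y in zip(a, b))  # cosine, since both are normalized

def retrieve(statement, blocks, top_k=2):
    """Return the block that best matches a source tracing statement, or None."""
    tokens = set(tokenize(statement))
    # Stage 1: word-level matching gives the first text block set.
    first = [b for b in blocks if tokens & set(tokenize(b["text"]))]
    # Stage 2: sparse matching over the whole knowledge base gives the second set.
    s_vec = sparse_encode(statement)
    ranked = sorted(blocks, reverse=True,
                    key=lambda b: sparse_score(s_vec, sparse_encode(b["text"])))
    second = ranked[:top_k]
    # Stage 3: dense matching over both sets selects the original text block.
    candidates = {b["line_id"]: b for b in first + second}
    if not candidates:
        return None
    d_vec = dense_encode(statement)
    return max(candidates.values(),
               key=lambda b: dense_score(d_vec, dense_encode(b["text"])))

blocks = [
    {"line_id": "p3-l12", "text": "the defendant signed the loan contract on 5 March"},
    {"line_id": "p7-l04", "text": "the plaintiff transferred 50000 yuan by bank"},
    {"line_id": "p9-l21", "text": "the witness saw nothing unusual"},
]
best = retrieve("the defendant signed the loan contract", blocks)
print(best["line_id"] if best else None)
```

The `line_id` values are hypothetical line-level positioning identifiers; the embodiment does not prescribe their format.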
[0102] Based on the above embodiments, the retrieval unit 420 is configured to: merge and deduplicate the first text block set and the second text block set, and densely match the source tracing dense vector against the text vector of each text block in the merged and deduplicated set to obtain the candidate text block set corresponding to each source tracing statement; filter the candidate text blocks based on at least one of the case type corresponding to the target case file, the analysis type to which the corresponding source tracing statement belongs, and the keyword overlap between the corresponding source tracing statement and each candidate text block in the candidate text block set, to obtain the target text block set corresponding to each source tracing statement; and determine, based on the target text block set, the original text block corresponding to each source tracing statement.
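The merge, deduplication and filtering steps might look like the following sketch. The dense ranking step is omitted for brevity, and the overlap measure, the `min_overlap` threshold and the optional `case_type` field on a block are illustrative assumptions.

```python
def merge_and_deduplicate(first_set, second_set):
    """Union of both sets, keeping one copy per line-level identifier."""
    merged = {}
    for block in first_set + second_set:
        merged.setdefault(block["line_id"], block)
    return list(merged.values())

def keyword_overlap(statement, block_text):
    """Fraction of the statement's words that also occur in the block."""
    s = set(statement.lower().split())
    b = set(block_text.lower().split())
    return len(s & b) / len(s) if s else 0.0

def filter_candidates(statement, candidates, case_type=None, min_overlap=0.5):
    """Keep candidates that fit the case type (if given) and overlap enough."""
    kept = []
    for block in candidates:
        if case_type is not None and block.get("case_type") not in (None, case_type):
            continue  # block is tagged with a different case type
        if keyword_overlap(statement, block["text"]) >= min_overlap:
            kept.append(block)
    return kept

a = {"line_id": "l1", "text": "defendant signed loan contract"}
b = {"line_id": "l2", "text": "plaintiff transferred money"}
c = {"line_id": "l3", "text": "weather was sunny"}
merged = merge_and_deduplicate([a, b], [b, c])
targets = filter_candidates("defendant signed contract", merged)
print([blk["line_id"] for blk in targets])  # prints ['l1']
```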
[0103] Based on the above embodiments, the retrieval unit 420 is configured to: determine, from the matching scores obtained by each text block in the target text block set in text matching, sparse matching and dense matching, the text matching score set, sparse matching score set and dense matching score set corresponding to each source tracing statement; normalize the scores in each of the three score sets separately; determine, based on the case type, the score weights of the normalized text matching score set, sparse matching score set and dense matching score set; fuse the three normalized score sets according to the score weights to obtain the comprehensive score set corresponding to each source tracing statement; and determine the original text block corresponding to each source tracing statement based on that comprehensive score set and the target text block set.
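As a worked illustration of the normalization and weighted fusion, the sketch below uses min-max normalization and a per-case-type weight table. Both choices are assumptions: the text does not name a normalization method, and the weight values and case type labels are invented for the example.

```python
def min_max_normalize(scores):
    """Scale a {block_id: score} mapping to [0, 1]; constant sets map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Hypothetical weights per case type; real values would be tuned or configured.
CASE_TYPE_WEIGHTS = {
    "civil":    {"text": 0.2, "sparse": 0.3, "dense": 0.5},
    "criminal": {"text": 0.4, "sparse": 0.3, "dense": 0.3},
}

def fuse_scores(text_scores, sparse_scores, dense_scores, case_type):
    """Weighted sum of the three normalized score sets for one statement."""
    w = CASE_TYPE_WEIGHTS[case_type]
    t = min_max_normalize(text_scores)
    s = min_max_normalize(sparse_scores)
    d = min_max_normalize(dense_scores)
    return {bid: w["text"] * t[bid] + w["sparse"] * s[bid] + w["dense"] * d[bid]
            for bid in t}

def pick_original_block(fused):
    """The block with the highest comprehensive score is taken as the source."""
    return max(fused, key=fused.get)

fused = fuse_scores(
    text_scores={"b1": 3.0, "b2": 1.0},
    sparse_scores={"b1": 0.2, "b2": 0.8},
    dense_scores={"b1": 0.9, "b2": 0.1},
    case_type="civil",
)
print(pick_original_block(fused))  # prints "b1"
```

Normalizing each score set separately matters because raw text-match counts and cosine similarities live on different scales; without it, one channel would dominate the weighted sum regardless of the weights.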
[0104] Based on the above embodiments, the mapping relationship includes a first-level mapping relationship or a multi-level mapping relationship, and the tracing unit 430 is configured to: determine the generation source of each source tracing statement, the generation source being either direct generation from the target file or indirect generation from an intermediate conclusion; if the generation source of any source tracing statement is direct generation from the target file, construct the first-level mapping relationship, which is a key-value storage structure whose key is the line and sentence identifier of that source tracing statement in the analysis results and whose value is the text content and line-level positioning identifier of the corresponding original text block; and if the generation source of any source tracing statement is indirect generation from an intermediate conclusion, construct the multi-level mapping relationship, which is a tracing link that starts at the line marker of that source tracing statement in the analysis results and ends at the line marker of the intermediate conclusion from which the statement was generated.
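One possible in-memory shape for the two kinds of mapping is sketched below. The field names (`level`, `block_text`, `line_locator`, `derived_from`), the identifier formats and the input record layout are hypothetical; only the key-value idea and the link to an intermediate conclusion come from the text.

```python
def build_mappings(statements):
    """Build a mapping entry per source tracing statement.

    Each input record carries an 'id' (its identifier in the analysis results),
    a 'source' of 'direct' or 'indirect', and either the matched original
    'block' or the 'intermediate_id' of the conclusion it was derived from.
    """
    mappings = {}
    for stmt in statements:
        if stmt["source"] == "direct":
            # First-level mapping: statement identifier -> original text + locator.
            mappings[stmt["id"]] = {
                "level": "first",
                "block_text": stmt["block"]["text"],
                "line_locator": stmt["block"]["line_id"],
            }
        else:
            # Multi-level mapping: link to the intermediate conclusion.
            mappings[stmt["id"]] = {
                "level": "multi",
                "derived_from": stmt["intermediate_id"],
            }
    return mappings

statements = [
    {"id": "r1-s1", "source": "direct",
     "block": {"line_id": "vol1/p3/l12", "text": "the defendant signed the contract"}},
    {"id": "r2-s1", "source": "indirect", "intermediate_id": "r1-s1"},
]
mappings = build_mappings(statements)
print(mappings["r2-s1"])
```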
[0105] Based on the above embodiments, the tracing unit 430 is configured to: respond to a source tracing operation on any source tracing statement in the analysis results; if the mapping relationship of that source tracing statement is the multi-level mapping relationship, trace back level by level along the tracing link until a target source tracing statement with the first-level mapping relationship is reached, and generate a file analysis result with line-level tracing capability for the target file based on the line-level positioning identifier of the original text block corresponding to the target source tracing statement; and if the mapping relationship is the first-level mapping relationship, generate a file analysis result with line-level tracing capability for the target file based on the line-level positioning identifier of the original text block corresponding to that source tracing statement.
[0106] Based on the above embodiments, the analysis unit 410 is configured to: generate a structured query task based on the analysis instructions, and decompose the structured query task to obtain multiple subtasks with logical dependencies; assemble and order the multiple subtasks based on the logical dependencies between them to obtain a task execution sequence; and execute the subtasks sequentially, based on the task execution sequence and the task complexity of each subtask, to obtain the analysis results for the target file.
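The level-by-level lookup can be sketched as a walk along the tracing link until a first-level entry is found. The entry layout (`level`, `derived_from`, `line_locator`) and the identifiers are hypothetical, and the cycle check is an added safeguard rather than something the text describes.

```python
def trace_to_source(statement_id, mappings):
    """Follow multi-level links until a first-level mapping is reached.

    Returns the visited path and the line-level locator of the original block.
    """
    path = []
    current = statement_id
    while current not in path:
        path.append(current)
        entry = mappings[current]
        if entry["level"] == "first":
            return path, entry["line_locator"]
        current = entry["derived_from"]  # step up to the intermediate conclusion
    raise ValueError(f"cyclic tracing link at {current!r}")

mappings = {
    "s3": {"level": "multi", "derived_from": "s2"},
    "s2": {"level": "multi", "derived_from": "s1"},
    "s1": {"level": "first", "line_locator": "vol1/p3/l12"},
}
path, locator = trace_to_source("s3", mappings)
print(path, locator)  # prints ['s3', 's2', 's1'] vol1/p3/l12
```

A statement that already has a first-level mapping returns immediately with a one-element path, which corresponds to the second branch described in the paragraph.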
[0107] Figure 5 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, wherein the processor 510, communications interface 520, and memory 530 communicate with each other via the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute a case file analysis method. This method includes: obtaining analysis instructions for a target case file and generating analysis results for the target case file based on the analysis instructions; retrieving a source tracing knowledge base based on the analysis results to obtain the original text blocks corresponding to each source tracing statement in the analysis results, where the source tracing knowledge base contains multiple text blocks within the target case file, as well as the text content and line-level positioning identifier of each text block; and constructing a mapping relationship between each source tracing statement and its corresponding original text block, and generating a case file analysis result with line-level source tracing capability for the target case file based on the mapping relationship and the corresponding line-level positioning identifier.
[0108] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0109] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, wherein when the program instructions are executed by a computer, the computer is able to execute the file analysis method provided by the above methods, the method comprising: obtaining analysis instructions for a target file, and generating analysis results for the target file based on the analysis instructions; retrieving a source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results; the source tracing knowledge base containing multiple text blocks in the target file, as well as the text content and line-level positioning identifier of each text block; constructing a mapping relationship between each source tracing statement and its corresponding original text block, and generating a file analysis result with line-level source tracing capability for the target file based on the mapping relationship and the corresponding line-level positioning identifier.
[0110] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the file analysis method provided by the above methods. The method includes: obtaining analysis instructions for a target file and generating analysis results for the target file based on the analysis instructions; retrieving a source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results; the source tracing knowledge base contains multiple text blocks within the target file, as well as the text content and line-level positioning identifier of each text block; constructing a mapping relationship between each source tracing statement and its corresponding original text block, and generating a file analysis result for the target file with line-level source tracing capability based on the mapping relationship and the corresponding line-level positioning identifier.
[0111] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the embodiments of the present invention according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0112] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0113] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for analyzing case files, characterized by comprising: obtaining analysis instructions for a target file, and generating analysis results for the target file based on the analysis instructions; retrieving a source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results, wherein the source tracing knowledge base contains multiple text blocks in the target file, as well as the text content and line-level positioning identifier of each text block; and constructing a mapping relationship between each source tracing statement and its corresponding original text block, and, based on the mapping relationship and the corresponding line-level positioning identifier, generating a file analysis result with line-level source tracing capability for the target file.
2. The file analysis method according to claim 1, characterized in that the source tracing knowledge base also contains text vectors corresponding to the text content of each text block; and retrieving the source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results comprises: segmenting each source tracing statement into words to obtain a source tracing word segmentation set, and matching the source tracing knowledge base based on the source tracing word segmentation set to obtain a first text block set; sparsely encoding each source tracing statement to obtain a source tracing sparse vector, and sparsely matching the source tracing sparse vector against the text vector of each text block in the source tracing knowledge base to obtain a second text block set; and densely encoding each source tracing statement to obtain a source tracing dense vector, and densely matching the source tracing dense vector against the text vectors of the text blocks in the first text block set and in the second text block set to obtain the original text block corresponding to each source tracing statement.
3. The file analysis method according to claim 2, characterized in that densely matching the source tracing dense vector against the text vectors of the text blocks in the first text block set and in the second text block set to obtain the original text block corresponding to each source tracing statement comprises: merging and deduplicating the first text block set and the second text block set, and densely matching the source tracing dense vector against the text vector of each text block in the merged and deduplicated text block set to obtain the candidate text block set corresponding to each source tracing statement; filtering the candidate text blocks based on at least one of the case type corresponding to the target case file, the analysis type to which the corresponding source tracing statement belongs, and the keyword overlap between the corresponding source tracing statement and each candidate text block in the candidate text block set, to obtain the target text block set corresponding to each source tracing statement; and determining, based on the target text block set, the original text block corresponding to each source tracing statement.
4. The file analysis method according to claim 3, characterized in that determining the original text block corresponding to each source tracing statement based on the target text block set comprises: determining, based on the matching scores of each text block in the target text block set in text matching, sparse matching and dense matching, the text matching score set, sparse matching score set and dense matching score set corresponding to each source tracing statement; normalizing the scores of the text matching score set, sparse matching score set and dense matching score set respectively; determining, based on the case type, the score weights of the normalized text matching score set, sparse matching score set and dense matching score set; fusing the normalized text matching score set, sparse matching score set and dense matching score set based on the score weights to obtain the comprehensive score set corresponding to each source tracing statement; and determining the original text block corresponding to each source tracing statement based on the comprehensive score set corresponding to that source tracing statement and the target text block set.
5. The file analysis method according to any one of claims 1 to 4, characterized in that the mapping relationship includes a first-level mapping relationship or a multi-level mapping relationship; and constructing the mapping relationship between each source tracing statement and its corresponding original text block comprises: determining the generation source of each source tracing statement, the generation source being either direct generation from the target file or indirect generation from an intermediate conclusion; if the generation source of any source tracing statement is direct generation from the target file, constructing the first-level mapping relationship, the first-level mapping relationship being a key-value storage structure whose key is the line and sentence identifier of that source tracing statement in the analysis results and whose value is the text content and line-level positioning identifier of the corresponding original text block; and if the generation source of any source tracing statement is indirect generation from an intermediate conclusion, constructing the multi-level mapping relationship, the multi-level mapping relationship being a tracing link that starts at the line marker of that source tracing statement in the analysis results and ends at the line marker of the intermediate conclusion from which that source tracing statement was generated.
6. The file analysis method according to claim 5, characterized in that generating a file analysis result with line-level tracing capability for the target file based on the mapping relationship and the corresponding line-level positioning identifier comprises: responding to a source tracing operation on any source tracing statement in the analysis results; if the mapping relationship of that source tracing statement is the multi-level mapping relationship, tracing back level by level along the tracing link until a target source tracing statement with the first-level mapping relationship is reached, and generating a file analysis result with line-level tracing capability for the target file based on the line-level positioning identifier of the original text block corresponding to the target source tracing statement; and if the mapping relationship is the first-level mapping relationship, generating a file analysis result with line-level tracing capability for the target file based on the line-level positioning identifier of the original text block corresponding to that source tracing statement.
7. The file analysis method according to any one of claims 1 to 4, characterized in that generating the analysis results for the target file based on the analysis instructions comprises: generating a structured query task based on the analysis instructions, and decomposing the structured query task to obtain multiple subtasks with logical dependencies; assembling and ordering the multiple subtasks based on the logical dependencies between them to obtain a task execution sequence; and executing the subtasks sequentially, based on the task execution sequence and the task complexity of each subtask, to obtain the analysis results for the target file.
8. A file analysis device, characterized by comprising: an analysis unit, configured to obtain analysis instructions for a target file and generate analysis results for the target file based on the analysis instructions; a retrieval unit, configured to retrieve the source tracing knowledge base based on the analysis results to obtain the original text block corresponding to each source tracing statement in the analysis results, wherein the source tracing knowledge base contains multiple text blocks in the target file, as well as the text content and line-level positioning identifier of each text block; and a tracing unit, configured to construct a mapping relationship between each source tracing statement and its corresponding original text block and, based on the mapping relationship and the corresponding line-level positioning identifier, generate a file analysis result with line-level tracing capability for the target file.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the file analysis method as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the file analysis method as described in any one of claims 1 to 7.