Document information mining method, electronic device, and storage medium

By combining a window scoring algorithm based on term frequency-inverse document frequency (TF-IDF) and semantic vectors with knowledge confidence assessment using a large language model, the method addresses the problem of extracting information from complex biomedical literature, achieving high-precision, high-reliability information extraction and improving the automation and data credibility of biomedical literature mining.

CN122240697A · Pending · Publication Date: 2026-06-19 · 国家超级计算天津中心

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
国家超级计算天津中心
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently and accurately extract specialized information from complex biomedical literature, and large language models are prone to misunderstandings and unstable output when lacking sufficient knowledge in the specialized field.

Method used

A window scoring algorithm combining term frequency-inverse document frequency and semantic vectors is applied to the text windows into which the recognized text is segmented. Structured data is extracted by a large language model, and an information extraction context enhancement process is triggered when the knowledge confidence is low. Multiple optical character recognition models and a deep retrieval mechanism are introduced for text recognition and knowledge supplementation, respectively.

Benefits of technology

It achieves high-precision text parsing and professional knowledge supplementation, improves the automation level and data credibility of literature mining, and enhances the accuracy and reliability of information extraction from biomedical literature.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a literature information mining method, an electronic device, and a storage medium. The method includes: retrieving relevant literature from an associated database and processing the retrieved literature to obtain corresponding text recognition results; dividing the text recognition results into multiple text windows, determining the relevance score of each text window to the target keyword based on a window scoring algorithm combining term frequency-inverse document frequency and semantic vector similarity, and determining the information in the text window with the highest relevance score as the information extraction context; extracting structured data from the information extraction context using a large language model; and calculating the knowledge confidence of the large language model for each target entity in the structured data, triggering an information extraction context enhancement sub-process when the knowledge confidence is lower than a preset threshold. The method thereby achieves high-precision text parsing, professional knowledge supplementation, and highly reliable information extraction.

Description

Technical Field

[0001] This application relates to the field of image processing technology, specifically to a document information mining method, electronic device, and storage medium.

Background Technology

[0002] With the rapid development of life science research, the number of biomedical publications worldwide has exploded. A large number of research findings are published in databases in the form of papers, patents, and technical reports. These publications contain a wealth of important information about disease targets, drug molecular structures, and experimental data, which is of great value to drug development and life science research. However, due to the sheer volume and complex formats of these publications, researchers find it difficult to quickly acquire and organize this information manually.

[0003] Currently, the analysis of biomedical literature mainly employs manual reading or rule-based text processing methods. While manual methods are accurate, they are inefficient and struggle to handle massive amounts of literature data. Rule-based script processing methods can automatically extract some information, but due to the complex formatting and diverse structures of the documents, the rule scripts often need to be rewritten if the document format changes, resulting in poor stability.

[0004] To automate the processing of document content, some systems employ OCR (Optical Character Recognition) technology to convert PDF documents into text. However, traditional OCR technology primarily focuses on character recognition and struggles to understand complex document structures. It performs poorly in cases involving two-column layouts, embedded charts, and mathematical formulas, easily generating erroneous text that further impacts subsequent information extraction results.

[0005] In recent years, large language models have made significant progress in natural language understanding and have been increasingly applied to the field of literature analysis. However, general-purpose large language models often lack specialized biomedical knowledge, making them prone to misunderstandings when dealing with complex scientific research problems. Furthermore, large language models suffer from output instability when generating structured data, such as JSON format errors or missing fields.

[0006] Therefore, there is an urgent need for an intelligent method that can achieve high-precision text parsing, professional knowledge supplementation, and highly reliable information extraction in complex scientific literature environments, so as to improve the automation level and data credibility of biomedical literature mining.

Summary of the Invention

[0007] This application aims to provide a literature information mining method, electronic device, and storage medium, which realizes an intelligent method for high-precision text parsing, professional knowledge supplementation, and highly reliable information extraction, thereby improving the automation level and data credibility of literature mining (especially highly specialized literature, such as biomedical literature).

[0008] In a first aspect, embodiments of this application provide a method for mining documentary information, including:

[0009] Based on the research topic input by the user, relevant literature is retrieved from associated databases, and the retrieved literature is processed to obtain the corresponding text recognition results;

[0010] The text recognition results are divided into multiple text windows, and a window scoring algorithm combining term frequency-inverse document frequency and semantic vector similarity is used to determine the relevance score of each text window to the target keyword. The information in the text window with the highest relevance score is determined as the information extraction context.

[0011] Structured data is extracted from the information extraction context using a large language model;

[0012] For each target entity in the structured data, the knowledge confidence of the large language model for the target entity is calculated. When the knowledge confidence is lower than a preset threshold, the information extraction context enhancement sub-process is triggered, and the knowledge confidence of the large language model for the target entity is recalculated based on the enhanced information extraction context.

[0013] When the knowledge confidence is higher than the preset threshold, the credibility of the latest information extraction context is determined, and the credibility and the latest information extraction context are output.

[0014] In a second aspect, embodiments of this application also provide an electronic device, the electronic device comprising:

[0015] A processor and a memory;

[0016] The processor executes the steps of the document information mining method as described in any embodiment by calling programs or instructions stored in the memory.

[0017] In a third aspect, embodiments of this application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the document information mining method as described in any embodiment.

[0018] In summary, this application proposes a literature information mining method. Based on the research topic input by the user, relevant literature is retrieved from associated databases, and the retrieved literature is processed to obtain corresponding text recognition results. The text recognition results are divided into multiple text windows, and a window scoring algorithm combining term frequency-inverse document frequency and semantic vector similarity is used to determine the relevance score of each text window to the target keyword. The information within the text window with the highest relevance score is determined as the information extraction context. Structured data is extracted from the information extraction context using a large language model. For each target entity in the structured data, the knowledge confidence of the large language model for the target entity is calculated. When the knowledge confidence is lower than a preset threshold, an information extraction context enhancement sub-process is triggered, and the knowledge confidence of the large language model for the target entity is recalculated based on the enhanced information extraction context. When the knowledge confidence is higher than the preset threshold, the credibility of the latest information extraction context is determined, and the credibility and the latest information extraction context are output. This method achieves intelligent methods for high-precision text parsing, professional knowledge supplementation, and highly reliable information extraction, improving the automation and data credibility of literature mining (especially highly specialized literature, such as biomedical literature).

Attached Figure Description

[0019] Figure 1 is a flowchart of a document information mining method provided in an embodiment of this application;

[0020] Figure 2 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0021] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0022] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0023] Example 1

[0024] Figure 1 is a flowchart of a document information mining method provided in an embodiment of this application. Referring to Figure 1, the document information mining method specifically includes the following steps:

[0025] S110. Retrieve relevant literature from associated databases based on the research topic input by the user, process the retrieved relevant literature, and obtain the corresponding text recognition results.

[0026] Optionally, a search query can be generated based on the user-input research topic (i.e., keywords or key sentences), and relevant literature can be retrieved from associated databases based on the search query. Specifically, the user-input research topic can be directly determined as the search query, or synonyms, related terms, and common expressions can be generated based on the user-input research topic. Then, the generated synonyms, related terms, and common expressions are connected with the user-input research topic using the logical "OR" operator to obtain the corresponding search query.
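The query construction described in [0026] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_search_query`, the quoting of each term, and the literal ` OR ` syntax are assumptions, since real databases differ in their query grammar, and the expansion terms are supplied by hand here rather than generated.

```python
def build_search_query(topic, expansions=()):
    """Join the research topic and its expansion terms with a logical OR.

    `expansions` stands for the synonyms, related terms, and common
    expressions mentioned in the text; how they are generated is out of scope.
    """
    terms = []
    for term in (topic, *expansions):
        term = term.strip()
        # Skip empty strings and case-insensitive duplicates.
        if term and term.lower() not in (t.lower() for t in terms):
            terms.append(term)
    # Quoting each term as a phrase is an assumed, database-specific convention.
    return " OR ".join(f'"{t}"' for t in terms)


query = build_search_query(
    "EGFR inhibitor",
    ["epidermal growth factor receptor inhibitor", "EGFR antagonist"],
)
print(query)
```

Passing an empty `expansions` list reproduces the simpler option in which the research topic itself is used directly as the search query.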

[0027] The retrieved relevant documents are processed, including deduplication and text recognition.

[0028] S120. Divide the text recognition results into multiple text windows, and determine the relevance score between each text window and the target keyword based on a window scoring algorithm that combines term frequency-inverse document frequency and semantic vector similarity. Then, determine the information in the text window with the highest relevance score as the information extraction context.

[0029] In particular, steps S120 and S130 are essentially semantic localization. In scientific literature, target information is usually distributed across different sections (such as the abstract, experimental methods, or results sections). Step S120 automatically identifies the text regions related to the target information through a semantic localization algorithm.

[0030] Traditional methods often rely on fixed section headings (such as "Method" and "Result") for content extraction, which becomes ineffective for documents with irregular layouts or descriptions that span paragraphs. This embodiment, through the aforementioned window scoring algorithm combining Term Frequency-Inverse Document Frequency (TF-IDF) and semantic vector similarity, eliminates the dependence on the physical layout structure of documents and achieves accurate "semantic anchoring" of target information in complex contexts. This improves the accuracy and interference resistance of the downstream large language model information extraction.

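[0031] For example, the relevance score of each text window to the target keywords is determined by the following formulas. The formula images are not reproduced in this text; the expressions in [0032] to [0035] and [0037] are reconstructed from the symbol definitions in [0036] and [0038], using the standard TF-IDF and cosine-similarity forms:

[0032] $\mathrm{Score}(W_i) = \alpha \cdot \mathrm{TFIDF}(W_i, K) + \beta \cdot \mathrm{Sim}(W_i, K)$

[0033] $\mathrm{TFIDF}(W_i, K) = \sum_{k \in K} \mathrm{TF}(k, W_i) \cdot \mathrm{IDF}(k)$

[0034] $\mathrm{TF}(k, W_i) = \dfrac{c(k, W_i)}{|W_i|}$

[0035] $\mathrm{IDF}(k) = \log \dfrac{N}{n_k}$

[0036] where $\mathrm{Score}(W_i)$ denotes the relevance score of the current text window $W_i$ to the target keywords; $K$ denotes the set of target keywords and $k$ a target keyword in it; $\alpha$ and $\beta$ denote the weighting coefficients; $\mathrm{TF}(k, W_i)$ denotes the term frequency of the target keyword $k$ in the current text window $W_i$; $\mathrm{IDF}(k)$ denotes the inverse document frequency of the target keyword $k$ over the set of text windows of the entire document; $N$ denotes the total number of text windows into which the entire document $D$ is divided; $c(k, W_i)$ denotes the number of times the target keyword $k$ actually appears in the current text window $W_i$; $|W_i|$ denotes the total number of words in the current text window $W_i$ (i.e., the length of the window); and $n_k$ denotes the number of text windows containing the target keyword $k$;

[0037] $\mathrm{Sim}(W_i, K) = \dfrac{\mathbf{v}_{W_i} \cdot \mathbf{v}_K}{\lVert \mathbf{v}_{W_i} \rVert \, \lVert \mathbf{v}_K \rVert}$

[0038] where $\mathrm{Sim}(W_i, K)$ denotes the semantic vector similarity, $\mathbf{v}_{W_i}$ denotes the semantic embedding vector generated by the encoding model from the current text window $W_i$, and $\mathbf{v}_K$ denotes the semantic embedding vector generated by the encoding model from the target keyword set $K$.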
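The window scoring of step S120 can be sketched as below. This is a rough illustration under several assumptions: the score is taken to be a weighted sum of a TF-IDF term and a cosine similarity, the IDF uses an unsmoothed logarithm, the weights default to 0.5 each, and, most importantly, a bag-of-words count vector stands in for the semantic embedding that the patent obtains from an encoding model. The names `split_windows`, `cosine`, and `score_windows` are hypothetical.

```python
import math
from collections import Counter


def split_windows(tokens, size, stride):
    """Slide a fixed-size window over the token list."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, stride)]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def score_windows(windows, keywords, alpha=0.5, beta=0.5):
    """Score each window as alpha * TF-IDF + beta * vector similarity."""
    n_windows = len(windows)
    # Number of windows containing each keyword.
    doc_freq = {k: sum(1 for w in windows if k in w) for k in keywords}
    # Toy stand-in for the encoder embedding of the keyword set.
    keyword_vec = Counter(keywords)
    scores = []
    for window in windows:
        counts = Counter(window)
        tfidf = 0.0
        for k in keywords:
            tf = counts[k] / len(window) if window else 0.0
            idf = math.log(n_windows / doc_freq[k]) if doc_freq[k] else 0.0
            tfidf += tf * idf
        scores.append(alpha * tfidf + beta * cosine(counts, keyword_vec))
    return scores


windows = [
    ["ic50", "of", "compound", "x"],
    ["the", "weather", "was", "mild"],
    ["we", "thank", "the", "lab"],
]
scores = score_windows(windows, ["ic50", "compound"])
best = windows[scores.index(max(scores))]  # the information extraction context
```

In practice the count vectors would be replaced by embeddings from a sentence encoder, which is what lets the similarity term reward windows that discuss the topic without repeating the keywords literally.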

[0039] S130. Extract structured data from the information extraction context using a large language model.

[0040] S140. For each target entity in the structured data, calculate the knowledge confidence of the large language model for the target entity. When the knowledge confidence is lower than a preset threshold, trigger the information extraction context enhancement sub-process, and recalculate the knowledge confidence of the large language model for the target entity based on the enhanced information extraction context.

[0041] A target entity refers to a specific entry in structured data, such as a particular gene or compound.

[0042] The knowledge confidence of a large language model in relation to a target entity refers to how confident the large language model is in understanding the target entity. In other words, knowledge confidence measures the model's own level of understanding of the target entity.

[0043] This knowledge confidence can be obtained using a pre-trained large language model (such as BERT or GPT). For example, the information extraction context and the target entity can be input into a BERT model, which outputs a knowledge confidence between 0 and 1. Alternatively, the information extraction context, the target entity, and a natural language prompt (e.g., "please judge your understanding of the target entity based on the information extraction context", or "please judge your understanding of the molecular mechanism of the target entity based on the information extraction context") can be input into a GPT-type model to obtain the corresponding knowledge confidence. If the knowledge confidence is low, the corresponding natural language prompt can be used directly as knowledge blind spot information for the subsequent enhanced retrieval. The knowledge confidence can be expressed as:

[0044]

[0045] where the expression denotes the knowledge confidence of the current target entity e, given the current information extraction context.

[0046] If the large language model has low confidence in its knowledge of the target entity, it indicates that the large language model does not have enough knowledge reserves about the current target entity (i.e., the current information extraction context is not sufficient). In order to mine more knowledge related to the current target entity, it is necessary to trigger the information extraction context enhancement subprocess to supplement the large language model's reserve knowledge.
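The control flow of steps S140 and S150 can be sketched as a loop. The sketch is hedged in three ways: `estimate_confidence` and `enhance_context` are hypothetical callables standing in for the large language model and the enhancement sub-process; the threshold value 0.7 is arbitrary; and the `max_rounds` cap is an added safeguard that the text does not mention. The text also does not say what happens when the confidence exactly equals the threshold; here that case is treated as passing.

```python
def refine_until_confident(entity, context, estimate_confidence, enhance_context,
                           threshold=0.7, max_rounds=3):
    """Enhance the extraction context until the model is confident enough."""
    confidence = estimate_confidence(entity, context)
    rounds = 0
    while confidence < threshold and rounds < max_rounds:
        # Low confidence: supplement the context, then re-evaluate.
        context = enhance_context(entity, context)
        confidence = estimate_confidence(entity, context)
        rounds += 1
    return context, confidence, rounds


# Toy stand-ins: confidence is the share of required facts present in the context.
facts = ["mechanism", "target"]


def toy_confidence(entity, context):
    return sum(f in context for f in facts) / len(facts)


def toy_enhance(entity, context):
    missing = [f for f in facts if f not in context]
    return context + " " + missing[0] if missing else context


ctx, conf, rounds = refine_until_confident(
    "drug A", "drug A binds its target", toy_confidence, toy_enhance
)
```

Returning the number of rounds alongside the context makes it easy to flag entities whose confidence never reached the threshold within the cap.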

[0047] When large language models deal with highly specialized biomedical problems (such as rare targets, the molecular mechanisms of the latest drugs, etc.), they often face the bottleneck of insufficient built-in domain knowledge. This embodiment significantly enhances the model's information understanding and reasoning ability in specific domains (such as biomedicine) by introducing knowledge confidence and information extraction context enhancement sub-processes.

[0048] S150. When the knowledge confidence is higher than the preset threshold, determine the credibility of the latest information extraction context and output the credibility and the latest information extraction context.

[0049] In some implementations, data from the same experiment are extracted from the latest information extraction context (the experimental data can also be replaced with other specific information, such as the molecular mechanism of the same drug); the credibility of the latest information extraction context is determined based on the variance and weighted average of the extracted data from the same experiment; and the latest information extraction context is converted into a standard structured data format and output.

[0050] Let the dataset for the same experiment extracted from the latest information extraction context be $X = \{x_1, x_2, \ldots, x_n\}$.

[0051] Its weighted average $\bar{x}$ can be determined by the following formulas (the formula images are not reproduced in this text; the expressions are reconstructed from the symbol definitions in [0054]):

[0052] $\bar{x} = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

[0053] $w_i = \lambda_1 I_i + \lambda_2 M_i + \lambda_3 T_i$

[0054] where $\lambda_1$, $\lambda_2$ and $\lambda_3$ denote the preset adjustment coefficients; $I_i$ denotes the influence score of the source literature of the i-th set of experimental data (i.e., the literature that records that data); $M_i$ denotes the reliability score of the experimental method used to obtain the i-th set of experimental data; and $T_i$ denotes the timeliness score of the publication year of the source literature of the i-th set of experimental data. The influence score can be determined from the journal in which the source literature was published and that journal's academic standing. Well-known journals can be categorized in advance; for example, an article published in journal 1 might have an influence score of 80, while an article published in journal 2 might have an influence score of 60. For the reliability score of the experimental method, commonly used experimental methods can be compiled in advance with their corresponding reliability scores, allowing direct lookup during application. The timeliness score can be determined according to the principle that the closer the publication year is to the current time, the higher the score.
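The weighted average over data from the same experiment can be sketched as below. The text names three preset adjustment coefficients and three per-source scores but the exact way they combine is not legible here, so the linear combination in `evidence_weight`, the coefficient values, and the sample numbers are all assumptions made for illustration. The variance is computed with the standard library, and the final mapping from variance and weighted average to a credibility value is left out because its formula is not available in this text.

```python
import statistics


def evidence_weight(impact, method, timeliness, a=0.5, b=0.3, c=0.2):
    """Combine the three per-source scores into one weight (assumed linear form)."""
    return a * impact + b * method + c * timeliness


def weighted_mean(values, weights):
    """Weighted average of the extracted values."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total


# (value, influence score, method reliability score, timeliness score)
observations = [(12.0, 80, 90, 70), (14.0, 60, 70, 90)]
values = [o[0] for o in observations]
weights = [evidence_weight(*o[1:]) for o in observations]

mean = weighted_mean(values, weights)
spread = statistics.pvariance(values)  # population variance of the raw values
```

A small variance relative to the weighted average indicates that independent sources agree, which is the intuition behind using both quantities to judge credibility.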

[0055] The credibility of the latest information extraction context can be determined by the following formula:

[0056]

[0057] where $\sigma^2$ denotes the variance of the data extracted for the same experiment and $\bar{x}$ denotes the weighted average.

[0058] Outputting the latest information extraction context includes: unifying field names and units, generating a standard JSON data structure, and supporting export to JSON, CSV, or Excel format.

[0059] By fusing multi-source evidence, that is, extracting data from the same experiment (or other specific information, such as the molecular mechanism of the same drug) from the latest information extraction context and determining the credibility of that context from the variance and weighted average of the extracted data, data from different publications can be comprehensively analyzed and credibility scores generated, thereby improving the reliability of scientific research data.
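The output step can be illustrated with the standard library alone. In this sketch the alias table `FIELD_ALIASES` and the field names are invented for the example, unit conversion is not shown, and Excel export is omitted because it would need a third-party package.

```python
import csv
import io
import json

# Hypothetical mapping from field names seen in extractions to unified names.
FIELD_ALIASES = {"IC50": "ic50_nm", "ic50 (nM)": "ic50_nm", "Compound": "compound"}


def normalize(record):
    """Rename fields to the unified names; unknown fields pass through."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def to_json(records):
    return json.dumps([normalize(r) for r in records], ensure_ascii=False, indent=2)


def to_csv(records):
    rows = [normalize(r) for r in records]
    fields = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = [
    {"Compound": "A", "IC50": 12.0},
    {"compound": "B", "ic50 (nM)": 14.0},
]
json_text = to_json(records)
csv_text = to_csv(records)
```

Normalizing before serialization means both export formats share one set of column names, regardless of how each source paper labelled the quantity.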

[0060] The literature information mining method provided in this application realizes an intelligent method for high-precision text parsing, professional knowledge supplementation, and highly reliable information extraction, which improves the automation level and data credibility of literature mining (especially highly specialized literature, such as biomedical literature).

[0061] Example 2

[0062] Based on the above embodiments, this embodiment provides an optional implementation for step S110. Specifically, step S110 includes the following sub-steps:

[0063] S111. Generate a unique fingerprint for each relevant document based on its title.

[0064] Specifically, the document title is first normalized, including removing punctuation marks, removing extra spaces, and converting all letters to lowercase. Then, the hash value is calculated to generate a unique fingerprint.

[0065] S112. If multiple related documents with the same fingerprint exist, only one of them should be retained for download to obtain the target document.

[0066] If multiple documents have the same fingerprint, it indicates that there are duplicate documents. In order to achieve the purpose of deduplication, only one of them is kept for download.

[0067] To identify documents with the same title but different content, in addition to the document title, a corresponding fingerprint can be generated by combining metadata such as the document's author and publication date.
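Sub-steps S111 and S112 can be sketched as follows. The choice of SHA-256 is an assumption, since the text only says that a hash value is calculated; the record layout (`title`, `authors`, `published` keys) is likewise hypothetical, and only ASCII punctuation is stripped in this simplified version.

```python
import hashlib
import re
import string


def fingerprint(title, authors=(), published=""):
    """Hash a normalized title, optionally combined with author and date metadata."""
    # Normalize: lowercase, drop ASCII punctuation, collapse whitespace.
    normalized = title.lower().translate(str.maketrans("", "", string.punctuation))
    normalized = re.sub(r"\s+", " ", normalized).strip()
    key = "|".join(
        [normalized, ",".join(a.lower().strip() for a in authors), published]
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint."""
    seen = {}
    for record in records:
        fp = fingerprint(
            record["title"], record.get("authors", ()), record.get("published", "")
        )
        seen.setdefault(fp, record)
    return list(seen.values())


records = [
    {"title": "Deep  Learning: A Review."},
    {"title": "deep learning a review"},
    {"title": "Deep Learning: A Review", "authors": ["Li"], "published": "2021"},
]
unique = deduplicate(records)
```

The third record survives because its author and date metadata change the fingerprint, which corresponds to the case in [0067] of documents that share a title but differ in content.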

[0068] S113. Call multiple optical character recognition models to perform text recognition on the same target document, and determine the text recognition result corresponding to the same target document based on the recognition results of each of the multiple optical character recognition models.

[0069] Because some specialized fields of literature (such as biomedical literature) often employ complex layout structures, such as two-column layouts, embedded charts, and formulas, traditional single OCR (Optical Character Recognition) models are prone to recognition errors. To address this, this embodiment introduces a multi-optical character recognition model collaborative recognition mechanism to perform multiple OCR recognitions on the same document and conduct consistency analysis on the multiple recognition results, thereby generating more reliable text content.

[0070] For example, step S113 further includes the following sub-steps S1131-S1135.

[0071] S1131. Based on the recognition results of multiple optical character recognition models, determine the reference character at each position according to the principle of majority rule.

[0072] For example, if three optical character recognition models are used to recognize the same target document, and the recognition results of two models show that the character at the first position is 'a', while the recognition result of the other model shows that the character at the first position is 'b', then "majority rule" determines that the character at the first position is 'a', and 'a' is determined as the reference character at the first position.
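The majority-rule selection of reference characters can be sketched as below. It assumes the recognition results are already aligned position by position, which real OCR outputs generally are not without a prior alignment step; shorter results are padded with an empty string, and ties fall to the character that appears first among the models. The function name `reference_text` is hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def reference_text(results):
    """Pick, at each position, the character most models agree on."""
    chars = []
    for column in zip_longest(*results, fillvalue=""):
        winner, _count = Counter(column).most_common(1)[0]
        chars.append(winner)
    return "".join(chars)


ocr_results = ["abc", "abd", "xbc"]
reference = reference_text(ocr_results)
print(reference)
```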

[0073] S1132. Determine the consistency score of the recognition result of each optical character recognition model based on the reference character. The more characters that are the same as the reference character in the recognition result, the higher the consistency score.

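[0074] For example, the consistency score of the recognition results of the optical character recognition models is determined by the following formulas (the formula images are not reproduced in this text; the expressions are reconstructed from the symbol definitions in [0077]):

[0075] $S_{\mathrm{con}} = \dfrac{1}{H \cdot n} \sum_{i=1}^{H} \sum_{j=1}^{n} \delta(c_{i,j}, r_j)$

[0076] $\delta(c_{i,j}, r_j) = \begin{cases} 1, & c_{i,j} = r_j \\ 0, & c_{i,j} \neq r_j \end{cases}$

[0077] where $S_{\mathrm{con}}$ denotes the consistency score, $r_j$ denotes the reference character at position j, $c_{i,j}$ denotes the character at position j in the recognition result of optical character recognition model i, H denotes the total number of optical character recognition models, and n denotes the total number of character positions.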
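A consistency score of this kind can be sketched as the share of characters that match the reference, averaged over all models and positions. Averaging over models is an assumption drawn from the symbols listed above (H models, n positions); a per-model score would simply drop the outer average. Position alignment is again assumed, and `consistency_score` is a hypothetical name.

```python
def consistency_score(results, reference):
    """Fraction of (model, position) pairs whose character equals the reference."""
    n = len(reference)
    if n == 0 or not results:
        return 0.0
    matches = sum(
        1
        for text in results
        for j in range(n)
        if j < len(text) and text[j] == reference[j]
    )
    return matches / (len(results) * n)


score = consistency_score(["abc", "abd", "xbc"], "abc")
print(round(score, 3))
```

With three models and three positions, seven of the nine characters agree with the reference, so the score is 7/9.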

[0078] S1133. Determine the comprehensive semantic similarity score between the recognition results of multiple optical character recognition models and the reference text, where the reference text consists of reference characters at each position.

[0079] The comprehensive semantic similarity score between the recognition results and the reference text measures the degree of semantic consistency between the content recognized by the individual OCR models. Traditional OCR verification usually performs only literal comparison. However, in complex biomedical literature, the following situations may occur: one OCR model recognizes "Alpha-helix" while another recognizes "α-helix"; character-level matching fails because spaces or line breaks are misrecognized; or, when processing text around two-column layouts or charts, the paragraph splicing order varies slightly between OCR models. Pure character-level comparison (e.g., based solely on the consistency score) would judge these situations as "identification errors" and assign extremely low scores. Introducing the vector-based semantic similarity score allows the system to tolerate non-critical character differences: as long as the recognition results express the same scientific meaning in a biological context, the system still assigns them high credibility, thereby improving the accuracy and robustness of text recognition.

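[0080] For example, the comprehensive semantic similarity score is determined by the following formula (the formula image is not reproduced in this text; the expression is reconstructed from the symbol definitions in [0082], using cosine similarity):

[0081] $S_{\mathrm{sem}} = \dfrac{1}{H} \sum_{i=1}^{H} \dfrac{\mathbf{v}_i \cdot \mathbf{v}_{\mathrm{ref}}}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_{\mathrm{ref}} \rVert}$

[0082] where $S_{\mathrm{sem}}$ denotes the comprehensive semantic similarity score, $\mathbf{v}_i$ denotes the text semantic embedding vector of the recognition result of optical character recognition model i, $\mathbf{v}_{\mathrm{ref}}$ denotes the text semantic embedding vector of the reference text, and H denotes the total number of optical character recognition models.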
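The averaging of per-model similarities to the reference text can be sketched as below. The important caveat is that `embed` here builds character-trigram count vectors, which capture only surface similarity; it is a toy stand-in for the text semantic embedding model that the patent relies on, chosen so the example runs without external dependencies. Cosine similarity as the comparison measure is also an assumption.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: character n-gram counts, ignoring case and whitespace."""
    cleaned = "".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1)))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def semantic_similarity_score(results, reference):
    """Mean similarity between each model's output and the reference text."""
    ref_vec = embed(reference)
    return sum(cosine(embed(r), ref_vec) for r in results) / len(results)


# A stray space no longer counts as a mismatch under this representation.
tolerant = semantic_similarity_score(["alpha helix"], "alphahelix")
mixed = semantic_similarity_score(["alpha helix", "beta sheet"], "alphahelix")
```

Swapping `embed` for a real sentence encoder would additionally let variants such as a spelled-out and a symbolic form of the same term score as similar, which trigram counts cannot do.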

[0083] S1134. Determine the text recognition confidence of each optical character recognition model based on the consistency score and the comprehensive semantic similarity score.

[0084] S1135. If the text recognition confidence level is greater than the text threshold, then the reference text is determined as the corresponding text recognition result.

[0085] If the text recognition confidence is less than the text threshold, a recognition error may have occurred. In this case, recognition is performed again, either with the same OCR models or with a different OCR model, and the above steps are repeated until the text recognition confidence is greater than the text threshold.

[0086] This embodiment introduces a collaborative recognition mechanism with multiple optical character recognition models, which performs multiple OCR recognitions on the same document and conducts consistency analysis on the recognition results, thereby generating more reliable text content. It also incorporates vector-based semantic similarity, which allows the system to tolerate non-critical character differences: as long as the recognition results express the same scientific meaning in a biological context, the system still assigns them high credibility, thereby improving the accuracy and robustness of text recognition.

[0087] Example 3

[0088] Based on the above embodiments, this embodiment provides an optional implementation of the "information extraction context enhancement sub-process" in step S140. Specifically, the "information extraction context enhancement sub-process" includes the following sub-steps:

[0089] S141. Generate deep search query information based on the current target entity and the current information extraction context.

[0090] For each target entity in structured data, when the knowledge confidence score is below a preset threshold, it indicates that the large language model's knowledge reserves about the current target entity are insufficient. To mine more knowledge related to the current target entity, it is necessary to trigger the information extraction context enhancement subprocess to supplement the large language model's knowledge reserves. Therefore, when generating deep retrieval query information, the current target entity can be identified as the deep retrieval query information. Alternatively, entries related to the current target entity can be found from a preset mapping relationship (for example, if the target entity is a compound, the corresponding related entries could be the compound's properties, molecular formula, mechanism of action, etc.), and the similarity between each entry and the current information extraction context can be calculated. Entries with lower similarity scores are used as deep retrieval query information. For entries with higher similarity scores, it indicates that the current information extraction context already contains knowledge related to that entry, eliminating the need for repeated retrieval and supplementation. Therefore, only entries with lower similarity scores are used as deep retrieval query information to improve the accuracy of deep retrieval and avoid redundant retrieval.
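The entry-filtering option in [0090] can be sketched as follows. The similarity function here is a simple token-overlap ratio used only so the example is runnable; the text implies a semantic similarity, so `token_overlap`, the 0.5 cut-off, and the function name `select_query_entries` are all illustrative assumptions, as is the hand-written list of related entries that would normally come from the preset mapping.

```python
def token_overlap(entry, context):
    """Share of the entry's tokens that already occur in the context."""
    entry_tokens = entry.lower().split()
    context_tokens = set(context.lower().split())
    if not entry_tokens:
        return 0.0
    return sum(t in context_tokens for t in entry_tokens) / len(entry_tokens)


def select_query_entries(entity, related_entries, context,
                         similarity=token_overlap, max_similarity=0.5):
    """Keep only entries the context does not already cover."""
    return [
        f"{entity} {entry}"
        for entry in related_entries
        if similarity(entry, context) <= max_similarity
    ]


queries = select_query_entries(
    "aspirin",
    ["molecular formula", "mechanism of action"],
    "The molecular formula of aspirin is C9H8O4",
)
print(queries)
```

The molecular formula is already present in the context, so only the mechanism of action is turned into a deep search query, which avoids the redundant retrieval the paragraph warns about.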

[0091] Alternatively, a large language model can be used to extract context based on the current target entity and current information, and combined with given prompts (e.g., determine the knowledge gaps related to the current target entity in the current information extraction context, and generate corresponding search terms for those knowledge gaps) to generate deep search query information. Generating deep search query information is a crucial step in achieving "precise knowledge replenishment." It is no longer a simple restatement of the user's question, but rather a set of targeted, structured, and executable deep search instructions generated by the agent based on the current "knowledge gaps" and "contextual logic." The core principle is the mapping from "information needs" to "search instructions." The agent needs to translate the "knowledge gaps related to the current target entity exposed by the current information extraction context" into search language that the database can understand. Alternatively, the type of information associated with the current target entity (e.g., "disease-gene association," "drug-target mechanism of action," "clinical trial results," or "side effect statistics") can be determined from a pre-configured mapping relationship. The determined information type can then be analyzed to determine whether it is missing from the current information extraction context, and the missing information type can be used as the search query information. Furthermore, synonyms and near-synonyms can be added to enrich the search query information.

[0092] Optionally, the knowledge confidence level of the current target entity can also be specifically defined as the knowledge confidence level of a specific content of the target entity (e.g., what is the molecular mechanism of drug A). If the confidence level is low, the corresponding specific content (e.g., what is the molecular mechanism of drug A) can be directly identified as knowledge blind spot information and directly identified as deep search query information.

[0093] S142. Relevant literature is obtained from multiple professional databases based on the deep search query information, and the most relevant literature is selected by semantic similarity.

[0094] Specifically, the semantic similarity between each retrieved document and the deep search query information is calculated, and documents whose semantic similarity reaches a set threshold are identified as the most relevant documents.
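As a non-limiting illustration, the threshold-based selection in step S142 might be sketched as follows. The bag-of-words `embed` function is only a toy stand-in for whatever semantic encoding model is actually used, and the function names and the default threshold of 0.5 are assumptions for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding; a stand-in for a real semantic encoder."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relevant(
    query: str, documents: List[str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep documents whose similarity to the query reaches the threshold,
    sorted from most to least similar."""
    query_vec = embed(query)
    scored = [(doc, cosine(query_vec, embed(doc))) for doc in documents]
    kept = [(doc, score) for doc, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

In practice the toy embedding would be replaced by dense vectors from an encoding model; the thresholding and ranking logic is unchanged.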

[0095] S143. Analyze the most relevant literature and extract new knowledge to address the current knowledge gaps.

[0096] Optionally, the most relevant literature can be analyzed to extract new knowledge addressing the current knowledge gap. This includes: extracting key entities (genes, proteins, compounds, diseases, pathways, etc.) from the literature; identifying relationships between entities (e.g., "TP53 inhibits cell proliferation," "Aspirin reduces the inflammatory factor IL-6"); extracting experimental parameters (sample size, control group, dosage, time points, statistical methods, etc.); and extracting conclusion statements (locating core conclusion statements in the literature, such as "This study found that TP53 mutation is significantly associated with chemotherapy resistance"). The extracted information is then processed into structured data. The structured data is compared semantically with the information describing the current knowledge gap, and data whose semantic similarity reaches a threshold is identified as content that can fill the knowledge gap. Alternatively, all of the structured data can be used as new knowledge addressing the current knowledge gap.

[0097] Optionally, the Reflector from the ACE (Advanced Cognitive Engine) framework can be invoked. The Reflector analyzes the most relevant retrieved literature to extract specific biological facts, experimental parameter logic, or common misconceptions (Insights) addressing the current knowledge gap. The Reflector does not require specialized training and can utilize existing large language models. Its input consists of the full text or abstract of the most relevant retrieved literature, along with prompts (e.g., "Extract key biological facts" or "Information on current knowledge gaps"); the output is a set of core insights extracted from the literature by the large model.

[0098] In some implementations, the entire content of the most relevant document can be directly identified as new knowledge.

[0099] S144. The new knowledge is synthesized into structured incremental entries, and the incremental entries are appended to the current information extraction context and deduplication is performed to obtain the enhanced information extraction context.

[0100] For example, a similarity calculation based on embedding vectors is used to determine the similarity between each piece of new knowledge and the existing knowledge. If the similarity is less than a threshold, the piece of new knowledge is determined to be a new entry.
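As a non-limiting illustration, the embedding-based deduplication described above might be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the function names and the default threshold of 0.8 are assumptions for this example. Each candidate is also compared against entries accepted earlier in the same batch, which is one reasonable reading of "deduplication" rather than a requirement stated in this application.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_new_knowledge(
    existing: List[str], candidates: List[str], threshold: float = 0.8
) -> List[str]:
    """Append each candidate whose best similarity to any entry already in
    the context is below the threshold; near-duplicates are dropped."""
    merged = list(existing)
    vectors = [embed(entry) for entry in merged]
    for candidate in candidates:
        vec = embed(candidate)
        best = max((cosine(vec, v) for v in vectors), default=0.0)
        if best < threshold:
            merged.append(candidate)
            vectors.append(vec)
    return merged
```

Existing entries are never modified or removed, so the context only grows through incremental entries.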

[0101] Alternatively, new knowledge and existing knowledge (i.e., the current information extraction context) can be input into a large model, allowing the large model to determine the differences between them and output difference entries. These difference entries can then be added as incremental entries to the current information extraction context.

[0102] This embodiment innovatively introduces the concept of deep retrieval and proposes an "observation-trigger" mechanism. When a low level of knowledge confidence is observed, a deep retrieval process is triggered, transforming the retrieved literature into continuously evolving "knowledge playbooks." Key biological laws and facts are extracted and accumulated, and the context is dynamically supplemented through incremental updates. This significantly enhances the model's information understanding and reasoning capabilities in the biomedical field without losing existing details.
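As a non-limiting illustration, the "observation-trigger" control flow might be sketched as follows. The two callables stand in for the knowledge confidence calculation and the context enhancement sub-process (steps S141 to S144); their signatures, the default threshold, and the `max_rounds` cap are assumptions for this example. In particular, this application does not state an upper bound on the number of enhancement rounds; the cap is added here only to guarantee termination.

```python
from typing import Callable, List, Tuple


def extract_with_enhancement(
    entity: str,
    context: List[str],
    confidence_fn: Callable[[str, List[str]], float],
    enhance_fn: Callable[[str, List[str]], List[str]],
    threshold: float = 0.7,
    max_rounds: int = 3,
) -> Tuple[List[str], float, int]:
    """Observe the knowledge confidence and trigger context enhancement
    while it stays below the threshold.

    Returns the latest context, the latest confidence, and the number of
    enhancement rounds that were run.
    """
    confidence = confidence_fn(entity, context)
    rounds = 0
    while confidence < threshold and rounds < max_rounds:
        # Low confidence observed: run the enhancement sub-process, then
        # recalculate the confidence on the enhanced context.
        context = enhance_fn(entity, context)
        confidence = confidence_fn(entity, context)
        rounds += 1
    return context, confidence, rounds
```

Because `enhance_fn` returns a new context that extends the previous one, earlier entries are preserved across rounds, matching the incremental-update behavior described above.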

[0103] Figure 2 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 2, the electronic device 500 includes one or more processors 501 and a memory 502.

[0104] The processor 501 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 500 to perform desired functions.

[0105] The memory 502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and / or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 501 may execute the program instructions to implement the document information mining method of any embodiment of this application described above and / or other desired functions. Various contents such as initial extrinsic parameters and thresholds may also be stored in the computer-readable storage medium.

[0106] In one example, the electronic device 500 may further include an input device 503 and an output device 504, these components being interconnected via a bus system and / or other forms of connection mechanisms (not shown). The input device 503 may include, for example, a keyboard, a mouse, etc. The output device 504 may output various information to the outside, including warning messages, braking force, etc. The output device 504 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0107] Of course, for the sake of simplicity, Figure 2 shows only some of the components of the electronic device 500 that are relevant to this application; components such as buses, input / output interfaces, etc., are omitted. In addition, the electronic device 500 may include any other suitable components depending on the specific application.

[0108] In addition to the methods and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the document information mining method provided in any embodiment of this application.

[0109] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0110] Furthermore, embodiments of this application may also be computer-readable storage media storing computer program instructions thereon, which, when executed by a processor, cause the processor to perform the steps of the document information mining method provided in any embodiment of this application.

[0111] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.

[0112] It should be noted that the terminology used in this application is for the purpose of describing specific embodiments only and is not intended to limit the scope of this application. As used in the specification and claims of this application, unless the context clearly indicates otherwise, words such as "a," "an," and / or "the" do not specifically refer to the singular and may also include the plural. The terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, or apparatus. Without further limitations, an element defined by the phrase "comprising an..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element.

[0113] It should also be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, and are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation on this application. Unless otherwise expressly specified and limited, the terms "installed," "connected," "linked," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication between two elements. For those skilled in the art, the specific meaning of the above terms in this application can be understood according to the specific circumstances.

[0114] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. The above descriptions are merely preferred embodiments of this application. It should be noted that due to the limitations of written expression, while there are objectively infinite specific structures, those skilled in the art can make several improvements, modifications, or changes without departing from the principles of this invention, and can also combine the above technical features in an appropriate manner. These improvements, modifications, changes, or combinations, or the direct application of the inventive concept and technical solution to other situations without modification, should all be considered within the scope of protection of this application.

Claims

1. A document information mining method, characterized in that it comprises: retrieving relevant literature from associated databases based on a research topic input by a user, and processing the retrieved relevant literature to obtain corresponding text recognition results; dividing the text recognition results into multiple text windows, determining the relevance score of each text window to the target keyword using a window scoring algorithm combining term frequency-inverse document frequency and semantic vector similarity, and determining the information in the text window with the highest relevance score as the information extraction context; extracting structured data from the information extraction context using a large language model; for each target entity in the structured data, calculating the knowledge confidence of the large language model for the target entity; when the knowledge confidence is lower than a preset threshold, triggering an information extraction context enhancement sub-process, and recalculating the knowledge confidence of the large language model for the target entity based on the enhanced information extraction context; and when the knowledge confidence is higher than the preset threshold, determining the credibility of the latest information extraction context, and outputting the credibility and the latest information extraction context.

2. The document information mining method according to claim 1, characterized in that processing the retrieved relevant literature to obtain the corresponding text recognition results includes: generating a unique fingerprint for each relevant document based on its title; if multiple relevant documents with the same fingerprint exist, retaining only one of them for download to obtain a target document; and invoking multiple optical character recognition models to perform text recognition on the same target document, and determining the text recognition result corresponding to that target document based on the recognition results of the multiple optical character recognition models.

3. The document information mining method according to claim 2, characterized in that determining the text recognition result corresponding to the same target document based on the recognition results of the multiple optical character recognition models includes: determining the reference character at each position from the recognition results of the multiple optical character recognition models according to the majority-rule principle; determining the consistency score of the recognition result of each optical character recognition model based on the reference characters, wherein the more characters in a recognition result that are the same as the reference characters, the higher the consistency score; determining the comprehensive semantic similarity score between the recognition results of the multiple optical character recognition models and a reference text, wherein the reference text is composed of the reference characters at each position; determining the text recognition confidence of each optical character recognition model based on the consistency score and the comprehensive semantic similarity score; and if the text recognition confidence is greater than a text threshold, determining the reference text as the corresponding text recognition result.

4. The document information mining method according to claim 3, characterized in that determining the consistency score of the recognition result of each optical character recognition model based on the reference character includes: determining the consistency score using a formula [formula not reproduced in this text], whose terms denote: the consistency score; the reference character at position j; the character at position j in the recognition result of optical character recognition model i; H, the total number of optical character recognition models; and n, the total number of character positions.

5. The document information mining method according to claim 3, characterized in that determining the comprehensive semantic similarity score between the recognition results of the multiple optical character recognition models and the reference text includes: determining the comprehensive semantic similarity score using a formula [formula not reproduced in this text], whose terms denote: the comprehensive semantic similarity score; the text semantic embedding vector of the recognition result of optical character recognition model i; the text semantic embedding vector of the reference text; and H, the total number of optical character recognition models.

6. The document information mining method according to claim 1, characterized in that determining the relevance score of each text window to the target keyword using the window scoring algorithm combining term frequency-inverse document frequency and semantic vector similarity includes: determining the relevance score of each text window to the target keyword using a formula [formula not reproduced in this text], whose terms denote: the relevance score of the current text window with respect to the target keyword k, where K is the set of target keywords; two weighting coefficients; the term frequency of the target keyword k in the current text window; the inverse document frequency of the target keyword k within the set of text windows of the entire document; the total number of text windows into which the entire document is divided; the actual number of times the target keyword k appears in the current text window; the total number of all words in the current text window; the number of text windows containing the target keyword k; the semantic vector similarity; the semantic embedding vector generated by the encoding model from the current text window; and the semantic embedding vector generated by the encoding model from the target keyword set K.

7. The document information mining method according to claim 1, characterized in that the information extraction context enhancement sub-process includes: generating deep search query information based on the current target entity and the current information extraction context; obtaining relevant literature from multiple professional databases based on the deep search query information, and selecting the most relevant literature by semantic similarity; analyzing the most relevant literature to extract new knowledge that addresses the current knowledge gaps; and synthesizing the new knowledge into structured incremental entries, appending the incremental entries to the current information extraction context, and performing deduplication to obtain the enhanced information extraction context.

8. The document information mining method according to claim 1, characterized in that determining the credibility of the latest information extraction context and outputting the credibility and the latest information extraction context includes: extracting data belonging to the same experiment from the latest information extraction context; determining the credibility of the latest information extraction context based on the variance and weighted average of the extracted data from the same experiment; and converting the latest information extraction context into a standard structured data format for output.

9. An electronic device, characterized in that the electronic device includes a processor and a memory; the processor executes the steps of the document information mining method according to any one of claims 1 to 8 by calling a program or instructions stored in the memory.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program or instructions that cause a computer to perform the steps of the document information mining method according to any one of claims 1 to 8.
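As a non-limiting illustration that forms no part of the claims, the window scoring of claim 6 might be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes one conventional form consistent with the listed terms: the score of a window is alpha times the sum over target keywords of TF times IDF, plus beta times the cosine similarity between the window embedding and the keyword-set embedding, with TF taken as the keyword count divided by the number of words in the window and IDF as the natural logarithm of the total number of windows divided by the number of windows containing the keyword. The exact form, any smoothing, the weight values, the whitespace tokenizer, and the toy bag-of-words `embed` function are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Whitespace tokenizer; a stand-in for real word segmentation."""
    return text.lower().split()


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding; a stand-in for the encoding model."""
    return dict(Counter(tokenize(text)))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_windows(
    windows: Sequence[str],
    keywords: Sequence[str],
    alpha: float = 0.5,
    beta: float = 0.5,
) -> List[float]:
    """Score each window as alpha * sum(TF * IDF) + beta * cosine similarity.

    Keywords are assumed to be single lowercase tokens.
    """
    tokens = [tokenize(w) for w in windows]
    total = len(windows)
    # Number of windows that contain each keyword.
    containing = {k: sum(1 for t in tokens if k in t) for k in keywords}
    keyword_vec = embed(" ".join(keywords))
    scores = []
    for text, toks in zip(windows, tokens):
        counts = Counter(toks)
        tfidf = 0.0
        for k in keywords:
            if containing[k] == 0 or not toks:
                continue  # keyword absent everywhere, or empty window
            tf = counts[k] / len(toks)
            idf = math.log(total / containing[k])
            tfidf += tf * idf
        scores.append(alpha * tfidf + beta * cosine(embed(text), keyword_vec))
    return scores


def best_window(windows: Sequence[str], keywords: Sequence[str]) -> str:
    """Return the highest-scoring window (windows must be non-empty)."""
    scores = score_windows(windows, keywords)
    return windows[max(range(len(windows)), key=scores.__getitem__)]
```

Note that with this unsmoothed IDF, a keyword that appears in every window contributes nothing to the term-frequency part, so in that case the ranking is decided by the semantic similarity term alone.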