Literature extraction method, system, device and storage medium based on large language model

By constructing document extraction templates and using large language models for document analysis, the accuracy and efficiency issues of document extraction in existing technologies have been resolved. This has enabled efficient and standardized document information extraction, adapting to various document formats and supporting batch processing.

CN120409450B · Active · Publication Date: 2026-06-30 · PEKING UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: PEKING UNIV
Filing Date: 2025-07-02
Publication Date: 2026-06-30


Abstract

This invention provides a method, system, device, and storage medium for literature extraction based on a large language model, relating to the field of medical research technology. The method includes: constructing a literature extraction template according to literature extraction requirements; wherein the literature extraction requirements include extraction rules, format requirements, and pre-extraction fields, the pre-extraction fields including: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group; acquiring the literature to be extracted and converting it into a processable text format to form a first text; inputting the first text and the literature extraction template into a large language model, applying the large language model to read and analyze the first text, and extracting the literature according to the literature extraction template to obtain the literature extraction results corresponding to the literature to be extracted. This invention improves the accuracy and efficiency of literature extraction and can better meet the needs of systematic reviews.


Description

Technical Field

[0001] This invention relates to the field of medical research technology, and in particular to a method, system, device and storage medium for extracting documents based on a large language model.

Background Technology

[0002] A literature review is the process of summarizing, analyzing, and organizing existing research findings in a specific field or academic problem. It is an important method for researchers to comprehensively grasp the current research status and formulate research questions. Almost all research relies on literature reviews, which help researchers understand the current state of their field, avoid duplication of effort, and provide a theoretical foundation for subsequent research.

[0003] Besides general literature reviews, systematic reviews are a more rigorous and standardized form of review that aims to comprehensively and systematically collect, evaluate, and synthesize all relevant research on a specific topic in order to draw comprehensive conclusions. They are widely used in fields such as medicine, social sciences, and education, and are a key step in evidence-based research and decision-making. In practice, systematic reviews require the broadest possible database search, which often means reading and extracting a large number of documents. This increases the researcher's workload and requires significant time and human resources.

[0004] In existing technologies, manual extraction is prone to omissions and errors. Currently, there is no good remedy other than having two people perform the extraction, which further increases time costs. Manual extraction relies mainly on the extractor's own understanding of the study and is therefore highly subjective; the format of the extracted information also follows the extractor's own interpretation, making it difficult to standardize. In addition, systematic reviews require standardized information extraction. Most current technologies used for literature reviews are designed for general review writing and lack standardized information extraction methods, so they are insufficient to meet the needs of systematic reviews.

[0005] Therefore, existing technical solutions have poor accuracy and efficiency in document extraction.

Summary of the Invention

[0006] This invention provides a document extraction method, system, device, and storage medium based on a large language model, which addresses the shortcomings of poor accuracy and efficiency in document extraction in existing technologies, and improves the accuracy and efficiency of document extraction based on a large language model.

[0007] This invention provides a document extraction method based on a large language model, comprising the following steps:

[0008] Based on the literature extraction requirements, a literature extraction template is constructed; wherein, the literature extraction requirements include extraction rules, format requirements and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group;

[0009] Obtain the documents to be extracted and convert them into a processable text format to form the first text;

[0010] The first text and the document extraction template are input into a large language model. The large language model is used to read and analyze the first text, and the first text is extracted according to the document extraction template to obtain the document extraction result corresponding to the document to be extracted.

[0011] According to the present invention, a document extraction method based on a large language model is provided, wherein the document extraction result is an extraction table; the step of applying the large language model to read and analyze the first text, and extracting the first text according to the document extraction template to obtain the document extraction result corresponding to the document to be extracted includes:

[0012] The first text is divided according to preset structured rules to obtain multiple text fragments, forming the second text;

[0013] The large language model is applied to read and analyze the second text, and information corresponding to the pre-extracted fields is extracted according to the extraction rules and format requirements to form the third text;

[0014] The third text is converted into a table format to obtain the extraction table corresponding to the document to be extracted.

[0015] According to the present invention, a document extraction method based on a large language model is provided, wherein the large language model is applied to perform reading analysis on the second text, and information corresponding to the pre-extracted fields is extracted according to the extraction rules and the format requirements to form a third text, including:

[0016] Receive the second text and perform in-depth analysis on the text information corresponding to each subheading;

[0017] Using the large language model and an extraction table based on the PICO format, information corresponding to the pre-extracted fields is extracted one by one from the text information corresponding to each subheading to form the third text.

[0018] According to the document extraction method based on a large language model provided by the present invention, the method applies the large language model and extracts information corresponding to pre-extracted fields one by one from the text information corresponding to each subheading according to the extraction table based on the PICO format, forming the third text, including:

[0019] The basic research information and research design type are extracted from the text information corresponding to the title and abstract subheadings. The basic research information includes: the first author of the study, the year of publication, and the countries of the participants. The research design type includes at least one of the following: randomized controlled trial, cohort study, and case-control study.

[0020] Research groups are extracted from the text information corresponding to the abstract, methods, and baseline information table subheadings;

[0021] Participant characteristics are extracted from the text information corresponding to the title, abstract, and methods subheadings;

[0022] Interventions for each research group are extracted from the text information corresponding to the abstract, methods, and baseline information table subheadings; the interventions for each research group include: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration;

[0023] The background medication for each research group is extracted from the text information corresponding to the methods subheading.
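
The correspondence between pre-extraction fields and subheadings listed above can be written down as a lookup table. The following is a minimal sketch only: the dictionary name `FIELD_SOURCES`, the helper `context_for_field`, and the lower-case section keys are hypothetical and are not defined in the patent text.

```python
# Which sections each pre-extraction field is read from, as listed above.
# The names and key spellings are illustrative assumptions.
FIELD_SOURCES = {
    "basic research information": ["title", "abstract"],
    "research design type": ["title", "abstract"],
    "research groups": ["abstract", "methods", "baseline information table"],
    "participant characteristics": ["title", "abstract", "methods"],
    "interventions for each research group": [
        "abstract", "methods", "baseline information table",
    ],
    "background medications for each research group": ["methods"],
}


def context_for_field(field_name: str, second_text: dict[str, str]) -> str:
    """Join only the sections that the given field is extracted from."""
    sections = FIELD_SOURCES[field_name]
    return "\n\n".join(
        f"[{name}]\n{second_text[name]}"
        for name in sections
        if name in second_text  # skip sections missing from this document
    )


# Hypothetical sectioned document (no baseline information table present).
example_sections = {"title": "T", "abstract": "A", "methods": "M"}
print(context_for_field("research groups", example_sections))
```

Restricting each field to its listed sections keeps the text passed to the model short; whether the claimed method restricts the input in this way is not stated, so this is one possible reading.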

[0024] According to the literature extraction method based on a large language model provided by the present invention, the research design type is a randomized controlled trial; the method further includes:

[0025] The trial registration platform and registration number are extracted from the text information corresponding to the methods subheading.

[0026] According to the document extraction method based on a large language model provided by the present invention, after obtaining the document extraction result corresponding to the document to be extracted, the method further includes:

[0027] The document extraction results are standardized, and the standardization process includes at least one of the following: deleting useless descriptive information, standardizing information format, and adding a mark to information that cannot be extracted;

[0028] Store the standardized literature extraction results.
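
The three standardization operations named above (deleting useless descriptive information, standardizing the format, and marking information that cannot be extracted) might look like the following sketch. The marker string `NOT_EXTRACTED`, the filler phrases, and the function name `standardize` are assumptions chosen for illustration; the patent does not specify them.

```python
import re
from typing import Optional

# Hypothetical choices: neither the marker nor the filler phrases
# come from the patent text.
NOT_EXTRACTED = "not reported"
FILLER_PREFIXES = ("According to the text, ", "The study states that ")


def standardize(result: dict[str, Optional[str]]) -> dict[str, str]:
    cleaned = {}
    for name, value in result.items():
        text = (value or "").strip()
        # Delete useless descriptive information (here: known filler prefixes).
        for prefix in FILLER_PREFIXES:
            if text.startswith(prefix):
                text = text[len(prefix):]
        # Standardize the format: collapse runs of whitespace.
        text = re.sub(r"\s+", " ", text).strip()
        # Mark information that could not be extracted.
        cleaned[name] = text if text else NOT_EXTRACTED
    return cleaned


raw = {
    "research design type": "According to the text,  randomized   controlled trial",
    "manufacturer": None,
}
standardized = standardize(raw)
print(standardized)
```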

[0029] According to the document extraction method based on a large language model provided by the present invention, there are multiple documents to be extracted; after obtaining the document extraction results corresponding to the documents to be extracted, the method further includes:

[0030] Determine whether all documents to be extracted have been extracted; if all documents to be extracted have been extracted, integrate and output the extraction results corresponding to all documents to be extracted; otherwise, return to the step of obtaining documents to be extracted.
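
The batch step above is essentially a loop over the documents to be extracted followed by integration of the results. A minimal sketch, assuming a per-document extraction function is supplied by the caller (`extract_one` and `fake_extract` are hypothetical stand-ins, not part of the patent):

```python
from typing import Callable, Iterable


def extract_all(
    documents: Iterable[str],
    extract_one: Callable[[str], dict],
) -> list[dict]:
    """Extract each document in turn, then return the integrated results."""
    results = []
    for document in documents:
        # Not all documents are done yet: obtain the next one and extract it.
        results.append(extract_one(document))
    # All documents have been extracted: integrate and output.
    return results


# Stand-in extractor so that the sketch runs without a model.
def fake_extract(document: str) -> dict:
    return {"source": document, "length": len(document)}


batch = extract_all(["paper A", "paper BB"], fake_extract)
print(batch)
```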

[0031] This invention also provides a document extraction system based on a large language model, comprising the following modules:

[0032] The construction module is used to build a literature extraction template based on the literature extraction requirements. The literature extraction requirements include extraction rules, format requirements, and pre-extraction fields. The pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group.

[0033] The acquisition module is used to acquire the documents to be extracted and convert the documents to be extracted into a processable text format to form the first text;

[0034] The extraction module is used to input the first text and the document extraction template into the large language model, apply the large language model to read and analyze the first text, and extract the first text according to the document extraction template to obtain the document extraction result corresponding to the document to be extracted.

[0035] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the document extraction method based on the large language model as described above.

[0036] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the document extraction method based on a large language model as described above.

[0037] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the document extraction method based on a large language model as described above.

[0038] In the document extraction method, system, device, and storage medium based on a large language model provided by this invention, constructing a document extraction template clarifies the specific requirements for document extraction, including extraction rules, format requirements, and pre-extraction fields. This gives the entire extraction process a clear goal and direction, and the template standardizes and normalizes the extraction process. The large language model can operate directly according to the template without redefining extraction rules and formats each time, thus greatly improving extraction efficiency. Furthermore, converting the documents to be extracted into a processable text format to form the first text facilitates subsequent processing and improves the system's compatibility and versatility, so that it can handle multiple document formats. Further, the first text and the document extraction template are input into the large language model, which reads and analyzes the first text and extracts information according to the document extraction template to obtain the document extraction results corresponding to the documents to be extracted. This reduces the tedious process of manual extraction and improves extraction efficiency. Because the large language model can understand complex language structures and contextual relationships, it can extract information according to the document extraction template, improving the accuracy and consistency of the extracted information. Therefore, the solution of this application improves the accuracy and efficiency of document extraction.

Attached Figure Description

[0039] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0040] Figure 1 is the first flowchart of the document extraction method based on a large language model provided by the present invention.

[0041] Figure 2 is the second flowchart of the document extraction method based on a large language model provided by the present invention.

[0042] Figure 3 is a schematic diagram of the document extraction system based on a large language model provided by the present invention.

[0043] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0045] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.

[0046] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities and do not necessarily imply a specific order or sequence, unless otherwise indicated. It should be understood that such terms can be used interchangeably where appropriate, for example, in situations where implementation can proceed in an order other than those given in the embodiments illustrated or described in this application.

[0047] Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to those explicitly listed, but may include other components not explicitly listed or inherent to such product or device. As used in this application, the term "module" means any known or subsequently developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code capable of performing the functions associated with that element.

[0048] The technical solution of this application and how it solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The document extraction method based on a large language model provided by this invention is described below with reference to Figure 1 and Figure 2.

[0049] For example, the following explanation will use a document extraction system based on a large language model as the main body for implementing the document extraction method based on a large language model.

[0050] Figure 1 is the first flowchart of the document extraction method based on a large language model provided by this invention. As shown in Figure 1, the method includes steps 101 to 103.

[0051] Step 101: Construct a document extraction template based on the document extraction requirements.

[0052] Step 102: Obtain the documents to be extracted and convert them into a processable text format to form the first text.

[0053] Step 103: Input the first text and the document extraction template into the large language model, use the large language model to read and analyze the first text, and extract the first text according to the document extraction template to obtain the document extraction results corresponding to the document to be extracted.
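
Steps 101 to 103 can be read as a small pipeline: a template is prepared once, each document is converted to the first text, and both are passed to the model. The sketch below only illustrates that data flow. The function `extract_document`, the `convert` and `model` callables, and the simple concatenation of template and text are all assumptions; the patent does not prescribe a particular model interface.

```python
from typing import Callable


def extract_document(
    raw_document: str,
    template_text: str,                 # step 101: built once, reused per document
    convert: Callable[[str], str],      # step 102: any converter to processable text
    model: Callable[[str], str],        # step 103: stand-in for a large language model
) -> str:
    first_text = convert(raw_document)
    # Assumption: template and first text are joined into a single model input.
    model_input = f"{template_text}\n\n{first_text}"
    return model(model_input)


# Stand-in model so that the sketch runs without any external service.
def echo_model(model_input: str) -> str:
    return f"received {len(model_input)} characters"


result = extract_document("  doc  ", "T", str.strip, echo_model)
print(result)
```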

[0054] In practical applications, the execution entity of this large language model-based document extraction method can be a large language model-based document extraction system. Such a system can be implemented in various ways: as a computer program, such as application software; as a chip; as a medium storing the relevant computer program, such as a USB flash drive or cloud storage; or as a physical device that integrates or installs the relevant computer program, such as a server or smart device.

[0055] Specifically, step 101 includes: constructing a document extraction template based on the document extraction requirements.

[0056] The literature extraction requirements include extraction rules, format requirements, and pre-extraction fields. The pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group.

[0057] In this embodiment, extraction rules refer to specific instructions or standards used to guide the large language model in identifying and extracting specific information from documents during the document extraction process. These rules are typically formulated based on the document's structure, content characteristics, and research needs.

[0058] In this embodiment, format requirements refer to the specific specifications for the organization, presentation, and storage of information during information extraction. These requirements ensure that the extracted information has a uniform structure and format, facilitating subsequent analysis and comparison. For example, format requirements may include the format of the literature extraction results.

[0059] For example, basic research information includes, but is not limited to: the first author of the study, the year of publication, and the countries of the participants. Study design types include, but are not limited to: randomized controlled trials, cohort studies, case-control studies, cross-sectional studies, ecological studies, case reports, and case series.

[0060] In practical applications, in medical research, study groups refer to the allocation of research subjects (such as patients, subjects, samples, etc.) into different groups according to the research design in order to compare the effects of different treatments or interventions. Study grouping is a core part of experimental design, especially in research types such as randomized controlled trials, cohort studies, and case-control studies.

[0061] Specifically, the research groups are identified from the text information corresponding to the abstract, methods, and baseline information table subheadings. While fully preserving the original text, the groups are identified and extracted according to population characteristics and intervention methods. Control groups are extracted according to the original description, such as placebo and control.

[0062] For example, participant characteristics include, but are not limited to: age, health status, common surgical or drug treatments, occupation, gender, and other population characteristics explicitly stated in the text. Specifically, the participant characteristics of each group are extracted from the text information corresponding to the title, abstract, and methods subheadings, drawing also on the inclusion criteria and other relevant content in the literature. The characteristics shared by the participants in each group are extracted according to the original text description and organized into these six elements.

[0063] In medical research, an intervention refers to a treatment or intervention applied to research subjects to assess its impact on health outcomes. For example, interventions for each research group may include, but are not limited to: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0064] Specifically, based on the text information corresponding to the abstract, methods, and baseline information table subheadings, the interventions for each research group are extracted, including the drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0065] Background medication refers to the drugs or treatments used by all research groups (including experimental and control groups) during medical research, that is, drugs used simultaneously with intervention drugs and used in all groups.

[0066] Specifically, based on the text information corresponding to the methods subheading, the background medication for each research group is extracted, i.e., the medication used concurrently with the intervention drug and used in all groups. In practice, the literature extraction system uses the large language model to determine whether a drug is a background medication for each research group, and outputs the drug name or drug type according to the original text description, such as metformin or antidiabetic drugs.
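
The definition of background medication (used concurrently with the intervention drug and present in all groups) corresponds to a set intersection over the groups. In the patent this judgment is made by the large language model itself; the sketch below is only a plain illustration of the definition, and the function name, group names, and drug names other than metformin are hypothetical.

```python
def background_medications(drugs_by_group: dict[str, set[str]]) -> set[str]:
    """Return the drugs that appear in every research group."""
    groups = list(drugs_by_group.values())
    if not groups:
        return set()
    return set.intersection(*groups)


# Hypothetical per-group drug lists for illustration.
drugs = {
    "intervention": {"drug A", "metformin"},
    "control": {"placebo", "metformin"},
}
common = background_medications(drugs)
print(common)
```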

[0067] Understandably, by constructing a document extraction template, the specific requirements for document extraction are clarified, including extraction rules, format requirements, and pre-extraction fields. This gives the entire extraction process a clear goal and direction. Furthermore, the construction of the template standardizes and normalizes the extraction process. Large language models can operate directly based on the template without redefining extraction rules and formats each time, thereby greatly improving extraction efficiency.
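
A template of this kind can be represented as a small data structure that is built once and rendered for every document. The following is a minimal sketch: the class `ExtractionTemplate`, its attribute names, and the two example rules are hypothetical, while the six field names follow the pre-extraction fields listed in the text.

```python
from dataclasses import dataclass, field

# Field names follow the pre-extraction fields listed in the text.
PRE_EXTRACTION_FIELDS = [
    "basic research information",
    "research design type",
    "research groups",
    "participant characteristics",
    "intervention measures for each research group",
    "background medications for each research group",
]


@dataclass
class ExtractionTemplate:
    extraction_rules: list[str]
    format_requirements: list[str]
    pre_extraction_fields: list[str] = field(
        default_factory=lambda: list(PRE_EXTRACTION_FIELDS)
    )

    def render(self) -> str:
        """Render the template as plain text that is reused for every document."""
        lines = ["Extraction rules:"]
        lines += [f"- {rule}" for rule in self.extraction_rules]
        lines.append("Format requirements:")
        lines += [f"- {req}" for req in self.format_requirements]
        lines.append("Fields to extract:")
        lines += [f"- {name}" for name in self.pre_extraction_fields]
        return "\n".join(lines)


# The rule and format requirement below are illustrative assumptions.
template = ExtractionTemplate(
    extraction_rules=["Preserve the wording of the original text."],
    format_requirements=["Return one table row per research group."],
)
print(template.render())
```

Because the rules and format requirements live in the template rather than in each request, they do not have to be restated for every document.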

[0068] Further, step 102 includes: acquiring the document to be extracted and converting the document to be extracted into a processable text format to form a first text.

[0069] In practical applications, the literature to be extracted can be provided by the researcher. For example, a literature extraction system based on a large language model may have an input port or user interface through which researchers directly input the literature to be extracted. As another example, the system may be connected to a smart device, allowing researchers to send the literature to be extracted to the system via the smart device. As yet another example, the system may have a human-computer interaction interface that allows researchers to input keywords to search for relevant literature. Researchers can then select the literature to be extracted from the search results, and after selection, the system retrieves the literature and begins the extraction process.

[0070] In this embodiment, documents in various formats can be extracted. For example, the documents to be extracted can be PDF files, Word documents, HTML files, XML files, plain text files, Excel files, etc.

[0071] In this context, "processable text format" refers to a text format that facilitates subsequent natural language processing and information extraction. For example, a processable text format can be plain text, structured text (such as HTML and XML), tokenized text, JSON, etc. In practice, the processable text format can be determined based on the specific document extraction requirements and application scenarios; no specific limitations are imposed here.

[0072] Understandably, by supporting multiple document formats and converting them into a processable text format, it is possible to better adapt to different input needs and improve the flexibility and efficiency of document extraction. Acquiring the documents to be extracted and converting them into a processable text format to form the first text facilitates subsequent processing, improving the system's compatibility and versatility, enabling it to handle multiple document formats.
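
As an illustration of the conversion step, the sketch below turns two of the listed input formats (plain text and HTML) into a plain-text first text using only the Python standard library. The function names and the dispatch on a format string are assumptions; converters for PDF, Word, or Excel would need additional libraries and are not sketched here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_first_text(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)


def to_first_text(content: str, source_format: str) -> str:
    """Dispatch on the source format; only two formats are sketched."""
    if source_format == "txt":
        return content.strip()
    if source_format == "html":
        return html_to_first_text(content)
    raise ValueError(f"no converter sketched for format: {source_format}")


sample = "<html><body><h1>Methods</h1><p>Patients were randomized.</p></body></html>"
print(to_first_text(sample, "html"))
```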

[0073] Specifically, step 103 includes: inputting the first text and the document extraction template into the large language model, applying the large language model to read and analyze the first text, and extracting the first text according to the document extraction template to obtain the document extraction result corresponding to the document to be extracted.

[0074] In practical applications, the first text and the document extraction template are input into the large language model. The large language model utilizes its powerful language understanding capabilities to read and analyze the first text, accurately identifying and extracting key information from the document, reducing errors and omissions that may occur during manual extraction. Furthermore, the large language model extracts information from the first text according to the document extraction template, yielding document extraction results that meet the required standards.

[0075] In this embodiment, the format of the literature extraction results is not specifically limited. As an example, the format of the literature extraction results includes at least one of the following: extraction table, hierarchical diagram, tree diagram, and mind map.

[0076] Optionally, in one possible implementation, Figure 2 is the second flowchart of the document extraction method based on a large language model provided by the present invention. As shown in Figure 2, on the basis of the above embodiments, the document extraction result is an extraction table, and step 103 includes:

[0077] Step 201: Input the first text and the document extraction template into the large language model;

[0078] Step 202: Divide the first text according to the preset structured rules to obtain multiple text fragments, forming the second text;

[0079] Step 203: Apply the large language model to read and analyze the second text, and extract the information corresponding to the pre-extracted fields according to the extraction rules and format requirements to form the third text;

[0080] Step 204: Convert the third text into a table format to obtain the extraction table corresponding to the document to be extracted.
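
Step 204 can be illustrated with a small conversion routine. The patent does not state the format of the third text; the sketch below assumes it is a JSON array with one record per research group and writes it out as CSV. The function name `third_text_to_table`, the column names, and the example records are hypothetical.

```python
import csv
import io
import json


def third_text_to_table(third_text: str, columns: list[str]) -> str:
    """Convert a JSON array of records into CSV text with fixed columns."""
    records = json.loads(third_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",              # missing fields become empty cells
        extrasaction="ignore",   # unexpected fields are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Assumed shape of the third text, for illustration only.
third_text = json.dumps([
    {"research group": "intervention", "intervention": "drug A"},
    {"research group": "control", "intervention": "placebo"},
])
table = third_text_to_table(
    third_text, ["research group", "intervention", "background medication"]
)
print(table)
```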

[0081] In this application, no specific limitations are made on the method of inputting the first text and the document extraction template into the large language model. In one example, the first text and the document extraction template are directly input into the large language model. In another example, a prompt is generated based on the first text and the document extraction template, and the prompt is then input into the large language model.
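
The second option, generating a prompt from the first text and the template, might be sketched as follows. The function name `build_prompt`, the instruction sentence, and the section markers are assumptions for illustration; the patent does not give a prompt format.

```python
def build_prompt(first_text: str, template_text: str) -> str:
    """Combine the extraction template and the document text into one prompt."""
    return (
        "Extract information from the document below according to the template.\n\n"
        "### Template\n"
        f"{template_text.strip()}\n\n"
        "### Document\n"
        f"{first_text.strip()}\n"
    )


prompt = build_prompt("Body text.", "Fields: research groups")
print(prompt)
```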

[0082] The preset structured rules are division logic defined in advance based on the common format and content structure of a document. For example, the preset structured rules can be based on structural information such as titles, paragraphs, and keywords in the document. For instance, the document can be divided into multiple text fragments in paragraph order, with each text fragment consisting of 5 paragraphs.
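
The paragraph-based rule (fragments of 5 paragraphs, in order) can be sketched as below. It assumes that paragraphs in the first text are separated by blank lines, which depends on how the text was converted; the function name `split_by_paragraphs` is hypothetical.

```python
import re


def split_by_paragraphs(first_text: str, paragraphs_per_fragment: int = 5) -> list[str]:
    """Split text on blank lines, then group the paragraphs in order."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", first_text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_fragment])
        for i in range(0, len(paragraphs), paragraphs_per_fragment)
    ]


# Twelve short paragraphs yield fragments of 5, 5, and 2 paragraphs.
text = "\n\n".join(f"Paragraph {n}." for n in range(1, 13))
fragments = split_by_paragraphs(text)
print([fragment.count("Paragraph") for fragment in fragments])
```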

[0083] Optionally, the documents to be extracted can be divided according to their subheadings. In some examples, step 202 above specifically includes:

[0084] The first text is divided according to the subheadings in the documents to be extracted, and the text information corresponding to each subheading is obtained to form the second text; the subheadings include: title, abstract, background introduction, methods, and baseline information table.

[0085] Specifically, the first text is divided according to the subheadings in the document to be extracted, and the text information corresponding to each subheading is obtained to form the second text. Each subheading's text information is treated as a text segment, and each text segment corresponds to a logical part of the document.

[0086] In practical applications, the segmented text fragments are organized into a second text, with each fragment bearing a clear identifier (such as a subheading) to facilitate subsequent processing. The second text is a structured data collection; for example, it can be stored as a list or a dictionary.

[0087] Understandably, on the one hand, through structured segmentation, the system can quickly locate the areas containing key information, reducing the time spent processing invalid information. Automated segmentation reduces the time spent on manual reading and segmentation of documents, significantly improving extraction efficiency. On the other hand, structured segmentation enables the system to extract information more accurately, reducing errors caused by unclear information locations. Each segment has a clear identifier, facilitating targeted processing in subsequent steps. Furthermore, the segmented second text is a structured data set, facilitating subsequent information extraction and analysis. Through preset structured rules, the system can adapt to documents of different formats, improving its versatility and flexibility.
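
A dictionary-based second text, keyed by subheading, might be produced as in the sketch below. It assumes that each subheading appears alone on its own line and that any text before the first recognized subheading is the title; real articles use many heading variants, so the labels in `KNOWN_SUBHEADINGS` and the function name are illustrative only.

```python
# Assumed heading labels; not an exhaustive or authoritative list.
KNOWN_SUBHEADINGS = ("abstract", "background", "methods", "baseline information table")


def split_by_subheadings(first_text: str) -> dict[str, str]:
    """Return a mapping of subheading -> text found under that subheading."""
    sections: dict[str, list[str]] = {"title": []}
    current = "title"  # assumption: text before the first heading is the title
    for line in first_text.splitlines():
        label = line.strip().lower()
        if label in KNOWN_SUBHEADINGS:
            current = label
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


doc = (
    "Drug A versus placebo in adults\n"
    "Abstract\n"
    "We compared drug A with placebo.\n"
    "Methods\n"
    "Adults were randomized.\n"
    "All groups received metformin."
)
second_text = split_by_subheadings(doc)
print(second_text)
```

Each value in the resulting dictionary is one text fragment, and its key serves as the identifier mentioned above.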

[0088] Optionally, step 203 above includes:

[0089] Receive the second text and perform in-depth analysis on the text information corresponding to each subheading;

[0090] Using a large language model and an extraction table based on the PICO format, information corresponding to the pre-extracted fields is extracted one by one from the text information corresponding to each subtitle to form the third text.

[0091] PICO (Population, Intervention, Comparison, Outcome) is a framework widely used in evidence-based medicine and medical literature research to define and organize the key elements of a clinical or research question. It helps researchers and clinicians formulate research questions clearly, improving the efficiency and accuracy of literature retrieval, study design, and results analysis.

[0092] In practical applications, the literature extraction system based on the large language model is provided with a PICO-based extraction table developed through extensive discussion among experts in the field of evidence-based medicine. Using this table, the system extracts the key information from the second text one item at a time, according to the original text content: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group.

[0093] Optionally, in one possible implementation, using the large language model to extract the information corresponding to the pre-extracted fields one by one from the text information corresponding to each subtitle, according to the extraction table based on the PICO format, so as to form the third text, includes:

[0094] The basic research information and research design type are extracted from the text information corresponding to the title and abstract subtitle. The basic research information includes: the first author of the study, the year of publication, and the countries of the participants. The research design type includes at least one of the following: randomized controlled trial, cohort study, case-control study, cross-sectional study, ecological study, case report, and case series.

[0095] The research groups are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles.

[0096] The participant characteristics are extracted from the text information corresponding to the title, abstract, and methods subtitles.

[0097] The interventions for each research group are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles. The interventions for each research group include: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0098] The background medications for each research group are extracted from the text information corresponding to the methods subtitle.
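
The field-to-section correspondence in paragraphs [0094] to [0098] can be written down as a lookup table. The sketch below is one possible encoding; the dictionary keys, the section labels, and the helper `context_for_field` are names chosen for illustration and are not specified by this embodiment.

```python
# Sections each pre-extracted field is read from, following [0094]-[0098].
# Key and section names are illustrative.
FIELD_SOURCES = {
    "basic_research_information": ["Title", "Abstract"],
    "research_design_type": ["Title", "Abstract"],
    "research_groups": ["Abstract", "Methods", "Baseline information table"],
    "participant_characteristics": ["Title", "Abstract", "Methods"],
    "interventions": ["Abstract", "Methods", "Baseline information table"],
    "background_medications": ["Methods"],
}

def context_for_field(field: str, second_text: dict[str, str]) -> str:
    """Concatenate the labelled sections that a given field is extracted from."""
    parts = [
        f"[{section}]\n{second_text[section]}"
        for section in FIELD_SOURCES[field]
        if section in second_text  # skip sections this document does not contain
    ]
    return "\n\n".join(parts)
```

Restricting each field to its source sections keeps the text handed to the model short and makes it easier to trace an extracted value back to where it came from.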

[0099] Specifically, the system extracts the first author, publication year, participant countries, and study design type from the text information corresponding to the title and abstract subtitles. If the study is a randomized controlled trial (RCT), the system further extracts the trial registration platform and registration number.

[0100] Specifically, the study groups are identified from the text information corresponding to the abstract, methods, and baseline information table subtitles. The original wording is fully preserved, and the groups are identified and extracted according to population characteristics and intervention methods. Control groups are extracted as described in the original text, for example as placebo or control.

[0101] Specifically, the characteristics of the participants in each group are extracted from the text information corresponding to the title, abstract, and methods subtitles, together with the inclusion criteria stated in the document and other relevant content. The common characteristics of participants in each group are extracted according to the original text description and integrated into six elements: age, health status, common surgical or drug treatments, occupation, gender, and other population characteristics explicitly stated in the text.

[0102] Specifically, the interventions for each study group are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles, including the drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0103] Specifically, the background medication for each study group is extracted from the text information corresponding to the methods subtitle, i.e., the medication used concurrently with the intervention drug and used in all groups. In practice, the literature extraction system uses the large language model to determine whether a drug is a background medication for the study groups, and outputs the drug name or drug type according to the original text description, such as metformin or antidiabetic drugs.

[0104] Understandably, by applying a large language model to read and analyze the second text, and extracting information corresponding to preset fields according to extraction rules and format requirements to form the third text, the tedious process of manual extraction is reduced, and the extraction efficiency is improved. Furthermore, the large language model can understand complex language structures and contextual relationships, and extract information according to preset rules, thereby improving the accuracy and consistency of information extraction.

[0105] Optionally, in some possible implementations, the above-described study design type is a randomized controlled trial, and the above-described method further includes:

[0106] The trial registration platform and registration number are extracted from the text information corresponding to the methods subtitle.

[0107] Further, step 204 includes: converting the third text into a table format to obtain the extraction table corresponding to the document to be extracted.

[0108] Understandably, converting the third text into a table format to obtain and store the extraction table corresponding to the document to be extracted facilitates subsequent analysis and comparison, improves the readability and usability of the information, and supports further work by researchers. Therefore, the solution in this embodiment improves the accuracy and efficiency of document extraction.

[0109] It should be noted that the third text (i.e., the collection of extracted information) can be stored not only in tabular form but also in various other ways, depending on subsequent usage requirements and application scenarios. For example, the third text can be stored in JSON format, in a database (such as MySQL or MongoDB), in XML format, etc.
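
To make the storage options concrete, the sketch below writes the same extracted records as a CSV table and as JSON using only the Python standard library. It assumes the third text is held as a list of dictionaries with identical keys; that representation and the function names are assumptions for illustration.

```python
import csv
import io
import json

def to_csv(third_text: list[dict[str, str]]) -> str:
    """Render extracted records as a CSV table, one row per record."""
    if not third_text:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(third_text[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(third_text)
    return buffer.getvalue()

def to_json(third_text: list[dict[str, str]]) -> str:
    """Render the same records as a JSON array."""
    return json.dumps(third_text, ensure_ascii=False, indent=2)
```

The tabular form suits side-by-side comparison across documents, while JSON preserves nesting if a field later needs sub-fields.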

[0110] To further improve the accuracy of document extraction, the extraction results can be standardized. In one possible implementation, after step 103, the method further includes:

[0111] The literature extraction results are standardized, including at least one of the following: deleting useless descriptive information, standardizing information format, and adding marks to information that cannot be extracted;

[0112] Store the extracted results of the normalized literature.

[0113] In practical applications, removing useless descriptive information makes the extracted results more concise and clear, avoids interference from irrelevant information in subsequent analysis, improves data quality, reduces the complexity of subsequent processing, and saves time and computing resources. For example, after the interventions for each study group are extracted, the information is normalized by removing symbols such as "®" and "TM" from product names.
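
A minimal sketch of the symbol-removal example follows. It strips the "®" and "™" characters and the parenthesised forms "(R)" and "(TM)"; a bare "TM" suffix is deliberately left alone here because it cannot be told apart from letters that belong to a product name. The function name and the product names in the usage are hypothetical.

```python
import re

# Matches the registered/trademark characters and their parenthesised text forms.
_TRADEMARK = re.compile(r"[®™]|\((?:R|TM)\)")

def strip_trademark_symbols(product_name: str) -> str:
    """Remove trademark markers and tidy the whitespace left behind."""
    cleaned = _TRADEMARK.sub("", product_name)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For instance, `strip_trademark_symbols("DrugX® Inhaler™")` returns `"DrugX Inhaler"`.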

[0114] Understandably, a standardized information format ensures consistent formatting of the extracted information, facilitating subsequent comparison and analysis, and makes the extracted results easier to understand and use. For example, after the interventions for each study group are extracted, the dosage and dosing frequency are summarized in a "dosage / time" format, such as 0.1 mg / day; the dosing time is simplified to a number and a unit, such as 14 days; and the route of administration is summarized as inhalation, oral, or injection.
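
Two of the format rules above, summarizing the route of administration and simplifying the dosing time, could be approximated with simple string rules as sketched below. The keyword lists and function names are assumptions made for this sketch, and the substring matching is intentionally crude; in the described system the large language model performs this normalization according to the template.

```python
import re

# Illustrative keyword stems for each route category; checked in this order.
ROUTE_KEYWORDS = {
    "inhalation": ("inhal", "nebuli"),
    "oral": ("oral", "by mouth", "tablet", "capsule"),
    "injection": ("inject", "intraven", "subcutan", "intramusc"),
}

def normalize_route(description: str) -> str:
    """Map a free-text route description to inhalation, oral, or injection."""
    text = description.lower()
    for route, keywords in ROUTE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return route
    return "NA"  # route not recognised

def normalize_duration(description: str) -> str:
    """Reduce a phrase such as 'for a period of 14 days' to '14 days'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(day|week|month|year)s?\b", description.lower())
    if match is None:
        return "NA"
    number, unit = match.groups()
    return f"{number} {unit}" + ("" if number == "1" else "s")
```

Because "inhalation" is checked before "oral", a phrase such as "oral inhalation" is classified as inhalation.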

[0115] Understandably, clearly marking information that cannot be extracted helps researchers understand the completeness of the data, provides a reference for subsequent research or data supplementation, and prevents important information from being overlooked. Furthermore, marking missing information improves the reliability and credibility of the data. For example, the information corresponding to each pre-extracted field is checked, and "NA" is output if the relevant information is not mentioned in the original text.
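
The "NA" marking rule can be sketched as a completeness check over the pre-extracted fields. The function name `mark_missing` and the representation of a record as a dictionary are assumptions for illustration.

```python
def mark_missing(record: dict, fields: list[str], marker: str = "NA") -> dict:
    """Return a record that has every expected field, marking absent ones."""
    completed = {}
    for field in fields:
        value = record.get(field)
        # Treat a missing key, None, or a blank string as "not mentioned".
        if value is None or not str(value).strip():
            completed[field] = marker
        else:
            completed[field] = value
    return completed
```

Every output record then has the same set of keys, which keeps rows aligned when records from several documents are combined into one table.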

[0116] Optionally, normalization may also include removing redundant information, data cleaning, standardizing terminology, unifying date and numerical formats, unifying logical structures, and marking uncertain information, etc., without specific limitations.

[0117] In practical applications, researchers need to study multiple research papers. The method of this invention can extract information from multiple research papers in batches. Optionally, in one possible implementation, there are multiple papers to be extracted; after step 103 above, the method further includes:

[0118] Determine whether all documents to be extracted have been extracted; if all documents to be extracted have been extracted, integrate and output the extraction results corresponding to all documents to be extracted; otherwise, return to step 102 above.
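
The check-and-repeat logic above amounts to a loop over the documents. A minimal sketch follows, in which `extract_one` is a hypothetical stand-in for steps 102 and 103 applied to a single document.

```python
from typing import Callable, Iterable

def extract_batch(documents: Iterable[str], extract_one: Callable[[str], dict]) -> list[dict]:
    """Run single-document extraction over every document and collect the results."""
    results = []
    for document in documents:
        # Each iteration corresponds to returning to step 102 for the next document.
        results.append(extract_one(document))
    # The loop ends only when no documents remain, so the output is the integrated result.
    return results
```

The returned list holds one record per document, in input order, ready to be integrated into a single table.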

[0119] This implementation supports the extraction of data from multiple research articles, ensuring that all articles undergo a complete extraction process and integrating the results into a comprehensive bibliographic information table. This approach not only improves the completeness and efficiency of data processing but also provides high-quality data support for subsequent comprehensive analysis. Through automated judgment and iterative processing, the system can efficiently process a large number of articles, reducing human error and improving data reliability and consistency.

[0120] The document extraction method based on a large language model provided in this embodiment clarifies the specific requirements for document extraction by constructing a document extraction template, including extraction rules, format requirements, and pre-extraction fields. This gives the entire extraction process a clear goal and direction, and the template construction standardizes and normalizes the extraction process. The large language model can operate directly according to the template without redefining extraction rules and formats each time, thus greatly improving extraction efficiency. Furthermore, by converting the documents to be extracted into a processable text format to form the first text, subsequent processing is facilitated, improving the system's compatibility and versatility to handle various document formats. Further, the first text and the document extraction template are input into the large language model, which reads and analyzes the first text and extracts information according to the document extraction template to obtain the document extraction results corresponding to the documents to be extracted. This reduces the tedious process of manual extraction, improves extraction efficiency, and the large language model can understand complex language structures and contextual relationships, extracting information according to the document extraction template, improving the accuracy and consistency of extracted information. Therefore, the solution in this embodiment improves the accuracy and efficiency of document extraction.

[0121] The document extraction system based on a large language model provided by this invention is described below. The system described below and the document extraction method based on a large language model described above may be read with reference to each other.

[0122] Figure 3 is a schematic diagram of the document extraction system based on a large language model provided by the present invention. As shown in Figure 3, the document extraction system based on a large language model includes: a construction module 31, an acquisition module 32, and an extraction module 33.

[0123] The construction module 31 is used to construct a document extraction template based on the document extraction requirements.

[0124] The acquisition module 32 is used to acquire the documents to be extracted and convert them into a processable text format to form the first text.

[0125] The extraction module 33 is used to input the first text and the document extraction template into the large language model, apply the large language model to read and analyze the first text, and extract the first text according to the document extraction template to obtain the document extraction results corresponding to the document to be extracted.

[0126] In practical applications, a document extraction system based on a large language model can be implemented in various ways. For example, it can be implemented as a computer program, such as application software, or as a chip. It can also be implemented as a medium storing the relevant computer program, such as a USB flash drive or cloud storage, or as a physical device in which the relevant computer program is integrated or installed, such as a server or smart device.

[0127] Specifically, the construction module 31 is used to: construct a document extraction template based on the document extraction requirements.

[0128] The literature extraction requirements include extraction rules, format requirements, and pre-extraction fields. The pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group.
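
One way to picture the template is as a small data structure holding the three kinds of requirement. The sketch below is an assumption about representation only: the class name `ExtractionTemplate`, its attribute names, and the plain-text rendering are illustrative and are not prescribed by this embodiment.

```python
from dataclasses import dataclass, field

DEFAULT_FIELDS = [
    "basic research information",
    "research design type",
    "research groups",
    "participant characteristics",
    "intervention measures for each research group",
    "background medications for each research group",
]

@dataclass
class ExtractionTemplate:
    extraction_rules: list[str]
    format_requirements: list[str]
    pre_extraction_fields: list[str] = field(default_factory=lambda: list(DEFAULT_FIELDS))

    def render(self) -> str:
        """Serialize the template as plain text that can be passed to a model."""
        sections = [
            ("Extraction rules", self.extraction_rules),
            ("Format requirements", self.format_requirements),
            ("Fields to extract", self.pre_extraction_fields),
        ]
        return "\n\n".join(
            title + ":\n" + "\n".join(f"- {item}" for item in items)
            for title, items in sections
        )
```

Because the template is built once and reused, the rules and formats do not have to be restated for each document.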

[0129] In this embodiment, extraction rules refer to specific instructions or standards used to guide the large language model in identifying and extracting specific information from documents during the document extraction process. These rules are typically formulated based on the document's structure, content characteristics, and research needs.

[0130] In this embodiment, format requirements refer to the specific specifications for the organization, presentation, and storage of information during information extraction. These requirements ensure that the extracted information has a uniform structure and format, facilitating subsequent analysis and comparison. For example, format requirements may include the format of the literature extraction results.

[0131] For example, basic research information includes, but is not limited to: the first author of the study, the year of publication, and the countries of the participants. Study design types include, but are not limited to: randomized controlled trials, cohort studies, case-control studies, cross-sectional studies, ecological studies, case reports, and case series.

[0132] In medical research, study grouping refers to allocating research subjects (such as patients, subjects, or samples) into different groups according to the research design in order to compare the effects of different treatments or interventions. Study grouping is a core part of experimental design, especially in research types such as randomized controlled trials, cohort studies, and case-control studies.

[0133] Specifically, the study groups are identified from the text information corresponding to the abstract, methods, and baseline information table subtitles. The original wording is fully preserved, and the groups are identified and extracted according to population characteristics and intervention methods. Control groups are extracted as described in the original text, for example as placebo or control.

[0134] For example, participant characteristics include, but are not limited to: age, health status, common surgical or drug treatments, occupation, gender, and other population characteristics explicitly stated in the text. Specifically, the characteristics of the participants in each group are extracted from the text information corresponding to the title, abstract, and methods subtitles, together with the inclusion criteria stated in the document and other relevant content. The common characteristics of participants in each group are extracted according to the original text description and integrated into these six elements.

[0135] In medical research, an intervention refers to a treatment or other measure applied to research subjects in order to assess its impact on health outcomes. For example, the interventions for each research group may include, but are not limited to: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0136] Specifically, the interventions for each study group are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles, including the drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0137] Background medication refers to the drugs or treatments used by all research groups (including experimental and control groups) during medical research, that is, drugs used simultaneously with intervention drugs and used in all groups.

[0138] Specifically, the background medication for each study group is extracted from the text information corresponding to the methods subtitle, i.e., the medication used concurrently with the intervention drug and used in all groups. In practice, the literature extraction system uses the large language model to determine whether a drug is a background medication for the study groups, and outputs the drug name or drug type according to the original text description, such as metformin or antidiabetic drugs.

[0139] Understandably, by constructing a document extraction template, the specific requirements for document extraction are clarified, including extraction rules, format requirements, and pre-extraction fields. This gives the entire extraction process a clear goal and direction. Furthermore, the construction of the template standardizes and normalizes the extraction process. Large language models can operate directly based on the template without redefining extraction rules and formats each time, thereby greatly improving extraction efficiency.

[0140] Furthermore, the acquisition module 32 is used to: acquire the document to be extracted, and convert the document to be extracted into a processable text format to form the first text.

[0141] In practical applications, the literature to be extracted can be provided by the researcher. For example, a literature extraction system based on a large language model may have an input interface, allowing researchers to directly input the literature to be extracted. As another example, the system may be connected to a smart device, allowing researchers to send the literature to be extracted to the system via the smart device. As yet another example, the system may have a human-computer interaction interface, allowing researchers to enter keywords to search for relevant literature and select the literature to be extracted from the search results; after selection, the system retrieves the literature and begins the extraction process.

[0142] In this embodiment, documents in various formats can be extracted. For example, the documents to be extracted can be PDF files, Word documents, HTML files, XML files, plain text files, Excel files, etc.

[0143] In this context, "processable text format" refers to a text format that facilitates subsequent natural language processing and information extraction. For example, a processable text format can be plain text, structured text (such as HTML and XML), tokenized text, JSON, etc. In practice, the processable text format can be determined based on the specific document extraction requirements and application scenarios; no specific limitations are imposed here.
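
As one example of such a conversion, the sketch below reduces an HTML document to plain text with Python's standard `html.parser` module. It is illustrative only: the class and function names are assumptions, other input formats such as PDF or Word would need their own converters, and text inside inline tags ends up on separate lines in this simple version.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_plain_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)
```

Keeping each heading on its own line in the output preserves the cues that a later subtitle-based segmentation step would rely on.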

[0144] Understandably, by supporting multiple document formats and converting them into a processable text format, it is possible to better adapt to different input needs and improve the flexibility and efficiency of document extraction. Acquiring the documents to be extracted and converting them into a processable text format to form the first text facilitates subsequent processing, improving the system's compatibility and versatility, enabling it to handle multiple document formats.

[0145] Specifically, the extraction module 33 is used to: input the first text and the document extraction template into the large language model, apply the large language model to read and analyze the first text, and extract the first text according to the document extraction template to obtain the document extraction result corresponding to the document to be extracted.

[0146] In practical applications, the first text and the document extraction template are input into the large language model. The large language model utilizes its powerful language understanding capabilities to read and analyze the first text, accurately identifying and extracting key information from the document, reducing errors and omissions that may occur during manual extraction. Furthermore, the large language model extracts information from the first text according to the document extraction template, yielding document extraction results that meet the required standards.

[0147] In this embodiment, the format of the literature extraction results is not specifically limited. As an example, the format of the literature extraction results includes at least one of the following: extraction table, hierarchical diagram, tree diagram, and mind map.

[0148] Optionally, in one possible implementation, based on the above embodiments, the document extraction results are an extraction table; the extraction module 33 includes:

[0149] The input unit is used to input the first text and the document extraction template into the large language model;

[0150] The segmentation unit is used to divide the first text according to preset structured rules to obtain multiple text fragments, forming the second text;

[0151] The extraction unit is used to apply a large language model to read and analyze the second text, and extract the information corresponding to the pre-extracted fields according to the extraction rules and format requirements to form the third text;

[0152] The format conversion unit is used to convert the third text into a table format to obtain the extraction table corresponding to the document to be extracted.

[0153] In this application, no specific limitation is placed on how the first text and the document extraction template are input into the large language model. In one example, the input unit directly inputs the first text and the document extraction template into the large language model. In another example, the input unit generates a prompt based on the first text and the document extraction template, and inputs the prompt into the large language model.
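
The second input option, generating a prompt from the first text and the template, might look like the following sketch. The prompt wording, the function names, and the `model` parameter (any callable that takes a prompt string and returns a string) are assumptions; no particular model interface is implied by this application.

```python
from typing import Callable

def build_prompt(first_text: str, template_text: str) -> str:
    """Combine the extraction template and the document text into one prompt."""
    return (
        "Extraction template:\n" + template_text + "\n\n"
        "Article text:\n" + first_text + "\n\n"
        "Fill in every field using the article text only. "
        "Write NA for any field the article does not mention."
    )

def run_extraction(first_text: str, template_text: str, model: Callable[[str], str]) -> str:
    """Send the generated prompt to a model supplied by the caller."""
    return model(build_prompt(first_text, template_text))
```

Passing the model in as a callable keeps the sketch independent of any specific model provider and makes it easy to substitute a stub during testing.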

[0154] The preset structured rules are division logic defined in advance from the common format and content structure of documents. For example, the preset structured rules can rely on structural information such as titles, paragraphs, and keywords in the document. For instance, the document can be divided into multiple text segments in paragraph order, with each text segment consisting of 5 paragraphs.

[0155] Optionally, the documents to be extracted can be divided according to their subtitles. In some examples, the segmentation unit described above is specifically used for:

[0156] The first text is divided according to the subtitles in the documents to be extracted, and the text information corresponding to each subtitle is obtained to form the second text; the subtitles include: title, abstract, background introduction, methods, and baseline information table.

[0157] Specifically, the first text is divided according to the subheadings in the document to be extracted, and the text information corresponding to each subheading is obtained to form the second text. Each subheading's text information is treated as a text segment, and each text segment corresponds to a logical part of the document.

[0158] In practical applications, the segmented text fragments are organized into a second text, with each fragment bearing a clear identifier (such as a subtitle) to facilitate subsequent processing. The second text is a structured data collection; for example, it can be stored as a list or a dictionary.

[0159] Understandably, on the one hand, through structured segmentation, the system can quickly locate the areas containing key information, reducing the time spent processing invalid information. Automated segmentation reduces the time spent on manual reading and segmentation of documents, significantly improving extraction efficiency. On the other hand, structured segmentation enables the system to extract information more accurately, reducing errors caused by unclear information locations. Each segment has a clear identifier, facilitating targeted processing in subsequent steps. Furthermore, the segmented second text is a structured data set, facilitating subsequent information extraction and analysis. Through preset structured rules, the system can adapt to documents of different formats, improving its versatility and flexibility.

[0160] Optionally, the above extraction unit is specifically used for:

[0161] Receive the second text and perform in-depth analysis on the text information corresponding to each subheading;

[0162] Using a large language model and an extraction table based on the PICO format, information corresponding to the pre-extracted fields is extracted one by one from the text information corresponding to each subtitle to form the third text.

[0163] PICO (Population, Intervention, Comparison, Outcome) is a framework widely used in evidence-based medicine and medical literature research to define and organize the key elements of a clinical or research question. It helps researchers and clinicians formulate research questions clearly, improving the efficiency and accuracy of literature retrieval, study design, and results analysis.

[0164] In practical applications, the literature extraction system based on the large language model is provided with a PICO-based extraction table developed through extensive discussion among experts in the field of evidence-based medicine. Using this table, the system extracts the key information from the second text one item at a time, according to the original text content: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group.

[0165] Optionally, in one possible implementation, when applying the large language model to extract the information corresponding to the pre-extracted fields one by one from the text information corresponding to each subtitle, according to the extraction table based on the PICO format, so as to form the third text, the extraction unit described above is specifically used for:

[0166] The basic research information and research design type are extracted from the text information corresponding to the title and abstract subtitle. The basic research information includes: the first author of the study, the year of publication, and the countries of the participants. The research design type includes at least one of the following: randomized controlled trial, cohort study, case-control study, cross-sectional study, ecological study, case report, and case series.

[0167] The research groups are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles.

[0168] The participant characteristics are extracted from the text information corresponding to the title, abstract, and methods subtitles.

[0169] The interventions for each research group are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles. The interventions for each research group include: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0170] The background medications for each research group are extracted from the text information corresponding to the methods subtitle.

[0171] Specifically, the system extracts the first author, publication year, participant countries, and study design type from the text information corresponding to the title and abstract subtitles. If the study is a randomized controlled trial (RCT), the system further extracts the trial registration platform and registration number.

[0172] Specifically, the study groups are identified from the text information corresponding to the abstract, methods, and baseline information table subtitles. The original wording is fully preserved, and the groups are identified and extracted according to population characteristics and intervention methods. Control groups are extracted as described in the original text, for example as placebo or control.

[0173] Specifically, the characteristics of the participants in each group are extracted from the text information corresponding to the title, abstract, and methods subtitles, together with the inclusion criteria stated in the document and other relevant content. The common characteristics of participants in each group are extracted according to the original text description and integrated into six elements: age, health status, common surgical or drug treatments, occupation, gender, and other population characteristics explicitly stated in the text.

[0174] Specifically, the interventions for each study group are extracted from the text information corresponding to the abstract, methods, and baseline information table subtitles, including the drug brand name, manufacturer, dosage and frequency of administration, time of administration, and route of administration.

[0175] Specifically, the background medication for each study group is extracted from the text information corresponding to the methods subtitle, i.e., the medication used concurrently with the intervention drug and used in all groups. In practice, the literature extraction system uses the large language model to determine whether a drug is a background medication for the study groups, and outputs the drug name or drug type according to the original text description, such as metformin or antidiabetic drugs.

[0176] Understandably, by applying a large language model to read and analyze the second text, and extracting information corresponding to preset fields according to extraction rules and format requirements to form the third text, the tedious process of manual extraction is reduced, and the extraction efficiency is improved. Furthermore, the large language model can understand complex language structures and contextual relationships, and extract information according to preset rules, thereby improving the accuracy and consistency of information extraction.
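As an illustration only, the correspondence between pre-extraction fields and subheadings described in the preceding paragraphs can be sketched in Python as follows. The names `FIELD_SOURCES` and `select_context`, the dictionary keys, and the representation of the second text as a mapping from subheading to text are assumptions of this sketch and are not specified by the embodiment.

```python
# Hypothetical mapping from each pre-extraction field to the subheadings
# whose text is consulted for it (assumed names, following [0171]-[0175]).
FIELD_SOURCES = {
    "basic_research_information": ["title", "abstract"],
    "research_design_type": ["title", "abstract"],
    "research_groups": ["abstract", "methods", "baseline information table"],
    "participant_characteristics": ["title", "abstract", "methods"],
    "interventions": ["abstract", "methods", "baseline information table"],
    "background_medication": ["methods"],
}


def select_context(second_text: dict[str, str], field: str) -> str:
    """Join the segments of the second text that are relevant to one field."""
    parts = []
    for heading in FIELD_SOURCES[field]:
        segment = second_text.get(heading)
        if segment:  # skip subheadings that are absent or empty in this document
            parts.append(f"[{heading}]\n{segment}")
    return "\n\n".join(parts)
```

Restricting the text passed to the model in this way is one possible design choice; it keeps each query focused on the sections where the field is expected to appear.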

[0177] Optionally, in some possible implementations, the above-described study design type is a randomized controlled trial, and the extraction unit is further used to:

[0178] Extract the trial registration platform and registration number from the text information corresponding to the methods subheading.
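The conditional extraction for randomized controlled trials might be sketched as below. This is a minimal illustration under assumptions: the field names and both helper functions are hypothetical, and the identifier check covers only ClinicalTrials.gov numbers (the letters "NCT" followed by eight digits); other registries would need their own patterns. In the embodiment the extraction itself is performed by the large language model, so the pattern serves only as a plausibility check on its output.

```python
import re

# Hypothetical field names for the basic research information.
BASE_FIELDS = [
    "first_author",
    "publication_year",
    "participant_countries",
    "study_design_type",
]
RCT_FIELDS = ["registration_platform", "registration_number"]


def fields_to_extract(study_design_type: str) -> list[str]:
    """Return the basic fields, adding the registration fields for RCTs only."""
    fields = list(BASE_FIELDS)
    if study_design_type.strip().lower() == "randomized controlled trial":
        fields.extend(RCT_FIELDS)
    return fields


# Assumption of this sketch: only ClinicalTrials.gov identifiers are checked.
_NCT_ID = re.compile(r"NCT\d{8}")


def looks_like_nct_id(registration_number: str) -> bool:
    """Check whether a value has the shape of a ClinicalTrials.gov identifier."""
    return _NCT_ID.fullmatch(registration_number.strip()) is not None
```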

[0179] Furthermore, the format conversion unit is used to convert the third text into a table format to obtain the extraction table corresponding to the document to be extracted.

[0180] Understandably, converting the third text into a tabular format, and thereby obtaining and storing the extraction table corresponding to the document to be extracted, facilitates subsequent analysis and comparison, improves the readability and usability of the information, and supports further research by researchers. Therefore, the solution in this embodiment improves the accuracy and efficiency of document extraction.
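A minimal sketch of the format conversion is given below, assuming that the third text is available as a mapping from field name to extracted value. The column names, the function `to_extraction_table`, and the choice of CSV as the table format are assumptions of this sketch; the embodiment does not prescribe a particular tabular format.

```python
import csv
import io

# Hypothetical column names for the extraction table.
COLUMNS = [
    "first_author",
    "publication_year",
    "study_design_type",
    "research_groups",
    "participant_characteristics",
    "interventions",
    "background_medication",
]


def to_extraction_table(third_text: dict[str, str]) -> str:
    """Render one document's extracted fields as a header row plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    # Fields absent from the third text are written as "NA".
    writer.writerow({column: third_text.get(column, "NA") for column in COLUMNS})
    return buffer.getvalue()
```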

[0181] It should be noted that the third text (i.e., the collection of extracted information) can be stored not only in tabular form but also in various other ways, depending on subsequent usage requirements and application scenarios. For example, the third text can be stored in JSON format, in a database (such as MySQL or MongoDB), or in XML format.
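The alternative storage options can be illustrated with the short sketch below. It serializes an extraction record to JSON and stores it in a database table. SQLite from the Python standard library is used here purely as a stand-in for the databases named above; the table name, column names, and function names are hypothetical.

```python
import json
import sqlite3


def to_json(record: dict) -> str:
    """Serialize an extraction record as JSON, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)


def store_record(conn: sqlite3.Connection, doc_id: str, record: dict) -> None:
    """Store one record as a JSON payload keyed by a document identifier."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction (doc_id TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO extraction (doc_id, payload) VALUES (?, ?)",
        (doc_id, json.dumps(record, ensure_ascii=False)),
    )
    conn.commit()
```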

[0182] Furthermore, to further improve the accuracy of document extraction, the extraction results can be standardized. In one possible implementation, the system further includes:

[0183] The normalization module is used to normalize the literature extraction results. The normalization process includes at least one of the following: deleting useless descriptive information, standardizing information format, and adding marks to information that cannot be extracted.

[0184] The storage module is used to store the normalized document extraction results.

[0185] In practical applications, removing useless descriptive information makes the extracted results more concise and clear, avoids interference from irrelevant information in subsequent analysis, improves data quality, reduces the complexity of subsequent processing, and saves time and computing resources. For example, after the interventions for each study group are extracted, the information is normalized by removing symbols such as "®" and "TM" from product names.
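The symbol removal in this example might look like the following sketch. The function name and the exact set of symbols are assumptions: the sketch removes "®", "™", and the ASCII renderings "(R)" and "(TM)", and deliberately leaves a bare "TM" suffix untouched so that ordinary words ending in those letters are not altered.

```python
import re

# Registered and trademark signs, plus their parenthesized ASCII renderings.
_MARKS = re.compile(r"[®™]|\((?:R|TM)\)")


def strip_trademark_symbols(name: str) -> str:
    """Remove trademark symbols from a product name and tidy the whitespace."""
    cleaned = _MARKS.sub("", name)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For instance, a hypothetical product name "DrugX® 10 mg" would become "DrugX 10 mg".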

[0186] Understandably, a standardized information format ensures consistent formatting of extracted information, facilitating subsequent comparisons and analyses. Furthermore, a standardized format makes the extracted results easier to understand and use. For example, after the interventions for each study group are extracted, the dosage and dosing frequency are summarized in a "dosage / time" format, such as 0.1 mg / day; the dosing time is simplified to numbers and units, such as 14 days; and the route of administration is summarized as inhalation, oral, or injection.
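One way to approximate this format unification with fixed rules is sketched below. It is an illustration under assumptions, not the embodiment's procedure: the patterns, the function names, and the decision to multiply a per-dose amount by "once" or "twice" are choices made for this sketch, and any phrasing the patterns do not recognize is returned unchanged rather than guessed.

```python
import re
from decimal import Decimal

# Matches e.g. "0.1 mg once daily" or "10 mg twice daily" (assumed phrasings).
_DOSE = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|ml|iu)\s+"
    r"(?:(?P<times>once|twice)\s+)?(?P<period>daily|weekly)",
    re.IGNORECASE,
)
_TIMES = {"once": 1, "twice": 2}
_PERIOD = {"daily": "day", "weekly": "week"}

_DURATION = re.compile(r"(\d+(?:\.\d+)?)[\s-]*(day|week|month|year)s?\b", re.IGNORECASE)


def normalize_dose(text: str) -> str:
    """Summarize dosage and frequency as 'dosage / time'; pass through if unrecognized."""
    match = _DOSE.search(text)
    if match is None:
        return text.strip()
    times = _TIMES[match["times"].lower()] if match["times"] else 1
    total = Decimal(match["amount"]) * times  # Decimal avoids binary float noise
    period = _PERIOD[match["period"].lower()]
    return f"{total.normalize():f} {match['unit']} / {period}"


def normalize_duration(text: str) -> str:
    """Reduce a dosing time to a number and a unit, e.g. '14 days'."""
    match = _DURATION.search(text)
    if match is None:
        return text.strip()
    number, unit = match.group(1), match.group(2).lower()
    return f"{number} {unit}" + ("" if number == "1" else "s")
```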

[0187] Understandably, clearly marking information that cannot be extracted helps researchers understand the completeness of the data, providing a reference for subsequent research or data supplementation and preventing the omission of important information. Furthermore, marking missing information improves the reliability and credibility of the data. For example, the information corresponding to each pre-extraction field is checked, and "NA" is output if the relevant information is not mentioned in the original text.
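The marking step can be sketched as a simple check over the pre-extraction fields. The field names and the function `mark_missing` are hypothetical; the sketch assumes the extracted values arrive as a mapping from field name to text.

```python
# Hypothetical names for the pre-extraction fields.
PRE_EXTRACTION_FIELDS = (
    "basic_research_information",
    "research_design_type",
    "research_groups",
    "participant_characteristics",
    "interventions",
    "background_medication",
)


def mark_missing(record: dict) -> dict:
    """Return a record with every pre-extraction field present, using 'NA' for gaps."""
    marked = {}
    for field in PRE_EXTRACTION_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            marked[field] = value.strip()
        else:
            marked[field] = "NA"  # absent, empty, or whitespace-only
    return marked
```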

[0188] Optionally, normalization may also include removing redundant information, data cleaning, standardizing terminology, unifying date and numerical formats, unifying logical structures, and marking uncertain information, etc., without specific limitations.

[0189] In practical applications, researchers often need to study multiple research papers. The method of this invention can extract information from multiple research papers in batches. Optionally, in one possible implementation, there are multiple papers to be extracted; the system further includes:

[0190] The processing module is used to determine whether all documents to be extracted have been extracted. If all documents to be extracted have been extracted, the extraction results corresponding to all documents to be extracted are integrated and output. Otherwise, the process returns to the steps described above for obtaining documents to be extracted.

[0191] This implementation supports the extraction of data from multiple research articles, ensuring that all articles undergo a complete extraction process and integrating the results into a comprehensive bibliographic information table. This approach not only improves the completeness and efficiency of data processing but also provides high-quality data support for subsequent comprehensive analysis. Through automated judgment and iterative processing, the system can efficiently process a large number of articles, reducing human error and improving data reliability and consistency.
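The batch procedure of the processing module can be sketched as a loop that continues until no document remains. The function `extract_all` and the callable `extract_one`, which stands for the complete single-document extraction, are assumptions of this sketch.

```python
from collections import deque
from typing import Callable, Iterable


def extract_all(
    documents: Iterable[str], extract_one: Callable[[str], dict]
) -> list[dict]:
    """Extract every pending document, then return the integrated results."""
    pending = deque(documents)
    results: list[dict] = []
    # The loop condition plays the role of the check
    # "have all documents to be extracted been extracted?".
    while pending:
        document = pending.popleft()  # obtain the next document to be extracted
        results.append(extract_one(document))
    return results
```

The returned list holds one result per document in input order, which can then be combined into a single information table.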

[0192] The document extraction system based on a large language model provided in this embodiment has a construction module that clarifies the specific requirements for document extraction by building a document extraction template, including extraction rules, format requirements, and pre-extraction fields. This gives the entire extraction process a clear goal and direction, and the template construction standardizes and normalizes the extraction process. The large language model can operate directly according to the template without redefining extraction rules and formats each time, thus greatly improving extraction efficiency. Furthermore, the acquisition module converts the documents to be extracted into a processable text format to form the first text, facilitating subsequent processing and improving the system's compatibility and versatility, enabling it to handle multiple document formats. The extraction module inputs the first text and the document extraction template into the large language model, applies the large language model to read and analyze the first text, and extracts the document according to the document extraction template to obtain the document extraction results corresponding to the documents to be extracted. This reduces the tedious process of manual extraction, improves extraction efficiency, and the large language model can understand complex language structures and contextual relationships, enabling it to extract information according to the document extraction template, improving the accuracy and consistency of the extracted information. Therefore, the solution in this embodiment improves the accuracy and efficiency of document extraction.
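To make the roles of the template and the extraction module concrete, a minimal sketch follows. The class `ExtractionTemplate`, the prompt layout, and the `llm` callable (any function that takes a prompt string and returns the model's reply) are assumptions of this sketch; the embodiment does not fix a prompt format or a particular model interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExtractionTemplate:
    """Holds the three parts of the literature extraction requirements."""

    extraction_rules: tuple[str, ...]
    format_requirements: tuple[str, ...]
    pre_extraction_fields: tuple[str, ...]

    def render(self) -> str:
        """Lay the template out as plain text that can be reused for every document."""
        sections = [
            ("Extraction rules", self.extraction_rules),
            ("Format requirements", self.format_requirements),
            ("Fields to extract", self.pre_extraction_fields),
        ]
        return "\n\n".join(
            title + ":\n" + "\n".join(f"- {item}" for item in items)
            for title, items in sections
        )


def extract(first_text: str, template: ExtractionTemplate, llm: Callable[[str], str]) -> str:
    """Send the template and the first text to the model and return its reply."""
    prompt = template.render() + "\n\nDocument text:\n" + first_text
    return llm(prompt)
```

Because the template is built once and rendered unchanged for each document, the rules and formats do not need to be restated for every extraction.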

[0193] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 4, the electronic device may include: a processor 410, a communications interface 420, a memory 430, and a communication bus 440. The processor 410, communications interface 420, and memory 430 communicate with each other via the communication bus 440. The processor 410 can call logical instructions in the memory 430 to execute a literature extraction method based on a large language model. This method includes: constructing a literature extraction template according to literature extraction requirements; wherein the literature extraction requirements include extraction rules, format requirements, and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medication for each research group; acquiring the literature to be extracted and converting it into a processable text format to form a first text; inputting the first text and the literature extraction template into the large language model, applying the large language model to read and analyze the first text, and extracting the literature according to the literature extraction template to obtain the literature extraction results corresponding to the literature to be extracted.

[0194] Furthermore, the logical instructions in the aforementioned memory 430 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0195] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the literature extraction method based on the large language model provided by the above methods. The method includes: constructing a literature extraction template according to the literature extraction requirements; wherein, the literature extraction requirements include extraction rules, format requirements, and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medication for each research group; obtaining the literature to be extracted and converting the literature to be extracted into a processable text format to form a first text; inputting the first text and the literature extraction template into the large language model, applying the large language model to read and analyze the first text, and extracting the first text according to the literature extraction template to obtain the literature extraction result corresponding to the literature to be extracted.

[0196] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the document extraction method based on a large language model provided by the above methods. The method includes: constructing a document extraction template according to document extraction requirements; wherein the document extraction requirements include extraction rules, format requirements, and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group; acquiring the documents to be extracted and converting them into a processable text format to form a first text; inputting the first text and the document extraction template into a large language model, applying the large language model to read and analyze the first text, and extracting the first text according to the document extraction template to obtain the document extraction result corresponding to the documents to be extracted.

[0197] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0198] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0199] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A document extraction method based on a large language model, characterized in that it comprises:
constructing a literature extraction template based on literature extraction requirements; wherein the literature extraction requirements include extraction rules, format requirements, and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group;
obtaining the documents to be extracted and converting them into a processable text format to form a first text;
inputting the first text and the document extraction template into a large language model; dividing the first text into multiple text segments according to preset structured rules to form a second text, each text segment being the text information corresponding to a subheading; receiving the second text and analyzing in depth the text information corresponding to each subheading; applying the large language model to extract, according to the PICO format extraction table, the information corresponding to the pre-extraction fields one by one from the text information corresponding to each subheading to form a third text; and converting the third text into a table format to obtain the extraction table corresponding to the document to be extracted as the document extraction result;
wherein the literature extraction results include the study groups extracted from the text information corresponding to the subheadings of the abstract, methods, and baseline information table, and the background medications for each study group extracted from the text information corresponding to the methods subheading; the background medications are drugs or treatments used in conjunction with the intervention drugs that are common to all corresponding study groups;
after obtaining the literature extraction results corresponding to the literature to be extracted, the method further includes: standardizing the literature extraction results, the standardization process including at least one of the following: deleting useless descriptive information, unifying the information format, and adding a mark to information that cannot be extracted, the unifying of the information format including a unified format that summarizes dosage and administration frequency as dosage / time; and storing the standardized literature extraction results;
wherein applying the large language model to extract, according to the PICO format extraction table, the information corresponding to the pre-extraction fields one by one from the text information corresponding to each subheading to form the third text includes:
extracting basic research information and research design type from the text information corresponding to the title and abstract subheadings; wherein the basic research information includes: the first author of the study, the year of publication, and the country of the participants; and the research design type includes at least one of the following: randomized controlled trial, cohort study, case-control study, cross-sectional study, ecological study, case report, and case series;
extracting research groups from the text information corresponding to the subheadings of the abstract, methods, and baseline information table;
extracting participant characteristics from the text information corresponding to the title, abstract, and methods subheadings;
extracting intervention measures for each research group from the text information corresponding to the subheadings of the abstract, methods, and baseline information table; wherein the intervention measures for each research group include: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and method of administration; and
extracting background medications for each research group from the text information corresponding to the methods subheading.

2. The document extraction method based on a large language model according to claim 1, characterized in that there are multiple documents to be extracted; after obtaining the document extraction results corresponding to the documents to be extracted, the method further includes: determining whether all documents to be extracted have been extracted; if all documents to be extracted have been extracted, integrating and outputting the extraction results corresponding to all documents to be extracted; otherwise, returning to the step of obtaining documents to be extracted.

3. A document extraction system based on a large language model, characterized in that it comprises:
a construction module, used to build a literature extraction template based on literature extraction requirements; wherein the literature extraction requirements include extraction rules, format requirements, and pre-extraction fields, and the pre-extraction fields include: basic research information, research design type, research groups, participant characteristics, intervention measures for each research group, and background medications for each research group;
an acquisition module, used to acquire the documents to be extracted and convert the documents to be extracted into a processable text format to form a first text;
an extraction module, used to input the first text and the document extraction template into a large language model; divide the first text according to preset structured rules to obtain multiple text segments, forming a second text, each of the multiple text segments being the text information corresponding to a subheading; receive the second text and analyze in depth the text information corresponding to each subheading; apply the large language model to extract, according to the PICO format extraction table, the information corresponding to the pre-extraction fields one by one from the text information corresponding to each subheading to form a third text; and convert the third text into a tabular form to obtain the extraction table corresponding to the document to be extracted as the document extraction result;
wherein the literature extraction results include the study groups extracted from the text information corresponding to the subheadings of the abstract, methods, and baseline information table, and the background medications for each study group extracted from the text information corresponding to the methods subheading; the background medications are drugs or treatments used in conjunction with the intervention drugs that are common to all corresponding study groups;
the document extraction system is used to standardize the document extraction results, the standardization process including at least one of the following: deleting useless descriptive information, unifying the information format, and adding a mark to information that cannot be extracted, the unifying of the information format including a unified format that summarizes dosage and administration frequency as dosage / time;
the document extraction system is also used to store the standardized document extraction results;
the extraction module is further configured to:
extract basic research information and research design type from the text information corresponding to the title and abstract subheadings; wherein the basic research information includes: the first author of the study, the year of publication, and the country of the participants; and the research design type includes at least one of the following: randomized controlled trial, cohort study, case-control study, cross-sectional study, ecological study, case report, and case series;
extract research groups from the text information corresponding to the subheadings of the abstract, methods, and baseline information table;
extract participant characteristics from the text information corresponding to the title, abstract, and methods subheadings;
extract intervention measures for each research group from the text information corresponding to the subheadings of the abstract, methods, and baseline information table; wherein the intervention measures for each research group include: drug brand name, manufacturer, dosage and frequency of administration, time of administration, and method of administration; and
extract background medications for each research group from the text information corresponding to the methods subheading.

4. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the document extraction method based on a large language model as described in any one of claims 1 to 2.

5. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the document extraction method based on a large language model as described in any one of claims 1 to 2.