Document information extraction method and related products combining RPA, AI, and LLM to implement an AI Agent
The AI Agent automatically identifies target fields in documents and generates prompts and configuration parameters, solving the problem of low efficiency in manually writing prompts and achieving efficient and reliable extraction of unstructured document information.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING BENYING NETWORK TECH CO LTD
- Filing Date
- 2025-11-19
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, relying on manually written prompts to extract information from unstructured documents is inefficient and of inconsistent quality, and cannot quickly respond to massive and varied document types and extraction needs.
The AI Agent automatically identifies the target fields of the document to be processed, generates corresponding prompts and matching configuration parameters, and finally extracts metadata according to the configuration parameters, reducing the reliance on manually written prompts.
It significantly lowers the technical barrier to entry, improves the convenience and efficiency of information extraction, ensures the consistency and reliability of extraction results, and can quickly respond to changing document types and extraction needs.
Figure CN121542445B_ABST
Abstract
Description
Technical Field
[0001] The embodiments of this application relate to the field of information processing technology, and in particular to a document information extraction method and related products that combine RPA, AI, and LLM to implement an AI Agent.
Background Technology
[0002] Robotic Process Automation (RPA) uses specific "robot software" to simulate human operations on a computer and automatically execute process tasks according to rules.
[0003] Artificial intelligence (AI) is a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
[0004] Large Language Models (LLMs) are models trained on massive amounts of text that can recognize human language, perform language-related tasks, and have a large number of parameters.
[0005] Artificial Intelligence Agents (AI Agents) are capable of perceiving their environment, making decisions, and executing actions. Unlike traditional AI, they possess the ability to think and act independently, and can utilize tools to achieve given goals. AI Agents are based on Large Language Models (LLMs) as their core computing engine, enabling them to engage in dialogue, perform tasks, reason, and exhibit a degree of autonomy. They possess the ability to autonomously understand, perceive, plan, remember, and use tools, and can automate complex tasks. Specifically, LLM-driven AI Agents, composed of various AI capabilities, can interact with employees using natural language, understand their instructions and needs, and provide feedback and responses; they can acquire domain-specific knowledge relevant to the business to complete complex professional tasks; they can break down complex tasks into several executable tasks and use data and tools to complete them; they can also collaborate with employees, and AI Agents can collaborate with each other to complete complex tasks, enabling digital employees to leap from automation to intelligence, helping employees complete their work more efficiently, and fully realizing human-machine collaboration.
[0006] In the process of handling business, it is often necessary to extract various types of information from a large number of documents. However, these documents not only contain structured documents from which information can be directly extracted, but also a large number of unstructured documents, such as contracts, reports, invoices, and papers.
[0007] Extracting information from unstructured documents usually requires a Large Language Model (LLM). When an LLM is used for extraction, the extraction method often relies on pre-designed prompts. These prompts tell the LLM what information is required, thereby guiding it to extract that information accurately.
[0008] However, prompts often need to be written by professionals in the field, and the accuracy and reliability of the prompts affect whether the LLM can accurately extract the required information. Writing accurate and reliable prompts therefore requires professionals who have both domain expertise and proficient prompt engineering skills, which creates a high barrier to entry in practical applications. Furthermore, in the face of massive and varied document types and extraction requirements, relying on such professionals to write and debug prompts is inefficient, cannot respond quickly to requirements, and leaves the accuracy and reliability of the prompts in question.
Summary of the Invention
[0009] In view of this, embodiments of this disclosure propose a document information extraction method and related products for an AI Agent that combines RPA, AI, and LLM.
[0010] In a first aspect, embodiments of this disclosure provide a method for extracting document information using an AI Agent by combining RPA, AI, and LLM, executed by the AI Agent. The method includes:
[0011] Identify at least one target field in the document to be processed, where each target field is the metadata name of the target information in the document to be processed.
[0012] Generate prompts pointing to target information using each target field;
[0013] Generate configuration parameters that match the prompt words;
[0014] Extract various metadata of target information from the document to be processed according to the configuration parameters.
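The four steps of the first aspect can be sketched as a small pipeline. This is a minimal illustration only: the class `ExtractionPipeline` and the `toy_*` functions are hypothetical stand-ins for the LLM, RPA, and NLP components described in this disclosure, and the "Name: value" line convention is an assumption made so the sketch can run without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExtractionPipeline:
    """Wires the four steps together; every callable is a hypothetical stand-in."""
    identify_fields: Callable[[str], List[str]]
    build_prompt: Callable[[List[str]], str]
    build_config: Callable[[str, List[str]], Dict[str, dict]]
    extract: Callable[[str, str, Dict[str, dict]], Dict[str, str]]

    def run(self, document: str) -> Dict[str, str]:
        fields = self.identify_fields(document)        # identify target fields
        if not fields:
            raise ValueError("at least one target field is required")
        prompt = self.build_prompt(fields)             # generate the prompt
        config = self.build_config(prompt, fields)     # generate matching configuration
        return self.extract(document, prompt, config)  # extract metadata


def toy_identify(document: str) -> List[str]:
    # Assumption: each "Name: value" line carries one candidate field.
    return [line.split(":", 1)[0].strip()
            for line in document.splitlines() if ":" in line]


def toy_prompt(fields: List[str]) -> str:
    return "Extract the following fields: " + ", ".join(fields)


def toy_config(prompt: str, fields: List[str]) -> Dict[str, dict]:
    # One placeholder rule per target field.
    return {name: {"format": "string"} for name in fields}


def toy_extract(document: str, prompt: str,
                config: Dict[str, dict]) -> Dict[str, str]:
    result = {}
    for line in document.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            if name.strip() in config:
                result[name.strip()] = value.strip()
    return result


pipeline = ExtractionPipeline(toy_identify, toy_prompt, toy_config, toy_extract)
metadata = pipeline.run("order number: A-17\norder date: 2025-11-6")
```

In a real deployment each callable would wrap a model or service call; keeping them as injected functions makes it possible to replace one step without touching the others.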
[0015] In some alternative implementations, identifying at least one target field in the document to be processed includes:
[0016] The preset first large language model (LLM) is invoked to identify the document type and corresponding layout features of the document to be processed. The document type represents the content topic classification of the document to be processed, and the layout features represent the content layout of the document to be processed.
[0017] The first LLM is used to identify the fields at each location corresponding to the layout features in the document to be processed;
[0018] The first LLM is used to identify the target fields associated with the document type from the various fields.
[0019] Accordingly, before calling the preset first large language model (LLM) to identify the document type and corresponding layout features of the document to be processed, the following can also be performed:
[0020] If the document to be processed is in image format, the preset Vision Language Model (VLM) is invoked to convert the image-formatted document into a text format suitable for the first LLM.
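The last identification step, selecting from all located fields those associated with the document type, can be illustrated as follows. This is a simplified sketch: in the disclosure the first LLM performs the selection, whereas here a fixed lookup table stands in for it. The table `FIELDS_BY_TYPE` and the function `select_target_fields` are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Set

# Assumption: a static table replaces the LLM's knowledge of which fields
# belong to which document type.
FIELDS_BY_TYPE: Dict[str, Set[str]] = {
    "purchase order": {"order number", "order date", "supplier"},
    "bill": {"amount", "invoice date"},
}


def select_target_fields(document_type: str,
                         located_fields: Dict[str, str]) -> List[str]:
    """Keep only the located fields that are associated with the document type.

    located_fields maps a field name to the layout position where it was found.
    """
    wanted = FIELDS_BY_TYPE.get(document_type, set())
    return [name for name in located_fields if name.lower() in wanted]


located = {"Order number": "header", "Page": "footer", "Order date": "header"}
targets = select_target_fields("purchase order", located)
```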
[0021] In some optional implementations, prompts pointing to target information are generated using each target field, including:
[0022] Drive Robotic Process Automation (RPA) to input each target field into the first LLM;
[0023] Using the first LLM, each target field is combined into a structured semantic description according to the preset semantic specification, and used as prompt words.
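One possible shape of such a structured semantic description is sketched below. The disclosure does not specify the preset semantic specification, so the JSON layout, the key names, and the function `build_prompt` are all assumptions; the sketch also builds the prompt deterministically rather than through the first LLM.

```python
import json
from typing import List


def build_prompt(document_type: str, target_fields: List[str]) -> str:
    """Combine target fields into one structured description (hypothetical format)."""
    spec = {
        "task": "extract",
        "document_type": document_type,
        "fields": [
            {
                "name": name,
                "instruction": f"Return the value of '{name}', or null if it is absent.",
            }
            for name in target_fields
        ],
        "output": "json",
    }
    # ensure_ascii=False keeps non-ASCII field names readable in the prompt.
    return json.dumps(spec, ensure_ascii=False, indent=2)


prompt = build_prompt("purchase order", ["order number", "order date"])
```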
[0024] In some optional implementations, generating configuration parameters that match the prompt words includes:
[0025] Using a pre-defined Natural Language Processing (NLP) service, each target field in the prompt words is semantically matched with each pre-defined configuration template to obtain the corresponding semantic matching degree.
[0026] The configuration template with the highest semantic match to each target field is determined as the configuration parameter that matches the prompt word.
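The template selection can be sketched as a highest-score lookup. The NLP service is not specified in this disclosure; the sketch substitutes a plain token-overlap (Jaccard) score for the semantic matching degree, which is far cruder than a real semantic model. The template names and keyword strings in `TEMPLATES` are hypothetical.

```python
from typing import Dict, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(text.lower().replace("_", " ").split())


def match_degree(field: str, template_keywords: str) -> float:
    """Jaccard overlap of word sets, used here as a stand-in matching degree."""
    a, b = tokens(field), tokens(template_keywords)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_template(field: str, templates: Dict[str, str]) -> Tuple[str, float]:
    """Return the template with the highest matching degree for one target field."""
    scored = {name: match_degree(field, kw) for name, kw in templates.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical pre-defined configuration templates, described by keywords.
TEMPLATES = {
    "date_template": "date day issued signed",
    "amount_template": "amount total price sum",
    "identifier_template": "number id code",
}

chosen = {field: best_template(field, TEMPLATES)[0]
          for field in ["invoice date", "total amount", "order number"]}
```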
[0027] In some alternative implementations, the configuration parameters include at least one rule for extracting metadata;
[0028] Accordingly, configuration parameters that match the prompt words are generated, including:
[0029] Based on the target fields in the prompt words, construct rule generation instructions that point to each target field;
[0030] Invoke the preset second LLM, set at least one rule for each target field pointed to by the rule generation instruction to extract the corresponding metadata, and use it as the configuration parameter for the corresponding matching.
[0031] In some optional implementations, the configuration parameters include at least output rules, location rules, and extraction specifications; and
[0032] Accordingly, at least one rule is set for each target field pointed to by the rule generation instruction to extract the corresponding metadata, including:
[0033] According to the preset data format, set the target data format for each target field to output the corresponding metadata, and use it as the corresponding output rule;
[0034] Based on the layout characteristics, set the target location for each target field to locate the corresponding metadata, and use it as the corresponding location rule;
[0035] Set extraction specifications for each target field to unify the extraction actions of each metadata.
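The three kinds of rules can be pictured as a small configuration object. The sketch below is illustrative only: the class names `FieldRules` and `ExtractionConfig`, the attribute names, and the example specification strings are assumptions, and the rules are assembled by ordinary code rather than by the second LLM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FieldRules:
    output_format: str   # output rule: target data format of the metadata
    location_hint: str   # location rule: where in the layout the metadata sits


@dataclass
class ExtractionConfig:
    extraction_spec: List[str]                       # shared by all target fields
    fields: Dict[str, FieldRules] = field(default_factory=dict)


def build_config(layout: Dict[str, str],
                 formats: Dict[str, str]) -> ExtractionConfig:
    """Build per-field output and location rules plus one common extraction spec."""
    spec = ["trim surrounding whitespace",
            "return null when the value is absent"]
    config = ExtractionConfig(extraction_spec=spec)
    for name, region in layout.items():
        config.fields[name] = FieldRules(
            output_format=formats.get(name, "string"),  # default format is assumed
            location_hint=region,
        )
    return config


config = build_config(
    layout={"order date": "header, top right", "amount": "table footer"},
    formats={"order date": "YYYY-MM-DD"},
)
```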
[0036] In some optional implementations, metadata related to the target information is extracted from the document to be processed according to configuration parameters, including:
[0037] The pre-defined third LLM is invoked, and the third LLM is used to extract the metadata corresponding to each target field from the document to be processed according to the extraction specifications and each positioning rule;
[0038] The structured metadata is output according to the output rules corresponding to each target field, thus obtaining the target information containing each metadata.
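Applying the output rules to raw extracted values might look like the following sketch. The rule names ("date", "amount", "string") and the formatter functions are hypothetical; the raw values are assumed to have already been returned by the third LLM.

```python
from datetime import datetime
from typing import Callable, Dict, Optional, Union

Value = Union[str, float, None]


def to_iso_date(raw: str) -> str:
    # Normalises dates such as "2025-11-6" to "2025-11-06".
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date().isoformat()


def to_amount(raw: str) -> float:
    # Drops thousands separators before converting.
    return float(raw.replace(",", "").strip())


FORMATTERS: Dict[str, Callable[[str], Value]] = {
    "date": to_iso_date,
    "amount": to_amount,
    "string": str.strip,
}


def apply_output_rules(raw: Dict[str, Optional[str]],
                       output_rules: Dict[str, str]) -> Dict[str, Value]:
    """Format each extracted value according to the output rule of its field."""
    structured: Dict[str, Value] = {}
    for name, rule in output_rules.items():
        value = raw.get(name)
        structured[name] = None if value is None else FORMATTERS[rule](value)
    return structured


target_information = apply_output_rules(
    {"order date": "2025-11-6", "amount": " 1,250.50 "},
    {"order date": "date", "amount": "amount", "buyer": "string"},
)
```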
[0039] In some alternative implementations, after extracting the target information's metadata from the document to be processed according to the configuration parameters, the following can also be performed:
[0040] Based on artificial intelligence (AI) technology, determine whether the metadata in the target information is complete and accurate;
[0041] In response to the determination that any item in the target information's metadata is incomplete and / or inaccurate, the first LLM for generating prompt words is adjusted; and / or
[0042] Adjust the natural language processing service that generates configuration parameters or the second LLM that generates configuration parameters.
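The completeness and accuracy check can be approximated with simple validation rules. The disclosure leaves the AI technique open, so the regular-expression check below is only a stand-in, and the pattern set and function names are hypothetical. A non-empty report would be the trigger for adjusting the first LLM, the NLP service, or the second LLM.

```python
import re
from typing import Dict, List, Optional


def check_metadata(metadata: Dict[str, Optional[str]],
                   patterns: Dict[str, str]) -> Dict[str, List[str]]:
    """Report which fields are missing (incomplete) or malformed (inaccurate)."""
    report: Dict[str, List[str]] = {"incomplete": [], "inaccurate": []}
    for name, pattern in patterns.items():
        value = metadata.get(name)
        if value is None or not value.strip():
            report["incomplete"].append(name)
        elif not re.fullmatch(pattern, value.strip()):
            report["inaccurate"].append(name)
    return report


def needs_adjustment(report: Dict[str, List[str]]) -> bool:
    return bool(report["incomplete"] or report["inaccurate"])


# Hypothetical expected shapes for three target fields.
PATTERNS = {
    "order date": r"\d{4}-\d{1,2}-\d{1,2}",
    "order number": r"[A-Z]+-\d+",
    "amount": r"\d+(\.\d+)?",
}

report = check_metadata(
    {"order date": "2025-11-6", "order number": "17", "amount": ""},
    PATTERNS,
)
```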
[0043] In a second aspect, embodiments of this disclosure provide a document information extraction device that combines RPA, AI, and LLM to implement an AI Agent. The device includes:
[0044] The recognition module is configured to recognize at least one target field in the document to be processed, where each target field is the metadata name of the target information in the document to be processed.
[0045] The prompt word generation module is configured to generate prompt words pointing to target information using each target field;
[0046] The configuration parameter generation module is configured to generate configuration parameters that match the prompt words;
[0047] The target information extraction module is configured to extract various metadata of target information from the document to be processed according to the configuration parameters.
[0048] In some optional implementations, the recognition module is further configured to:
[0049] The preset first large language model (LLM) is invoked to identify the document type and corresponding layout features of the document to be processed. The document type represents the content topic classification of the document to be processed, and the layout features represent the content layout of the document to be processed.
[0050] The first LLM is used to identify the fields at each location corresponding to the layout features in the document to be processed;
[0051] The first LLM is used to identify the target fields associated with the document type from the various fields.
[0052] Accordingly, before calling the preset first large language model (LLM) to identify the document type and corresponding layout features of the document to be processed, the recognition module also performs:
[0053] If the document to be processed is in image format, the preset Vision Language Model (VLM) is invoked to convert the image-formatted document into a text format suitable for the first LLM.
[0054] In some optional implementations, the prompt word generation module is further configured to:
[0055] Drive Robotic Process Automation (RPA) to input each target field into the first LLM;
[0056] Using the first LLM, each target field is combined into a structured semantic description according to the preset semantic specification, and used as prompt words.
[0057] In some optional implementations, the parameter generation module is further configured to:
[0058] Using a pre-defined Natural Language Processing (NLP) service, each target field in the prompt words is semantically matched with each pre-defined configuration template to obtain the corresponding semantic matching degree.
[0059] The configuration template with the highest semantic match to each target field is determined as the configuration parameter that matches the prompt word.
[0060] In some alternative implementations, the configuration parameters include at least one rule for extracting metadata;
[0061] Accordingly, the parameter generation module is further configured as follows:
[0062] Based on the target fields in the prompt words, construct rule generation instructions that point to each target field;
[0063] Invoke the preset second LLM, set at least one rule for each target field pointed to by the rule generation instruction to extract the corresponding metadata, and use it as the configuration parameter for the corresponding matching.
[0064] The configuration parameters include at least the output rules, the location rules, and the extraction specifications;
[0065] Accordingly, at least one rule is set for each target field pointed to by the rule generation instruction to extract the corresponding metadata, including:
[0066] According to the preset data format, set the target data format for each target field to output the corresponding metadata, and use it as the corresponding output rule;
[0067] Based on the layout characteristics, set the target location for each target field to locate the corresponding metadata, and use it as the corresponding location rule;
[0068] Set extraction specifications for each target field to unify the extraction actions of each metadata.
[0069] In some optional implementations, the target information extraction module is further configured to:
[0070] The pre-defined third LLM is invoked, and the third LLM is used to extract the metadata corresponding to each target field from the document to be processed according to the extraction specifications and each positioning rule;
[0071] The structured metadata is output according to the output rules corresponding to each target field, thus obtaining the target information containing each metadata.
[0072] Accordingly, after extracting the target information's metadata from the document to be processed according to the configuration parameters, the target information extraction module can execute:
[0073] Based on artificial intelligence (AI) technology, determine whether the metadata in the target information is complete and accurate;
[0074] In response to the determination that any item in the target information's metadata is incomplete and / or inaccurate, the first LLM for generating prompt words is adjusted; and / or
[0075] Adjust the natural language processing service that generates configuration parameters or the second LLM that generates configuration parameters.
[0076] In a third aspect, embodiments of this disclosure provide an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
[0077] In a fourth aspect, embodiments of this disclosure provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described in any implementation of the first aspect.
[0078] In a fifth aspect, embodiments of this disclosure provide a computer program product, including a computer program / instructions that, when executed by a processor, implement the method described in any implementation of the first aspect.
[0079] To address the inefficiency and inconsistent quality of manually written prompts, this disclosure provides a document information extraction method and related products that combine RPA, AI, and LLM to implement an AI Agent. The AI Agent automatically identifies target fields in the document to be processed, generates corresponding prompts and matching configuration parameters, and finally extracts metadata based on the configuration parameters. This significantly reduces the reliance on manually written prompts for document information extraction. Professionals with both domain knowledge and prompt engineering skills are no longer required; ordinary users can extract information from various types of documents, including unstructured documents, which effectively lowers the technical barrier to entry and improves the convenience of the extraction operation.
[0080] Meanwhile, the entire extraction process, from target field identification to prompt generation, configuration parameter matching, and metadata extraction, does not require manual intervention or debugging at each step. As a result, it can respond quickly to massive and varied document types and extraction requirements and complete information extraction tasks efficiently, greatly improving extraction efficiency and scalable processing capability. Furthermore, the accurate matching of configuration parameters ensures the consistency and reliability of the extraction results.
Description of the Drawings
[0081] Other features, objects, and advantages of this disclosure will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings. The drawings are for illustrative purposes only and are not intended to limit the invention. In the drawings:
[0082] Figure 1 is an exemplary system architecture diagram to which an embodiment of this disclosure may be applied;
[0083] Figure 2A is a flowchart of an embodiment of the document information extraction method that combines RPA, AI, and LLM to implement an AI Agent according to this disclosure;
[0084] Figure 2B is a flowchart of decomposition process 2010 according to an embodiment of step 201 of this disclosure;
[0085] Figure 2C is a flowchart of decomposition process 2030 according to an embodiment of step 203 of this disclosure;
[0086] Figure 3 is a diagram of execution flow 300, a specific example of the document information extraction method that combines RPA, AI, and LLM to implement an AI Agent according to this disclosure;
[0087] Figure 4 is a schematic structural diagram of a document information extraction device that combines RPA, AI, and LLM to implement an AI Agent according to an embodiment of the present disclosure;
[0088] Figure 5 is a schematic structural diagram of a computer system suitable for implementing the document information extraction method of the embodiments of the present disclosure, which combines RPA, AI, and LLM to implement an AI Agent.
Detailed Implementation
[0089] The present disclosure will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0090] In the description of this disclosure, the term "multiple" means two or more.
[0091] In the description of this disclosure, the term "prompt" (also called "prompt word") refers to a descriptive statement or semantic text, written in natural language or as structured instructions, that guides a Large Language Model (LLM) in extracting target information from a document to be processed. Its core function is to turn the requirement of which target information to extract into a descriptive statement or semantic text that the LLM can understand, in which the target information fields are clearly defined.
[0092] In the description of this disclosure, the term "document type" refers to a classification identifier formed based on the content theme, business purpose and information organization logic of the target document. The document type can be determined by at least one semantic content carried by the target document. For example, the target document is identified as a purchase order by the "order number" and / or "purchase list" in the target document, and the target document is identified as a bill by the "amount" and / or "invoice date" in the target document.
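The keyword-based examples above can be turned into a small classifier sketch. It counts how many indicative phrases of each type occur in the text; the table `TYPE_KEYWORDS` reuses the phrases named in this paragraph, while the function name and the hit-counting rule are assumptions (in the disclosure the first LLM determines the document type).

```python
from typing import Dict, List, Optional

# Indicative phrases per document type, taken from the examples in the text.
TYPE_KEYWORDS: Dict[str, List[str]] = {
    "purchase order": ["order number", "purchase list"],
    "bill": ["amount", "invoice date"],
}


def classify_document(text: str,
                      type_keywords: Dict[str, List[str]] = TYPE_KEYWORDS
                      ) -> Optional[str]:
    """Return the type whose phrases occur most often, or None if none occur."""
    lowered = text.lower()
    best_type, best_hits = None, 0
    for doc_type, keywords in type_keywords.items():
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits > best_hits:  # on a tie, the earlier type in the table is kept
            best_type, best_hits = doc_type, hits
    return best_type


order_type = classify_document("Order Number: A-17\nPurchase list: 3 items")
bill_type = classify_document("Invoice date: 2025-11-6, amount: 100")
```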
[0093] In the description of this disclosure, the term "layout features" refers to the structural attributes of a target document as it is visually presented, reflecting the content layout, element positions, and formatting specifications of the target document, and can be identified by a Vision Language Model (VLM).
[0094] In the description of this disclosure, the term "metadata" refers to the specific data or content that makes up the target information. Each piece of metadata has a metadata name that identifies the category of the corresponding metadata. For example, "order date" is the metadata name, and the corresponding metadata is "2025-11-6".
[0095] In the description of this disclosure, the Vision Language Model (VLM) is a cross-modal model that integrates computer vision technology and natural language processing (NLP) technology. It can simultaneously understand the semantic and visual features of a document, such as layout features, spatial location of visual elements, document attributes, and semantic content of text, and establish a correlation mapping between the two. That is, the content identified by computer vision technology is transformed into structured data or semantic data suitable for LLM.
[0096] In the description of this disclosure, NLP technology is an interdisciplinary technology that integrates computer science, linguistics, and artificial intelligence. It can use computers to understand, analyze, and process human natural language, such as text content, business terms, and semantic logic in documents, and transform natural language text into structured information.
[0097] The implementation of the Digital Employee Platform (WEP) has gone through three stages. The first stage is automation: for tasks of low business complexity, Robotic Process Automation (RPA) uses software automation technology to automate rule-based, predefined procedural tasks. The second stage is intelligence: AI extends the boundaries of RPA, for example by processing unstructured documents and making data-driven decisions. The third stage is human-machine collaboration: the understanding, planning, and execution capabilities of large models are used to automate complex tasks end to end.
[0098] In the human-machine collaboration stage, the digital employee platform serves as a bridge connecting workers and systems, workers and data, and systems and data. It can operate complex systems, process various types of data, and interact and collaborate with employees. The digital employee platform helps industries build digital employees enabled by large models (i.e., intelligent agents), achieving automation, intelligence, and human-machine collaboration in business processes.
[0099] The digital employee platform can seamlessly integrate multiple capabilities such as Agentic Process Automation (APA), Agentic Document Processing (ADP), and Agentic Business Insights (ABI). It has five major functions: "business understanding", "process creation", "run anywhere", "centralized management and control" and "human-machine collaboration". It enables enterprises to achieve end-to-end intelligent automation of business processes, replace manual operations, further improve business efficiency, and accelerate digital transformation.
[0100] ADP is a next-generation document processing solution based on LLM and VLM that, combined with AI Agent technology, enables end-to-end automated document processing. It is no longer a "tool" that requires configuring templates or annotating samples, but an "intelligent agent" capable of understanding business needs and autonomously planning and executing. With a traditional document processing system, users must explicitly tell the system "how to do it"; with ADP, users only need to tell the system "what to do," and the system autonomously understands, plans, and executes.
[0101] It should be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other. This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0102] Figure 1 shows an exemplary system architecture 100 to which the document information extraction method and related products of the present disclosure, which combine RPA, AI, and LLM to implement an AI Agent, can be applied.
[0103] As shown in Figure 1, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium that provides communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
[0104] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various artificial intelligence (AI) based models and services can be installed on terminal devices 101, 102, and 103, such as VLM, LLM, RPA services, and NLP services. These AI-based models and services collectively constitute an artificial intelligence agent (AI Agent) capable of performing predetermined tasks.
[0105] Terminal devices 101, 102, and 103 can be hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with visual acquisition devices, document reading devices, and displays to receive input documents to be processed. Terminal devices 101, 102, and 103 include, but are not limited to, smartphones, tablets, e-book readers, laptops, and desktop computers. When terminal devices 101, 102, and 103 are hardware, each terminal device can independently carry a WEP, on which an AI Agent composed of the aforementioned AI-based models and services is deployed. Alternatively, each terminal device can carry one or more of the aforementioned AI-based models and / or services, and the AI Agent can be composed of the AI-based models and services carried by each of terminal devices 101, 102, and 103.
[0106] When terminal devices 101, 102, and 103 are software, they can be installed on the terminal devices listed above and used to receive input documents to be processed. They can be implemented as multiple software programs or software modules, in which case each terminal device as software can act as a WEP on which a complete AI Agent composed of the aforementioned models and services is deployed. Alternatively, they can be implemented as a single software program or software module, in which case each terminal device as software can act as a module composed of one or more AI-based models and / or services. No specific limitations are made here.
[0107] In some cases, the document information extraction method for implementing an AI Agent by combining RPA, AI, and LLM provided in this disclosure can be executed individually by terminal devices 101, 102, and 103, or it can be executed jointly by terminal devices 101, 102, and 103. Accordingly, the document information extraction device for implementing an AI Agent by combining RPA, AI, and LLM can be set in terminal devices 101, 102, and 103. In this case, system architecture 100 may not include server 105.
[0108] In some cases, the document information extraction method for implementing an AI Agent by combining RPA, AI, and LLM provided in this disclosure can be jointly executed by terminal devices 101, 102, and 103 and server 105. For example, the step of "generating prompt words pointing to target information using each target field" can be executed by terminal devices 101, 102, or 103, and the steps of "extracting various metadata of target information from the document to be processed according to configuration parameters" can be executed by server 105. This disclosure does not limit this. Correspondingly, the document information extraction device for implementing an AI Agent by combining RPA, AI, and LLM can also be respectively set in terminal devices 101, 102, and 103 and server 105.
[0109] In some cases, the document information extraction method that combines RPA, AI, and LLM to implement an AI Agent provided in this disclosure can be executed by server 105 alone, and the execution results of server 105 can be displayed through terminal devices 101, 102, and 103. In this case, server 105 carries the WEP, on which the AI Agent composed of the above-mentioned models and services is deployed, and can receive input documents to be processed. Correspondingly, the document information extraction device that combines RPA, AI, and LLM to implement an AI Agent can also be set in server 105. In this case, system architecture 100 may not include terminal devices 101, 102, and 103.
[0110] It should be noted that server 105 can be either hardware or software. When server 105 is hardware, it can be implemented as a distributed server cluster consisting of multiple servers, each of which can receive input documents to be processed; alternatively, it can be implemented as a single server. When server 105 is software, it can be implemented as multiple software programs or software modules (for example, to provide the aforementioned AI-based models and services, or to provide an AI Agent composed of the aforementioned models and services); alternatively, it can be implemented as a single software program or software module. No specific limitations are imposed here.
[0111] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0112] In some alternative implementations, refer to Figure 2A, which shows flow 200 of the document information extraction method that combines RPA, AI, and LLM to implement an AI Agent according to this disclosure. The method is executed by an AI Agent deployed in WEP and includes the following steps 201 to 204:
[0113] Step 201: Identify at least one target field in the document to be processed.
[0114] The document to be processed is the document from which the target information is to be extracted. It can be a structured document, such as a data table in a database, or an unstructured document, such as a contract, report, or invoice. When the document to be processed is an unstructured document, it can be a text document or an image document.
[0115] The target field can be the metadata name of each metadata item in the target information to be extracted.
[0116] Based on the aforementioned application scenarios, the documents to be processed may be presented in either text or image format. Because a VLM integrates computer vision technology and NLP technology, it can understand the semantic and visual features of a document simultaneously. Accordingly, the AI Agent deployed in WEP can call the VLM and input the documents to be processed into it for parsing.
[0117] During the parsing process of the document to be processed by VLM, when the document to be processed is in image format, VLM can identify the visual elements in the document to be processed and extract the text content contained therein. Thus, the text content can be used to transform the image format document to be processed into a text format that is suitable for LLM.
[0118] VLM can clean the extracted text content to remove text content that contains obvious errors, repetitions, or missing substantive data.
[0119] While the VLM parses the document to be processed, if the document is too long, the VLM can split it into multiple document fragments according to specific needs. In the following steps, each document fragment is then treated as the actual document to be processed, replacing the original document. This allows the AI Agent to perform better and more efficiently in the subsequent steps.
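The fragment splitting described above could be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name split_into_fragments, the max_chars limit, and the choice to break at blank-line paragraph boundaries are all assumptions, since the disclosure only states that the document is split "according to specific needs".

```python
def split_into_fragments(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into fragments of at most max_chars characters.

    Hypothetical helper: the size limit and the paragraph-based splitting
    strategy are assumptions made for illustration only.
    """
    fragments: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single paragraph longer than the limit is cut into hard slices.
        while len(paragraph) > max_chars:
            if current:
                fragments.append(current)
                current = ""
            fragments.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Otherwise, pack paragraphs together until the limit would be exceeded.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            fragments.append(current)
            current = paragraph
    if current:
        fragments.append(current)
    return fragments
```

Each returned fragment would then stand in for the original document in steps 201 to 204.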
[0120] In addition, VLM can also process other document formats, such as PDF (Portable Document Format), Excel, and Word documents.
[0121] Based on the text-formatted document output by the VLM, the AI Agent deployed in WEP can call the first LLM and drive the preset RPA service. The RPA service can input the text-formatted document into the first LLM according to the preset process.
[0122] Furthermore, by running the first LLM, the document attribute information of the document to be processed can be identified, yielding the document to be processed, or its text content, annotated with that document attribute information. The required target fields, that is, the metadata names of the required metadata, can then be extracted from this document or text content. The number of target fields can be one or more.
[0123] Step 202: Generate prompt words pointing to target information using each target field.
[0124] Based on the target fields determined in step 201 above, the AI Agent deployed in WEP can use the first LLM to generate prompt words applicable to the document to be processed by utilizing each target field.
[0125] Specifically, after the AI Agent deployed in WEP starts the RPA service, the RPA service can input the target fields into the first LLM according to the preset process. The first LLM in this step can be the same LLM as the first LLM in step 201 above. For example, the first LLM may contain network structures with different functions, in which case the RPA service inputs the target fields into the network structure of the first LLM that performs the function of generating prompt words. Alternatively, after the first LLM outputs the target fields, the same first LLM is reused as the LLM for generating prompt words, and the target fields are input back into it. In another case, the first LLM in this step and the first LLM in step 201 above can also be two different LLMs.
[0126] Furthermore, since the target information to be extracted includes various metadata, and each target field is the metadata name of each metadata, each target field can be regarded as an identifier that refers to each metadata before extracting each metadata.
[0127] Based on this, the first LLM can compose each target field into a structured semantic description, such as a semantic statement, and use the semantic description as a prompt word.
[0128] In the process of composing a semantic description, semantic specifications can be pre-defined. These may include explicit limitations on the content of the semantic description, the actions to be performed, and the objects on which the actions are performed.
[0129] For example, the pre-defined semantic specifications may include: requiring the target field to be reflected in the semantic description, requiring the action to be an extraction action, and requiring the extraction action to be performed on the document to be processed.
[0130] Accordingly, the first LLM can, according to the preset semantic specifications, combine each target field into a semantic description representing the above-mentioned limited content, and use it as the corresponding prompt word.
[0131] In a specific example, taking the document to be processed as a purchase order, the target fields as customer order number, order date, customer name, and purchase list, and the action to be performed as an extraction action, the first LLM can combine the customer order number, order date, customer name, and purchase list into the following prompt words:
[0132] "Extract the following key information from the document: customer order number, order date, customer name, and purchase list containing material number, material name, specifications, unit and quantity."
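The composition of target fields into prompt words under a preset semantic specification can be sketched as below. This is an illustrative stand-in only: in the disclosure the first LLM composes the semantic description, whereas this sketch uses a fixed sentence template. The names build_prompt, join_with_and, and table_columns are hypothetical.

```python
from typing import Optional


def join_with_and(items: list[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_prompt(
    target_fields: list[str],
    table_columns: Optional[dict[str, list[str]]] = None,
) -> str:
    """Compose prompt words that satisfy the example semantic specification.

    The specification assumed here: every target field appears in the text,
    the action is an extraction, and the object is the document.
    """
    if not target_fields:
        raise ValueError("at least one target field is required")
    table_columns = table_columns or {}
    parts = []
    for name in target_fields:
        columns = table_columns.get(name)
        # A table-like field lists its columns, as the purchase list does.
        parts.append(f"{name} containing {join_with_and(columns)}" if columns else name)
    return f"Extract the following key information from the document: {join_with_and(parts)}."
```

Called with the four purchase order fields and the five purchase list columns, the function yields a sentence of the same shape as the prompt words quoted above.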
[0133] Step 203: Generate configuration parameters that match the prompt words.
[0134] The configuration parameters are the rules or specifications that guide the extraction of each metadata item; that is, the configuration parameters are instructions that constrain the metadata extraction actions.
[0135] Based on the prompt words determined in step 202 above, the AI Agent deployed in WEP can obtain the configuration parameters in either of two ways: by adaptation, in which existing configuration parameters are matched to the prompt words, or by creation, in which new configuration parameters are generated for the prompt words.
[0136] Specifically, since the prompt word contains various target fields, and each target field refers to various metadata, the characteristics of the corresponding metadata can be determined based on each target field. Thus, the extraction specifications, location rules, and output rules can be set for each metadata item based on the characteristics of each metadata item. In other words, configuration parameters matching the prompt word can be obtained.
[0137] In a specific example, when the matching configuration parameters are obtained through creation, they can be generated by calling the preset second LLM.
[0138] In a specific example, during the process of generating corresponding matching configuration parameters through adaptation, the AI Agent deployed in WEP can call a preset database and a preset rule mapping engine to match the configuration parameters. The rule mapping engine is equipped with an NLP service for performing semantic matching.
[0139] The database includes multiple configuration templates, each of which may contain one or more rules or specifications. The rules or specifications in each configuration template may be the same as or different from those mentioned above. The configuration template only includes the category of the rules or specifications, but not the specific content of the rules or specifications.
[0140] For example, a configuration template may include requirements for setting output rules for the output format of the metadata item, but it does not include specific restrictions on what data format to use. In other words, the configuration template only indicates which aspects of the rules or specifications need to be set, but it is not instantiated.
[0141] Since the target fields in each prompt word and the rules or specifications in each configuration template are usually represented in semantic form, the NLP service can be used to perform semantic matching between each target field and each configuration template, obtaining the degree of semantic matching between each target field and each configuration template.
[0142] During the semantic matching process, the semantic similarity between each target field and each configuration template can be calculated separately, and the semantic similarity can be used as the degree of semantic matching between the two.
[0143] Furthermore, for each target field, based on the semantic matching degree between the target field and each configuration template, the configuration template with the highest semantic matching degree can be used as the configuration template corresponding to the target field.
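The selection of the best-matching configuration template can be sketched as follows. The disclosure does not specify how the NLP service computes semantic similarity, so the token-overlap (Jaccard) score below is only a simple stand-in. The function names and the template descriptions in the test data are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of tokens; a placeholder for a real semantic score."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_templates(target_fields: list[str], templates: dict[str, str]) -> dict[str, str]:
    """Map each target field to the template with the highest matching degree.

    templates maps a template name to a textual description of the rules
    or specifications that the template asks to be set.
    """
    if not templates:
        raise ValueError("at least one configuration template is required")
    return {
        field: max(templates, key=lambda name: similarity(field, templates[name]))
        for field in target_fields
    }
```

A production system would more plausibly replace similarity with an embedding-based score; the selection logic (take the maximum per target field) would stay the same.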
[0144] Based on the configuration templates corresponding to each target field, the AI Agent deployed in WEP can call the preset database and preset NLP service, use the NLP service to instantiate according to the configuration template corresponding to each target field, and obtain the configuration parameters corresponding to each target field after instantiation.
[0145] In other cases, configuration templates include not only the categories of rules or specifications, but also the specific content of each rule or specification. After selecting the configuration template with the highest semantic matching degree using NLP services, it can be directly used as the configuration parameter corresponding to the target field.
[0146] Step 204: Extract the target information metadata from the document to be processed according to the configuration parameters.
[0147] Based on the configuration parameters corresponding to each target field determined in step 203 above, the AI Agent deployed in WEP can load the configuration parameters into a preset extraction execution engine to extract the metadata items. The extraction execution engine can be, for example, an LLM. This LLM can be the same as, or different from, the first LLM and/or the second LLM in the previous steps. When it is different from them, it is referred to as the third LLM.
[0148] In this step, when the configuration parameters are loaded into the extraction execution engine, the document to be processed can be loaded into the extraction execution engine together, and various metadata can be extracted from the document to be processed. In some cases, the text content with various document attribute information obtained in the aforementioned step 201 can be used to replace the document to be processed, and the text content can be loaded into the extraction execution engine, and various metadata can be extracted from the text content.
[0149] In a specific example, when the extraction specifications, location rules, and output rules for the metadata are set in the configuration parameters, the extraction execution engine, i.e., the third LLM, can locate the corresponding metadata according to the location rules in the configuration parameters and perform the extraction action according to the extraction specifications in the configuration parameters. After the corresponding metadata is extracted, it is further output according to the output rules in the configuration parameters.
[0150] During the output of this metadata, the extracted metadata can be combined into structured target information and then output.
[0151] As can be seen, automatically identifying the target fields in the document to be processed eliminates the need for manual pre-sorting of metadata names and provides a precise basis for subsequent prompt word generation. Generating prompt words pointing to the target information from the target fields removes the expertise otherwise required to design prompt words manually, and because the generated prompt words point directly to the target information, the extraction direction does not deviate. Matching the configuration parameters to the prompt words avoids the manual formulation of extraction rules, and extracting metadata according to the configuration parameters completes the extraction without manual intervention. Therefore, no manual intervention or debugging is required at any stage of extracting the target information, significantly reducing labor costs and improving the overall efficiency of information extraction. Furthermore, the coordinated operation of the steps supports the accuracy and reliability of information extraction.
[0152] In some alternative implementations, refer to Figure 2B, which illustrates a decomposition process 2010 of one embodiment of step 201 of this disclosure. The decomposition process 2010 includes the following steps 2011 to 2013:
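The way steps 201 to 204 chain together can be sketched as a small pipeline. This is an architectural illustration under stated assumptions: the class name ExtractionPipeline and the four callables are hypothetical placeholders for the VLM, LLM, RPA, and NLP services of the disclosure, none of which is actually invoked here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractionPipeline:
    """Chains the four steps; each callable is a stand-in for a model or service."""

    identify_fields: Callable[[str], list[str]]  # step 201
    generate_prompt: Callable[[list[str]], str]  # step 202
    generate_config: Callable[[str], dict]       # step 203
    extract: Callable[[str, dict], dict]         # step 204

    def run(self, document: str) -> dict:
        # The output of each step is the only input the next step needs,
        # which is why no manual hand-off is required between steps.
        fields = self.identify_fields(document)
        prompt = self.generate_prompt(fields)
        config = self.generate_config(prompt)
        return self.extract(document, config)
```

Because the steps are injected as callables, the same LLM or different LLMs can back steps 201 to 204, as the disclosure allows.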
[0153] Step 2011: Call the preset first large language model LLM to identify the document type and corresponding layout features of the document to be processed.
[0154] During the process of identifying various target fields in the document to be processed, the AI Agent deployed in WEP can call VLM to parse the document to be processed, which can convert the identified visual elements into text content, thus obtaining the document to be processed in text format after parsing.
[0155] Furthermore, the AI Agent deployed in WEP can input the text-formatted document to be processed into the first LLM. The first LLM can then identify the document type, layout features, and text content of the document to be processed, and determine the identified document type and layout features as the document attribute information of the document to be processed.
[0156] Based on this, text content or documents to be processed containing document attribute information can be obtained.
[0157] Step 2012: Use the first LLM to identify the fields at each location corresponding to the layout features in the document to be processed.
[0158] Based on the text content with document attribute information determined in step 2011 above, the first LLM identifies each field in the text content.
[0159] Specifically, the text content carries document attribute information, and the layout features in the document attribute information reflect the content layout, element positions, and format specifications of the document to be processed. That is, the layout features point to the position and layout of each field in the document to be processed.
[0160] Therefore, based on the layout features, the first LLM can accurately and effectively locate each field at the position the layout features point to, and identify each field.
[0161] Step 2013: Use the first LLM to identify the target fields associated with the document type from the fields.
[0162] The target information to be extracted from the document to be processed is usually the key information in the document, and each piece of key information is usually closely related to the document type in the document attribute information.
[0163] Accordingly, based on the fields determined in the aforementioned steps, the first LLM can identify fields closely related to the document type from the various fields and determine such fields as the target fields.
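The progression from document attributes to target fields in steps 2011 to 2013 can be sketched as a data flow. In the disclosure, the first LLM makes each of these judgments. The lookup table key_fields_by_type below is a hypothetical stand-in for the LLM's knowledge of which fields matter for a document type, and the DocumentAttributes structure is likewise assumed.

```python
from dataclasses import dataclass


@dataclass
class DocumentAttributes:
    """Document attribute information from step 2011 (structure is assumed)."""

    document_type: str
    layout_features: dict[str, str]  # field name -> where it sits in the layout


def select_target_fields(
    attributes: DocumentAttributes,
    key_fields_by_type: dict[str, set[str]],
) -> list[str]:
    # Step 2012: the fields found at the positions the layout features point to.
    located_fields = list(attributes.layout_features)
    # Step 2013: keep only the fields associated with the document type.
    key_fields = key_fields_by_type.get(attributes.document_type, set())
    return [name for name in located_fields if name in key_fields]
```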
[0164] As can be seen, after the document to be processed is converted from image format to text format, the progressive operation of first identifying the document type, then identifying the document fields, and finally filtering the target fields achieves accurate positioning and filtering of target fields, laying a reliable foundation for subsequent information extraction. Specifically, calling the first LLM to identify the document type clarifies the business attributes and information scope of the document, so that subsequent field identification has a clear direction. Calling the first LLM to identify each field based on the document attribute information ensures that all information units in the document are covered and no potential key content is missed. Using the first LLM to select the target fields associated with the document type, and to eliminate redundant fields irrelevant to the core needs of the document, focuses the subsequent extraction actions on valuable key information only. As a result, the entire process requires no manual judgment of document type and no manual field filtering, which reduces the impact of subjective human error on the field filtering results, improves the efficiency and accuracy of target field identification, and ensures that the subsequently generated prompt words and configuration parameters match the actual extraction needs of the document.
[0165] In some alternative implementations, further refer to Figure 2C, which illustrates a flow 2030 of another embodiment of step 203 of this disclosure. Flow 2030 includes steps 2031 to 2032:
[0166] Step 2031: Construct rule generation instructions pointing to each target field according to the prompt words.
[0167] When the matching configuration parameters are generated through creation, a rule generation instruction can be constructed first.
[0168] The rule generation instruction is an instruction used to guide the second LLM to generate configuration parameters containing at least one rule for extracting metadata. In other words, the rule generation instruction can specifically provide the second LLM with an indication of the generated configuration parameters.
[0169] Therefore, since the rules contained in the configuration parameters are used to extract metadata, and the metadata corresponds to each target field, the rule generation instruction can be generated according to each target field in the prompt words.
[0170] Specifically, the rule generation instruction can be in the form of a semantic description to provide the second LLM with instructions pointing to various target fields, such as "set configuration parameters for the order number in the prompt word".
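Constructing a rule generation instruction from the prompt words and target fields can be sketched as below. The per-field sentence follows the example quoted above. The function name, the overall layout of the instruction text, and the decision to name the three rule kinds (extraction specification, location rule, output rule, which the disclosure mentions as one possible set of configuration parameters) are assumptions.

```python
RULE_KINDS = ("extraction specification", "location rule", "output rule")


def build_rule_generation_instruction(prompt_words: str, target_fields: list[str]) -> str:
    """Build one instruction text that points the second LLM at each target field.

    Hypothetical format: the first line repeats the prompt words, followed by
    one line per target field naming the rules to be set for it.
    """
    lines = [f"Prompt words: {prompt_words}"]
    for field in target_fields:
        lines.append(
            f"Set configuration parameters for the {field} in the prompt words: "
            + ", ".join(RULE_KINDS)
            + "."
        )
    return "\n".join(lines)
```

The resulting text would be passed to the second LLM together with the prompt words in step 2032.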
[0171] The second LLM used to generate configuration parameters can be the same LLM as, or a different LLM from, the first LLM and/or the third LLM in the aforementioned steps.
[0172] Step 2032: Invoke the preset second LLM, set at least one rule for each target field pointed to by the rule generation instruction to extract the corresponding metadata, and use it as the configuration parameter for the corresponding matching.
[0173] Based on the rule generation instructions constructed in step 2031 above, the AI Agent deployed in WEP can call the second LLM, load the previously determined prompt words into the second LLM, and instruct the second LLM to generate configuration parameters based on the prompt words according to the rule generation instructions.
[0174] Since the rule generation instruction points to the target field that needs to set various rules and specifications, the second LLM can generate various rules and specifications for each target field in the prompt word to extract the corresponding metadata, so that the corresponding metadata can be extracted in subsequent steps according to the rules and specifications.
[0175] In a specific example, if the expected configuration parameters include the extraction specification when extracting the metadata, the location rule when locating the metadata, and the output rule when outputting the metadata, the corresponding extraction specification, location rule, and output rule can be set for each target field while the second LLM generates the configuration parameters.
[0176] Specifically, in the process of setting positioning rules, since the layout features in the document attribute information can reflect the position layout of each field, the positioning rules corresponding to each target field can be determined according to the layout features, and used to locate the corresponding metadata when extracting the metadata.
[0177] During the process of setting output rules, an appropriate data format can be set for the corresponding target field, and that data format is used as the output rule when outputting the corresponding metadata.
[0178] In setting extraction specifications, since the extraction specifications constrain the extraction of metadata in general, separate extraction specifications can be set for the metadata corresponding to each target field to ensure that the extraction of metadata is effective. Alternatively, a unified extraction specification can be set for the metadata corresponding to all target fields, so that the extraction action is performed under the same specification throughout.
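How an output rule might be applied to an extracted value can be sketched as follows. This is an assumption-laden illustration: the function name apply_output_rule, the list of accepted input date formats, and the choice to return an empty string for an unparseable date are not taken from the disclosure, which only states that a data format is set as the output rule.

```python
from datetime import datetime

# Assumed input formats that a raw extracted date might arrive in.
DATE_INPUT_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")


def apply_output_rule(value: str, field_type: str) -> str:
    """Format an extracted value according to its output rule (field_type)."""
    value = value.strip()
    if field_type == "date":
        for fmt in DATE_INPUT_FORMATS:
            try:
                # Normalize every accepted input format to YYYY-MM-DD.
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        # Leave the value blank rather than guess at an unparseable date.
        return ""
    # Any other field type is output as a trimmed string in this sketch.
    return value
```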
[0179] Based on this, the second LLM is guided by rule generation instructions to generate configuration parameters for each target field, realizing the automated generation of configuration parameters and providing a clear and suitable execution basis for subsequent metadata extraction. Specifically, constructing rule generation instructions clearly defines the objects for which the second LLM generates configuration parameters. Calling the second LLM sets configuration parameters for each target field, so that the configuration parameters corresponding to each target field match its metadata extraction requirements. This eliminates the need to design configuration rules manually one by one, reducing manual operation costs and rule design errors. It also ensures that the generated configuration parameters are closely related to the target information in the prompt words, providing key support for subsequent extraction of metadata according to specifications and supporting the accuracy and consistency of extraction results.
[0180] In some alternative implementations, refer to Figure 3, which illustrates the execution flow 300 of a specific example of this disclosure. The execution flow 300 is performed by an AI Agent deployed in WEP and includes the following steps 301 to 311.
[0181] In this specific example, a purchase order in image format (PDF) is used as a specific example of a document to be processed. In this example, each LLM can be the same LLM or different LLMs.
[0182] Once the purchase order is obtained, step 301 in Figure 3 can be executed to parse the document.
[0183] In this step, VLM can be used to convert the image-format PDF document into a text-format document to be processed.
[0184] Furthermore, step 302 in Figure 3 can be executed to generate the prompt words.
[0185] In this step, LLM can be used to extract the document attribute information and text content of the text-formatted document to be processed, and the document type, layout features and text content of the purchase order can be determined after parsing.
[0186] In this step, a prompt generation instruction can be pre-built to guide the LLM in generating prompts. After inputting the obtained document type, layout features, and text content into the LLM, the prompt generation instruction is run to drive and guide the LLM to identify target fields in the document to be processed according to the document type and layout features. The identified target fields include customer order number, order date, customer name, and purchase list in the purchase order.
[0187] Based on this, according to the preset semantic specifications, the above customer order number, order date, customer name, and purchase list can be combined into a structured semantic description, resulting in the following prompt words:
[0188] "Please help me extract the customer order number, order date, customer name, and purchase order list containing material number, material name, specifications, unit, and quantity from the document."
[0189] In some cases, the prompt words can be generated based on a single current purchase order or on multiple purchase orders, i.e., on multiple documents to be processed of the same document type. When the prompt words are generated based on multiple documents to be processed, they generalize better.
[0190] Furthermore, based on the obtained prompts, configuration parameters corresponding to each target field can be generated through adaptation or creation.
[0191] When the configuration parameters are determined through adaptation, step 303 can be executed after completing step 302 to match the configuration template.
[0192] In this step, based on multiple pre-set configuration templates, NLP services can be used to determine the semantic matching degree between the above prompt words and each configuration template, and the configuration template with the highest semantic matching degree can be selected.
[0193] The selected configuration template shows the following rules and specifications that can be set:
[0194] common_rules:
[0195] field_name:
[0196] field_type:
[0197] description:
[0198] Wherein, field_name represents the applicable object of the configuration template to be populated, such as the target field, field_type represents the output rule of the target field to be populated, description represents the positioning rule of the target field to be populated, and common_rules represents the extraction specification of the target field to be populated.
[0199] As can be seen, this configuration template only contains the rules and specifications that need to be set, but does not contain the specific settings of the rules and specifications.
[0200] In addition, if the metadata corresponding to the target field is output in a table format, then each column or row in the table can be used as the applicable object of the above configuration template.
[0201] In some cases, a unified extraction standard can be followed when extracting metadata corresponding to various target fields. Based on this, step 304 can be further executed to instantiate the configuration template.
[0202] In this step, based on the configuration template determined above, LLM can be used to fill in the corresponding settings for each target field according to the configuration template, thereby completing the instantiation of the configuration template.
[0203] Specifically, the instantiated configuration template can be represented as follows:
[0204] common_rules: 1. Output information must be in compact JSON format, without indentation or line breaks. 2. Field names must strictly match the prompt words; adding unmentioned fields is prohibited. 3. Field values must originate from explicitly identified or clearly inferred content in the document. 4. Table data must be extracted completely, preserving its row and column structure. 5. If field content is missing or unclear, it can be left blank, but fictitious content must not be created.
[0205] When extracting metadata corresponding to each target field using a unified extraction standard, five extraction standards applicable to each target field can be used to populate common_rules. Here, JSON represents JavaScript object notation.
[0206] field_name: Customer order number; field_type: string
[0207] description: A unique order number provided by the purchaser, usually located at the top of the order or in the order number identifier, consisting of letters and numbers.
[0208] field_name: Order date
[0209] field_type: date
[0210] description: Order creation date, usually located near the order number, and should be in the format YYYY-MM-DD.
[0211] field_name: Customer name
[0212] field_type: string
[0213] description: The full name of the purchaser, usually appearing in the order header or order information field; it is the officially registered company name.
[0214] field_name: Purchase list
[0215] field_type: table
[0216] table_columns: field_name: Material number; field_type: string; description: Unique material code provided by the supplier, usually located in the first column of the purchase list.
[0217] field_name: Material name; field_type: string; description: Name of the purchased item, usually following the material number, describing the specific product type.
field_name: Specifications / Model; field_type: string; description: Technical parameters or model of the material, used to distinguish different specifications of the product.
field_name: Unit; field_type: string; description: Unit of measurement for the purchased quantity, such as "grams", "pieces", "rolls", etc.
field_name: Quantity; field_type: string; description: Quantity of the purchased item, usually expressed as a number followed by the unit.
[0218] Based on setting a unified extraction standard for the metadata corresponding to each target field, when applying the configuration template to each target field, the customer order number, order date, customer name, and purchase list can be filled in for field_name respectively.
[0219] Furthermore, the field_type for customer order number is filled with a string format, indicating that the metadata corresponding to the customer order number will be output in string format; the field_type for order date is filled with a date format, indicating that the metadata corresponding to the order date will be output in date format; the field_type for customer name is filled with a string format, indicating that the metadata corresponding to the customer name will be output in string format; and the field_type for purchase list is filled with a table format, indicating that the metadata corresponding to the purchase list will be output in table format.
[0220] Furthermore, since the customer order number, order date, and customer name are not in table format, each of the customer order number, order date, and customer name configuration templates has a description field, and each description field can be filled with a description to be used to locate the corresponding metadata.
[0221] Since the purchase list corresponds to a table format, the configuration template for the purchase list uses table columns, i.e., table_columns, instead of description, and sets its own configuration template for each column in the table corresponding to the purchase list.
[0222] When filling the configuration template for each column, the column name for each column is filled in field_name, such as material number, material name, specifications, unit and quantity.
[0223] Furthermore, the field_type for each of material number, material name, specification / model, unit, and quantity is filled with a string format, indicating that the metadata corresponding to each of these columns should be output in string format.
[0224] Furthermore, the description field in each column is populated with a description to be used to locate the corresponding metadata for that column.
[0225] After instantiating the configuration templates corresponding to each target field, the corresponding configuration parameters can be obtained.
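The instantiated configuration parameters described above can be modeled as a small data structure with a consistency check. This is a sketch under assumptions: the class name FieldConfig and the validate method are hypothetical, while the key names field_name, field_type, description, and table_columns follow the configuration template in this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldConfig:
    """Configuration parameters for one target field or one table column."""

    field_name: str                    # applicable object (target field or column)
    field_type: str                    # output rule, e.g. "string", "date", "table"
    description: Optional[str] = None  # location rule for non-table fields
    table_columns: list["FieldConfig"] = field(default_factory=list)

    def validate(self) -> None:
        """Check that a table uses table_columns and other fields a description."""
        if self.field_type == "table":
            if not self.table_columns:
                raise ValueError(f"{self.field_name}: a table needs table_columns")
            for column in self.table_columns:
                column.validate()
        elif not self.description:
            raise ValueError(f"{self.field_name}: a description is required")
```

For example, the purchase list would be a FieldConfig with field_type "table" whose table_columns hold one FieldConfig per column, each carrying its own description.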
[0226] In other cases, the configuration template to be matched in this step may contain both the rules and specifications that need to be set and the specific settings of the rules and specifications. In this case, steps 303 and 304 above will be replaced by: selecting the configuration template with the highest semantic matching degree with each target field from the configuration templates and directly determining it as the configuration parameter that matches the prompt word.
[0227] In another scenario, when the configuration parameters are determined through generation, step 305 can be executed after completing step 302 to construct the rule generation instruction.
[0228] In this step, a rule generation instruction can be constructed based on the prompt words, the document to be processed, or the target field to guide the LLM in generating configuration parameters. This allows the LLM to generate one or more rules for each target field to extract the corresponding metadata, i.e., the corresponding configuration parameters, when the rule generation instruction is run.
[0229] Further, step 306 can be executed to call LLM, and after calling LLM, step 307 can be executed to generate configuration parameters.
[0230] In this step, LLM can generate corresponding configuration parameters for each target field according to the prompt words. The generated configuration parameters can be the same as those generated in step 304 above.
[0231] Based on the configuration parameters generated in step 304 or step 307, step 308 can be further executed to extract the target information.
[0232] In this step, metadata corresponding to each field_name can be extracted from the purchase order or text content with document attribute information according to the location described in each description, and each metadata is output as the field_type corresponding to each field_name.
[0233] Based on this, the various metadata outputs can be combined to form target information, thereby extracting the target information.
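Combining the per-field metadata into target information might look like the following Python sketch. The key names `field_value` and `references` follow the JSON example given in this embodiment; the helper name, the converter table, and the `integer` and `number` types are assumptions, since only the string format appears in the text.

```python
import json

# Only "string" appears in the source; "integer" and "number" are assumed extras.
CONVERTERS = {"string": str, "integer": int, "number": float}


def build_target_information(config: list[dict], extracted: dict[str, dict]) -> dict:
    """Combine per-field metadata into one structured result.

    `extracted` maps field_name -> {"value": raw text, "reference": source snippet}.
    """
    items = []
    for entry in config:
        name = entry["field_name"]
        if name not in extracted:
            continue  # leave the gap visible to a later completeness check
        convert = CONVERTERS[entry["field_type"]]
        items.append({
            "field_name": name,
            "field_value": [convert(extracted[name]["value"])],
            "references": [extracted[name]["reference"]],
        })
    return {"extraction_result": items}


config = [
    {"field_name": "Customer Order Number", "field_type": "string"},
    {"field_name": "Order Date", "field_type": "string"},
]
extracted = {
    "Customer Order Number": {"value": "P020250304012",
                              "reference": "Purchase Order Number: P020250304012"},
    "Order Date": {"value": "2025-03-04", "reference": "Order Date: 2025-03-04"},
}
target_information = build_target_information(config, extracted)
print(json.dumps(target_information, ensure_ascii=False, indent=2))
```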
[0234] Further, step 309 can be executed to output the target information.
[0235] In this step, the target information extracted in step 308 above can be output in a structured form.
[0236] Specifically, it can be output in the following structured JSON format:
[0237]
{
  "extraction_result": [
    {"field_name": "Customer Order Number", "field_value": ["P020250304012"], "references": ["Purchase Order Number: P020250304012"]},
    {"field_name": "Order Date", "field_value": ["2025-03-04"], "references": ["Order Date: 2025-03-04"]},
    {"field_name": "Customer Name", "field_value": ["XXXXX Co., Ltd."], "references": ["Ordering Party: XXXXX Co., Ltd."]},
    {"field_name": "Purchase List", "table_values": [
      {"Material Number": "C.FL.0003", "Material Name": "Lead Solder Paste", "Specification": "GW9068C-6", "Unit": "gram", "Quantity": "100000"},
      {"Material Number": "C.FL.0001", "Material Name": "Solder Wire", "Specification": "YF-12 φ1.1mm 55%", "Unit": "pcs", "Quantity": "80"}
    ]}
  ]
}
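A downstream consumer could read this structure as in the short Python sketch below. The payload is an abbreviated copy of the example above, and the helper `index_by_field` is a hypothetical name introduced for illustration.

```python
import json

# Abbreviated copy of the structured output: one scalar field and one table row.
raw = """
{"extraction_result": [
  {"field_name": "Order Date", "field_value": ["2025-03-04"],
   "references": ["Order Date: 2025-03-04"]},
  {"field_name": "Purchase List", "table_values": [
    {"Material Number": "C.FL.0003", "Material Name": "Lead Solder Paste",
     "Specification": "GW9068C-6", "Unit": "gram", "Quantity": "100000"}
  ]}
]}
"""


def index_by_field(payload: dict) -> dict:
    """Map each field_name to its values, whether scalar or tabular."""
    indexed = {}
    for item in payload["extraction_result"]:
        # Table-type fields carry "table_values"; scalar fields carry "field_value".
        indexed[item["field_name"]] = item.get("table_values", item.get("field_value"))
    return indexed


indexed = index_by_field(json.loads(raw))
# Quantities are output as strings, so they are converted before arithmetic.
total_quantity = sum(int(row["Quantity"]) for row in indexed["Purchase List"])
```

Note that every value, including the quantity, arrives as a string because the example configuration sets each `field_type` to the string format.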
[0238] Based on the obtained target information, step 310 can be further executed to determine whether the expectation has been met.
[0239] In this step, the AI Agent deployed in WEP can use AI technology to perform a completeness assessment and an accuracy assessment on the extracted target information to determine whether the target information has omitted other data that needs to be extracted, and to determine whether the metadata in the target information is accurate.
[0240] Based on the judgment made in the aforementioned step, if the target information is complete and accurate, it can be considered to have met expectations, and step 311 can be further executed to apply the target information, that is, to use the target information in the business.
[0241] If any one or more metadata items in the target information are incomplete and / or inaccurate, the target information can be considered not to have met expectations. This judgment result can be used as a feedback signal and fed back to the LLM that generates the prompt words, so that the LLM can adjust its parameters and re-execute step 302 until the output target information meets expectations.
[0242] Alternatively, if the configuration parameters are obtained through adaptation, the feedback signal can be sent to the NLP service that matches the configuration parameters, so that the NLP can adjust the parameters and re-execute step 303 until the output target information meets expectations.
[0243] When configuration parameters are obtained through generation, the feedback signal can also be fed back to the LLM that generated the configuration parameters, so that the LLM can adjust the parameters and re-execute step 306 until the output target information reaches the expected level.
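The feedback loop of steps 310 and 311 can be illustrated with the following Python sketch. It is a simplification under stated assumptions: the four callables are hypothetical stand-ins for the prompt-generating LLM, the configuration source (NLP service or LLM), the extraction step, and the AI-based assessment; the feedback is passed as input rather than used to adjust model parameters; and `max_rounds` is an assumed cap, since the text only says the process repeats until expectations are met.

```python
from typing import Callable, Optional

Feedback = Optional[dict]


def extract_until_expected(
    generate_prompt: Callable[[Feedback], list],
    generate_config: Callable[[list, Feedback], list],
    extract: Callable[[list], dict],
    assess: Callable[[dict], list],
    max_rounds: int = 3,  # assumed cap, not stated in the source
) -> tuple:
    """Repeat prompt generation, configuration, and extraction until assess() finds no problems."""
    feedback: Feedback = None
    for round_no in range(1, max_rounds + 1):
        prompt_fields = generate_prompt(feedback)
        config = generate_config(prompt_fields, feedback)
        result = extract(config)
        problems = assess(result)  # names of incomplete or inaccurate fields
        if not problems:
            return result, round_no
        feedback = {"problem_fields": problems}
    raise RuntimeError(f"expectations not met after {max_rounds} rounds")


# Stub components: the first round misses "Order Date", the second recovers it.
DOCUMENT = {"Customer Order Number": "P020250304012", "Order Date": "2025-03-04"}


def demo_generate_prompt(feedback: Feedback) -> list:
    fields = ["Customer Order Number"]
    if feedback:
        fields += feedback["problem_fields"]
    return fields


def demo_generate_config(prompt_fields: list, feedback: Feedback) -> list:
    return [{"field_name": name, "field_type": "string"} for name in prompt_fields]


def demo_extract(config: list) -> dict:
    return {c["field_name"]: DOCUMENT[c["field_name"]] for c in config}


def demo_assess(result: dict) -> list:
    return [name for name in DOCUMENT if not result.get(name)]


result, rounds = extract_until_expected(
    demo_generate_prompt, demo_generate_config, demo_extract, demo_assess
)
```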
[0244] As can be seen, in this embodiment, taking PDF purchase orders as the specific processing object, and based on the complete process of steps 301 to 311 above, the entire process of extracting key information from specific document types has been implemented, verifying the practicality and effectiveness of the document information extraction method. Specifically, by calling VLM to parse document attribute information and text content, comprehensive basic document data is provided for subsequent operations; by generating precise prompts pointing to key information based on document information, the extraction requirements are clarified; by matching and instantiating configuration templates, or by building instructions to call LLM to generate configuration parameters, executable specifications are provided for the extraction action; the entire process does not require manual intervention in document parsing, prompt writing, or configuration rule design. It not only efficiently completes the configuration preparation corresponding to key information such as customer order number and order date in purchase orders, but also improves the generalization ability by generating prompts from multiple documents of the same type. This ensures the adaptability of extraction rules for specific documents such as purchase orders, and provides a reusable operation paradigm for batch processing of other documents of the same type, effectively improving the efficiency and reliability of specific document information extraction.
[0245] With further reference to Figure 4, as an implementation of the methods shown in the above figures, this disclosure provides an embodiment of a document information extraction device that combines RPA, AI, and LLM to realize an AI Agent. This device embodiment corresponds to the method embodiment shown in Figure 2, and the device can be applied to various electronic devices.
[0246] As shown in Figure 4, the document information extraction device 400 that combines RPA, AI, and LLM to realize an AI Agent in this embodiment includes: a recognition module 401, a prompt word generation module 402, a configuration parameter generation module 403, and a target information extraction module 404;
[0247] The recognition module 401 is configured to recognize at least one target field in the document to be processed, wherein each target field is a metadata name of a target information in the document to be processed.
[0248] The prompt word generation module 402 is configured to generate prompt words pointing to target information using each target field;
[0249] The configuration parameter generation module 403 is configured to generate configuration parameters that match the prompt words;
[0250] The target information extraction module 404 is configured to extract various metadata of target information from the document to be processed according to the configuration parameters.
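The division of labor among the four modules can be sketched as a simple Python pipeline. The stub functions below use plain string handling and regular expressions in place of the LLM, RPA, and NLP components, and every function name, as well as the "label: value" document layout, is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractionDevice:
    recognize: Callable[[str], list]         # recognition module 401
    generate_prompt: Callable[[list], str]   # prompt word generation module 402
    generate_config: Callable[[str], list]   # configuration parameter generation module 403
    extract: Callable[[str, list], dict]     # target information extraction module 404

    def run(self, document: str) -> dict:
        fields = self.recognize(document)
        prompt = self.generate_prompt(fields)
        config = self.generate_config(prompt)
        return self.extract(document, config)


def recognize(document: str) -> list:
    # Stub: treat the text before the first colon on each line as a target field.
    return [line.split(":", 1)[0].strip() for line in document.splitlines() if ":" in line]


def generate_prompt(fields: list) -> str:
    return "Extract these fields: " + "; ".join(fields)


def generate_config(prompt: str) -> list:
    names = prompt.split(": ", 1)[1].split("; ")
    return [{"field_name": n, "field_type": "string", "description": f"Text after '{n}:'"}
            for n in names]


def extract(document: str, config: list) -> dict:
    result = {}
    for entry in config:
        pattern = rf"^{re.escape(entry['field_name'])}:\s*(.+)$"
        match = re.search(pattern, document, flags=re.MULTILINE)
        if match:
            result[entry["field_name"]] = match.group(1).strip()
    return result


device = ExtractionDevice(recognize, generate_prompt, generate_config, extract)
target_information = device.run("Purchase Order Number: P020250304012\nOrder Date: 2025-03-04")
```

Keeping each module behind a callable makes it straightforward to swap a stub for a model-backed implementation without changing `run`.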
[0251] In this embodiment, the specific processing of the recognition module 401, prompt word generation module 402, configuration parameter generation module 403, and target information extraction module 404 of the document information extraction device 400 that combines RPA, AI, and LLM to realize the AI Agent, and the resulting technical effects can be referred to the relevant descriptions of steps 201, 202, 203, and 204 in the corresponding embodiment of Figure 2, which will not be repeated here.
[0252] In some alternative implementations, the recognition module 401 is further configured to:
[0253] The preset first large language model (LLM) is invoked to identify the document type and corresponding layout features of the document to be processed. The document type represents the content topic classification of the document to be processed, and the layout features represent the content layout of the document to be processed.
[0254] The first LLM is used to identify the fields at each location corresponding to the layout features in the document to be processed;
[0255] The first LLM is used to identify the target fields associated with the document type from the various fields.
[0256] Accordingly, before invoking the preset first large language model LLM to identify the document type and corresponding layout features of the document to be processed, the recognition module 401 also performs:
[0257] If the document to be processed is in image format, the preset Visual Language Model (VLM) is invoked to convert the image-formatted document into a text format for use in the first LLM.
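The format check that precedes the first LLM call could be sketched in Python as follows. The suffix list and both callables are assumptions: `vlm_to_text` stands in for the preset VLM and `read_text` for an ordinary text reader, and a real system would likely need content-based detection for cases such as scanned PDFs.

```python
from pathlib import Path
from typing import Callable

# Assumed set of image suffixes; the source only says "image format".
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def to_text(
    path: str,
    read_text: Callable[[str], str],
    vlm_to_text: Callable[[str], str],
) -> str:
    """Route image documents through the VLM and pass text documents through unchanged."""
    if Path(path).suffix.lower() in IMAGE_SUFFIXES:
        return vlm_to_text(path)
    return read_text(path)


# Stubs that only report which branch handled the document.
from_image = to_text("order.PNG", lambda p: "read as text", lambda p: "converted by VLM")
from_text = to_text("order.txt", lambda p: "read as text", lambda p: "converted by VLM")
```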
[0258] In some alternative implementations, the prompt word generation module 402 is further configured to:
[0259] Drive Robotic Process Automation (RPA) to input each target field into the first LLM;
[0260] Using the first LLM, each target field is combined into a structured semantic description according to the preset semantic specification, and used as prompt words.
[0261] In some optional implementations, the configuration parameter generation module 403 is further configured to:
[0262] Using a pre-defined Natural Language Processing (NLP) service, each target field in the prompt words is semantically matched with each pre-defined configuration template to obtain the corresponding semantic matching degree.
[0263] The configuration template with the highest semantic match to each target field is determined as the configuration parameter that matches the prompt word.
[0264] In some alternative implementations, the configuration parameters include at least one rule for extracting metadata;
[0265] Accordingly, the configuration parameter generation module 403 is further configured as follows:
[0266] Based on the target fields in the prompt words, construct rule generation instructions that point to each target field;
[0267] Invoke the preset second LLM, set at least one rule for each target field pointed to by the rule generation instruction to extract the corresponding metadata, and use it as the configuration parameter for the corresponding matching.
[0268] The configuration parameters include at least the output rules, the location rules, and the extraction specifications;
[0269] Accordingly, at least one rule is set for each target field pointed to by the rule generation instruction to extract the corresponding metadata, including:
[0270] According to the preset data format, set the target data format for each target field to output the corresponding metadata, and use it as the corresponding output rule;
[0271] Based on the layout characteristics, set the target location for each target field to locate the corresponding metadata, and use it as the corresponding location rule;
[0272] Set extraction specifications for each target field to unify the extraction actions of each metadata.
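The three kinds of rules can be pictured as one small data structure, as in the Python sketch below. The class names, the region labels, and the entries of the extraction specification are hypothetical; the text only states that output rules, location rules, and extraction specifications exist.

```python
from dataclasses import dataclass


@dataclass
class FieldRules:
    output_format: str  # output rule: target data format of the metadata
    location: str       # location rule: where in the layout to look


@dataclass
class ConfigParameters:
    field_rules: dict[str, FieldRules]
    extraction_spec: dict[str, bool]  # one specification shared by all fields


def build_config(layout: dict[str, str], data_format: str = "string") -> ConfigParameters:
    """Derive per-field output and location rules from layout features."""
    rules = {
        name: FieldRules(output_format=data_format, location=region)
        for name, region in layout.items()
    }
    # Hypothetical specification entries that unify how every field is extracted.
    spec = {"strip_whitespace": True, "keep_source_reference": True}
    return ConfigParameters(field_rules=rules, extraction_spec=spec)


layout_features = {
    "Customer Order Number": "header block",
    "Order Date": "header block",
    "Purchase List": "line-item table",
}
config = build_config(layout_features)
```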
[0273] In some optional implementations, the target information extraction module 404 is further configured to:
[0274] The preset third LLM is invoked, and the third LLM is used to extract the metadata corresponding to each target field from the document to be processed according to the extraction specifications and each location rule;
[0275] The structured metadata is output according to the output rules corresponding to each target field, thus obtaining the target information containing each metadata.
[0276] Accordingly, after extracting the target information's metadata from the document to be processed according to the configuration parameters, the target information extraction module 404 can execute:
[0277] Based on artificial intelligence (AI) technology, determine whether the metadata in the target information is complete and accurate;
[0278] In response to the determination that any item in the target information's metadata is incomplete and / or inaccurate, the first LLM for generating prompt words is adjusted; and / or
[0279] Adjust the natural language processing service that generates configuration parameters or the second LLM that generates configuration parameters.
[0280] It should be noted that the implementation details and technical effects of each module in the document information extraction device that combines RPA, AI, and LLM to implement an AI Agent provided in the embodiments of this disclosure can be referred to the descriptions of other embodiments in this disclosure, and will not be repeated here.
[0281] Reference is now made to Figure 5, which shows a schematic diagram of the structure of a computer system 500 suitable for implementing the electronic device of the present disclosure. The computer system 500 shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of this disclosure.
[0282] As shown in Figure 5, the computer system 500 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the computer system 500. The processing device 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output (I / O) interface 505 is also connected to the bus 504.
[0283] Typically, the following devices can be connected to I / O interface 505: input devices 506 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 508 including, for example, magnetic tapes, hard disks, etc.; and communication devices 509. Communication device 509 allows computer system 500 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 5 shows a computer system 500 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.
[0284] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 509, or installed from a storage device 508, or installed from a ROM 502. When the computer program is executed by the processing device 501, it performs the functions defined in the methods of embodiments of this disclosure.
[0285] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0286] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.
[0287] The aforementioned computer-readable medium carries one or more programs. When the aforementioned one or more programs are executed by the electronic device, the electronic device implements the document information extraction method that combines RPA, AI, and LLM to realize an AI Agent, as shown in the embodiment of Figure 2 and its optional implementations.
[0288] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0289] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0290] The units or modules described in the embodiments of this disclosure can be implemented in software or hardware. The names of the units or modules do not necessarily limit the unit itself; for example, the recognition module can also be described as "a module that identifies at least one target field in a document to be processed".
[0291] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, it also covers technical solutions formed by interchanging the above features with (but not limited to) technical features disclosed in this disclosure that have similar functions.
Claims
1. A method for extracting document information using an AI Agent by combining RPA, AI, and LLM, characterized in that the method, applied to an AI Agent, includes: identifying at least one target field in the document to be processed, wherein each target field is a metadata name of a target information item in the document to be processed, and wherein the identifying of at least one target field in the document to be processed includes: invoking a preset first large language model (LLM) to identify the document type and corresponding layout features of the document to be processed, wherein the document type represents the content topic classification of the document to be processed, and the layout features represent the content layout of the document to be processed; using the first LLM to identify the fields at each position corresponding to the layout features in the document to be processed; and using the first LLM to identify, from the fields, the target fields associated with the document type; generating prompt words pointing to the target information using each target field, wherein the generating of prompt words pointing to the target information using each target field includes: driving Robotic Process Automation (RPA) to input each target field into the first LLM; and using the first LLM to compose each target field into a structured semantic description according to the preset semantic specifications, the structured semantic description being used as the prompt words; generating configuration parameters that match the prompt words, wherein the configuration parameters include at least output rules, location rules, and extraction specifications; and extracting metadata of the target information from the document to be processed according to the configuration parameters, wherein the extracting includes: invoking a preset third large language model (LLM), and using the third LLM to extract the metadata corresponding to each target field from the document to be processed according to the extraction specifications and each location rule; and outputting the structured metadata according to the output rules corresponding to each target field, thereby obtaining the target information containing each metadata.
2. The method according to claim 1, characterized in that, Before invoking the preset first large language model (LLM) to identify the document type and corresponding layout features of the document to be processed, the method further includes: If the document to be processed is in image format, a preset visual language model (VLM) is invoked to convert the image-formatted document into text format for use with the first large language model (LLM).
3. The method according to claim 1, characterized in that the generating of configuration parameters that match the prompt words includes: using a preset Natural Language Processing (NLP) service to semantically match each target field in the prompt words with each preset configuration template to obtain the corresponding semantic matching degree; and determining the configuration template with the highest semantic matching degree with each target field as the configuration parameter that matches the prompt words.
4. The method according to claim 1, characterized in that the configuration parameters include at least one rule for extracting metadata; and the generating of configuration parameters that match the prompt words includes: constructing, based on each target field in the prompt words, rule generation instructions pointing to each target field; and invoking a preset second large language model (LLM) to set, for each target field pointed to by the rule generation instructions, at least one rule for extracting the corresponding metadata, the at least one rule being used as the matching configuration parameters.
5. The method according to claim 4, characterized in that, The step of setting at least one rule for extracting corresponding metadata for each target field pointed to by the rule generation instruction includes: According to the preset data format, set the target data format for each target field to output the corresponding metadata, and use it as the corresponding output rule; According to the layout features, set the target location for each target field to locate the corresponding metadata, and use it as the corresponding location rule; Set the extraction specifications for each target field to unify the metadata extraction actions.
6. The method according to claim 4, characterized in that, After extracting the target information's metadata from the document to be processed according to the configuration parameters, the method further includes: Based on artificial intelligence (AI) technology, it is determined whether the metadata in the target information is complete and accurate; In response to determining that any item in the target information's metadata is incomplete and / or inaccurate, the first large language model (LLM) that generated the prompt word is adjusted; and / or Adjust the natural language processing service that generates the configuration parameters or the second large language model (LLM) that generates the configuration parameters.
7. A document information extraction device that combines RPA, AI, and LLM to realize an AI Agent, characterized in that it includes: a recognition module configured to identify at least one target field in a document to be processed, wherein each target field is a metadata name of a target information item in the document to be processed, and wherein the identifying of at least one target field in the document to be processed includes: invoking a preset first large language model (LLM) to identify the document type and corresponding layout features of the document to be processed, wherein the document type represents the content topic classification of the document to be processed, and the layout features represent the content layout of the document to be processed; using the first LLM to identify the fields at each position corresponding to the layout features in the document to be processed; and using the first LLM to identify, from the fields, the target fields associated with the document type; a prompt word generation module configured to generate prompt words pointing to the target information using each target field, wherein the generating of prompt words pointing to the target information using each target field includes: driving Robotic Process Automation (RPA) to input each target field into the first LLM; and using the first LLM to compose each target field into a structured semantic description according to the preset semantic specifications, the structured semantic description being used as the prompt words; a configuration parameter generation module configured to generate configuration parameters that match the prompt words, wherein the configuration parameters include at least output rules, location rules, and extraction specifications; and a target information extraction module configured to extract the metadata of the target information from the document to be processed according to the configuration parameters, wherein the extracting includes: invoking a preset third large language model (LLM), and using the third LLM to extract the metadata corresponding to each target field from the document to be processed according to the extraction specifications and each location rule; and outputting the structured metadata according to the output rules corresponding to each target field, thereby obtaining the target information containing each metadata.
8. The apparatus according to claim 7, characterized in that, The configuration parameter generation module is further configured to: Using a preset Natural Language Processing (NLP) service, each target field in the prompt word is semantically matched with each preset configuration template to obtain the corresponding semantic matching degree. The configuration template with the highest semantic match degree with each target field is determined as the configuration parameter that matches the prompt word.
9. The apparatus according to claim 7, characterized in that, The configuration parameter generation module is further configured to: Based on each target field in the prompt words, construct rule generation instructions pointing to each target field; The preset second large language model LLM is invoked, and at least one rule is set for each target field pointed to by the rule generation instruction to extract the corresponding metadata, which is used as the configuration parameter for the corresponding matching.
10. A document information extraction device that combines RPA, AI, and LLM to realize an AI Agent, characterized in that it includes: one or more processors; and a storage device on which one or more programs are stored, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-6.
11. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by one or more processors, it implements the method as described in any one of claims 1-6.
12. A computer program product comprising computer program instructions, characterized in that, when the computer program instructions are executed on a computer, they cause the computer to perform the method as described in any one of claims 1-6.