Method, device, processor and medium for online question answering by automatically parsing financial reports and converting into structured data

CN117493518B · Active · Publication Date: 2026-06-16 · GUOTAI JUNAN SECURITIES CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUOTAI JUNAN SECURITIES CO LTD
Filing Date
2023-11-16
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

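[0003] 1) Difficulty in viewing data: there are many financial statements, and PDF financial statements contain both text and tables, so quickly finding key indicators in them usually has to be done manually.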

Benefits of technology

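[0038] With the method, apparatus, processor, and computer-readable storage medium of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, a new approach to processing knowledge documents has been developed that analyzes, extracts, and records data more effectively while supporting direct question answering by chatbots. The method ensures data consistency and security and improves the credibility and reliability of the answers.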


Abstract

The present application relates to a method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, comprising the following steps: in an offline process, classifying user-input files by format and parsing them; in an offline process, fine-tuning an AI model by p-tuning so that it parses the context and entities in a question; and in an online process, performing financial question answering with the model. The present application also relates to a device, a processor and a storage medium for implementing this method. The method, device, processor and computer-readable storage medium provide a new way of processing knowledge documents so that data can be analyzed, extracted and recorded more effectively, while supporting direct question answering by chatbots. The method ensures the consistency and security of the data and improves the credibility and reliability of the answers.

Description

Technical Field

[0001] This invention relates to the field of large model tuning and prompt engineering design, and more particularly to the field of document data processing and storage. Specifically, it refers to a method, apparatus, processor, and computer-readable storage medium for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

Background Technology

[0002] As the digitization of financial statements accelerates, more and more companies are using PDF format to submit their financial statements. However, traditional PDF financial statement analysis methods have many pain points, such as:

[0003] 1) Difficulty in viewing data: There are many financial statements, and PDF financial statements contain both text and tables. Finding key indicators quickly usually requires manual intervention.

[0004] 2) Difficulty in data extraction: How to correctly and effectively extract PDF tables has always been a major challenge in the industry.

[0005] Common publicly available methods include OCR extraction, but OCR is relatively complex to operate, requires multiple devices, and has a high error rate when handling line breaks within tables. Furthermore, no previous application has combined such extraction with large models. This invention focuses on using the capabilities of large models to support the extraction and parsing of financial statements.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method, apparatus, processor and computer-readable storage medium thereof that automatically parses financial reports and converts them into structured data to achieve online question-and-answer processing, which meets the requirements of security, credibility and reliability.

[0007] To achieve the above objectives, the present invention provides a method, apparatus, processor, and computer-readable storage medium for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as follows:

[0008] This method for online question-and-answer processing, which automatically parses financial reports and converts them into structured data, is characterized by the following steps:

[0009] (1) During offline processing, classify and parse the user-input files according to their format;

[0010] (2) During offline processing, fine-tune the AI model by p-tuning so that it parses the context and entities in the question;

[0011] (3) Financial Q&A processing is performed through the model during the online process.

[0012] Preferably, step (1) specifically includes the following steps:

[0013] (1.1) Determine the format of the file entered by the user. If it is a TXT file, continue to step (1.2); if it is a Word file, continue to step (1.3); if it is a PDF file, continue to step (1.4).

[0014] (1.2) Segment the paragraphs according to punctuation marks, and call the embedding interface of the large model to vectorize each sentence and store it in the database;

[0015] (1.3) Parse the Word file using a Java project, then process the parsed text as a TXT document and store it in the database;

[0016] (1.4) Using a PDF processing tool, a self-developed algorithm is used to separate and extract the text and tables from the PDF, and key indicators are extracted from the consolidated cash flow statement, consolidated balance sheet and consolidated income statement and stored in the database.
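The format check in step (1.1) amounts to a dispatch on file type. A minimal sketch follows, assuming the format is taken from the file extension; the handler labels are hypothetical placeholders, since the patent does not name any functions.

```python
from pathlib import Path

# Handler labels are hypothetical; the patent only names the three formats.
HANDLERS = {
    ".txt": "segment_and_embed",        # split into sentences, vectorize, store
    ".doc": "parse_word_to_text",       # convert to text, then treat as TXT
    ".docx": "parse_word_to_text",
    ".pdf": "extract_text_and_tables",  # separate text and tables, store key indicators
}


def route(filename: str) -> str:
    """Return the label of the handler for a user-supplied file."""
    suffix = Path(filename).suffix.lower()
    try:
        return HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file format: {suffix or '(none)'}")
```

Rejecting unknown extensions explicitly keeps unsupported files out of the parsing pipeline instead of silently treating them as text.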

[0017] Preferably, step (1.2) further includes the following steps:

[0018] The table is split into columns and stored separately, and the text portion is processed as a txt document and stored in the database.

[0019] Preferably, step (1.4) of separating and extracting the PDF specifically includes the following steps:

[0020] (1.4.1) Traverse the rows of the current table to determine the actual number of columns in the table, traverse row by row, merge blank rows, and use a PDF processing tool to parse the text into JSON;

[0021] (1.4.2) For the extracted JSON, judge each line, remove the header and footer lines, and extract each Excel-type table and its table name;

[0022] (1.4.3) Extract the consolidated cash flow statement, consolidated balance sheet, and consolidated income statement according to the table names;

[0023] (1.4.4) Extract individual column information from the extracted table according to different column headers, and store the metadata in the MySQL database;

[0024] (1.4.5) Extract cell information from the extracted single-column data, assemble the cells into sentences, perform vector transformation, and store the vectors in the vector database.
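Steps (1.4.2) and (1.4.3) can be sketched as a filter over the parsed blocks. The block layout used here (a list of dicts with a `type` field, and the table name taken from the nearest preceding text line) is an assumption made for illustration; the patent does not specify the JSON schema.

```python
TARGET_STATEMENTS = (
    "consolidated balance sheet",
    "consolidated income statement",
    "consolidated cash flow statement",
)


def select_statements(blocks):
    """Return {statement name: table rows} for the three target statements."""
    selected = {}
    last_text = ""
    for block in blocks:
        kind = block["type"]
        if kind in ("header", "footer"):
            continue  # page headers and footers carry no table data
        if kind == "text":
            last_text = block["text"].strip().lower()
        elif kind == "table":
            for name in TARGET_STATEMENTS:
                if name in last_text:
                    # A statement may continue on the next page: append its rows.
                    selected.setdefault(name, []).extend(block["rows"])
    return selected


blocks = [
    {"type": "header", "text": "Annual Report 2022"},
    {"type": "text", "text": "Consolidated Balance Sheet"},
    {"type": "table", "rows": [["Item", "2022", "2021"], ["Total assets", "500", "450"]]},
    {"type": "footer", "text": "Page 12"},
    {"type": "table", "rows": [["Total liabilities", "200", "180"]]},
    {"type": "text", "text": "Notes to the accounts"},
    {"type": "table", "rows": [["Other", "1", "2"]]},
]
statements = select_statements(blocks)
```

Because headers and footers are skipped, a table that continues across a page break is still attributed to the same statement name. An unrelated table with no intervening text line would be attributed the same way, so a real implementation would need a stricter boundary test.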

[0025] Preferably, step (2) specifically includes the following steps:

[0026] (2.1) Prepare the corpus format;

[0027] (2.2) The model is trained by fine-tuning through p-tuning.

[0028] Preferably, step (3) specifically includes the following steps:

[0029] (3.1) Use the p-tuned large model to complete the contextual elements of the question and thereby confirm the user's question;

[0030] (3.2) Based on the user's parameters, determine whether the question and answer is based on a traditional knowledge base or on document-based knowledge. If it is based on a traditional knowledge base, call the old system; if it is based on document-based knowledge, search the vector database and find the most relevant key indicators from the vector database.

[0031] (3.3) Find the related single-column cell information in MySQL based on the key indicators in the vector database;

[0032] (3.4) Input the single-column information and the original question into the large model, and obtain the answer by combining the prompt ability of the large model.

[0033] The main feature of this apparatus for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data is that the apparatus comprises:

[0034] a processor configured to execute computer-executable instructions; and

[0035] a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method described above for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0036] The processor used to implement online question-and-answer processing by automatically parsing financial reports and converting them into structured data is characterized in that the processor is configured to execute computer-executable instructions, which, when executed by the processor, implement the various steps of the method described above for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0037] The computer-readable storage medium is characterized in that it stores a computer program that can be executed by a processor to implement the steps of the above-described method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0038] With the method, apparatus, processor, and computer-readable storage medium of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, a new approach to processing knowledge documents has been developed that analyzes, extracts, and records data more effectively while supporting direct question answering by chatbots. The method ensures data consistency and security and improves the credibility and reliability of the answers.

Description of the Drawings

[0039] Figure 1 is a flowchart of the document-to-structured-data processing step in the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0040] Figure 2 is a flowchart of the financial question-and-answer processing in the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0041] Figure 3a is a schematic diagram of the financial expert Q&A interface of the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0042] Figure 3b is a schematic diagram of the financial expert configuration interface of the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0043] Figure 3c is a schematic diagram of the financial report upload interface of the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0044] Figure 4 is a schematic diagram of the structured data obtained after financial report conversion in the method of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

Detailed Implementation

[0045] To more clearly describe the technical content of the present invention, the following description is provided in conjunction with specific embodiments.

[0046] The present invention provides a method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, comprising the following steps:

[0047] (1) During offline processing, classify and parse the user-input files according to their format;

[0048] (2) During offline processing, fine-tune the AI model by p-tuning so that it parses the context and entities in the question;

[0049] (3) Financial Q&A processing is performed through the model during the online process.

[0050] In a preferred embodiment of the present invention, step (1) specifically includes the following steps:

[0051] (1.1) Determine the format of the file entered by the user. If it is a TXT file, continue to step (1.2); if it is a Word file, continue to step (1.3); if it is a PDF file, continue to step (1.4).

[0052] (1.2) Segment the paragraphs according to punctuation marks, and call the embedding interface of the large model to vectorize each sentence and store it in the database;

[0053] (1.3) Parse the Word file using a Java project, then process the parsed text as a TXT document and store it in the database;

[0054] (1.4) Using a PDF processing tool, a self-developed algorithm is used to separate and extract the text and tables from the PDF, and key indicators are extracted from the consolidated cash flow statement, consolidated balance sheet and consolidated income statement and stored in the database.

[0055] In a preferred embodiment of the present invention, step (1.2) further includes the following steps:

[0056] The table is split into columns and stored separately, and the text portion is processed as a txt document and stored in the database.

[0057] In a preferred embodiment of the present invention, step (1.4) of separating and extracting the PDF specifically includes the following steps:

[0058] (1.4.1) Traverse the rows of the current table to determine the actual number of columns in the table, traverse row by row, merge blank rows, and use a PDF processing tool to parse the text into JSON;

[0059] (1.4.2) For the extracted JSON, judge each line, remove the header and footer lines, and extract each Excel-type table and its table name;

[0060] (1.4.3) Extract the consolidated cash flow statement, consolidated balance sheet, and consolidated income statement according to the table names;

[0061] (1.4.4) Extract individual column information from the extracted table according to different column headers, and store the metadata in the MySQL database;

[0062] (1.4.5) Extract cell information from the extracted single-column data, assemble the cells into sentences, perform vector transformation, and store the vectors in the vector database.

[0063] In a preferred embodiment of the present invention, step (2) specifically includes the following steps:

[0064] (2.1) Prepare the corpus format;

[0065] (2.2) The model is trained by fine-tuning through p-tuning.

[0066] In a preferred embodiment of the present invention, step (3) specifically includes the following steps:

[0067] (3.1) Use the p-tuned large model to complete the contextual elements of the question and thereby confirm the user's question;

[0068] (3.2) Based on the user's parameters, determine whether the question and answer is based on a traditional knowledge base or on document-based knowledge. If it is based on a traditional knowledge base, call the old system; if it is based on document-based knowledge, search the vector database and find the most relevant key indicators from the vector database.

[0069] (3.3) Find the related single-column cell information in MySQL based on the key indicators in the vector database;

[0070] (3.4) Input the single-column information and the original question into the large model, and obtain the answer by combining the prompt ability of the large model.

[0071] The present invention relates to an apparatus for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data, wherein the apparatus comprises:

[0072] a processor configured to execute computer-executable instructions; and

[0073] a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method described above for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0074] The present invention relates to a processor for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data. The processor is configured to execute computer-executable instructions, which, when executed by the processor, implement the various steps of the method described above for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0075] The computer-readable storage medium of the present invention stores a computer program that can be executed by a processor to implement the various steps of the above-described method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data.

[0076] In the specific embodiments of the present invention, the technical algorithm of this solution is divided into two main parts: one is the offline processing process, and the other is the online question-and-answer process.

[0077] I. Offline Process – Parsing and processing user-input files categorized as Word, TXT, and PDF.

[0078] (1.1) TXT documents: paragraphs are divided by punctuation marks, and the large model embedding interface is called (the publicly available text2vec-base-chinese model is used in this invention) to vectorize each sentence and store it in the database.

[0079] (1.2) Word document: The Java project parses the Word document and processes the parsed text into the database as a TXT document.

[0080] (1.3) PDF documents (currently only financial report PDFs are supported): Using the pdfplumber tool together with a self-developed algorithm, the text and tables are separated and extracted. Key indicators from the consolidated cash flow statement, consolidated balance sheet, and consolidated income statement are extracted and stored in the database. Testing revealed that chatglm is prone to misalignment when answering questions over tabular text; therefore, this invention splits each table by column header and stores the columns separately. Comparative testing showed that this significantly improves the reliability of the answers. The conversion results are shown in Figure 4. The text portion is processed and stored in the database as a TXT document.
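The TXT path in step (1.1) above splits text on punctuation before each sentence is vectorized. A minimal sketch of that segmentation follows; the punctuation set is an assumption, and the embedding call is left as a caller-supplied function (`embed` is a hypothetical parameter) rather than a concrete call to the text2vec-base-chinese model.

```python
import re

# Split after CJK sentence punctuation, or after ASCII punctuation followed by
# whitespace, so the decimal point in "3.5" does not end a sentence.
# Splitting on empty matches requires Python 3.7 or later.
_SENTENCE_BREAK = re.compile(r"(?<=[。！？；])\s*|(?<=[.!?;])\s+")


def split_sentences(text):
    """Return the non-empty sentences of `text`, punctuation included."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s.strip()]


def index_sentences(text, embed, store):
    """Embed each sentence with the caller-supplied `embed` and append it to `store`."""
    for sentence in split_sentences(text):
        store.append((sentence, embed(sentence)))
    return store
```

Keeping the embedding function as a parameter lets the same segmentation code be exercised without loading a model.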

[0081] The core algorithm for PDF extraction is:

[0082] (1.3.1) Parse the text into JSON using pdfplumber. During this process, wrapped lines must be merged: first iterate through the rows of the current table to determine the actual number of columns, then iterate row by row to merge empty rows (judgment logic: no empty columns above or below).

[0083] (1.3.2) For the extracted JSON, judge each line, remove the header and footer lines, and extract each Excel table and its table name;

[0084] (1.3.3) Extract the consolidated cash flow statement, consolidated balance sheet, and consolidated income statement by table name;

[0085] (1.3.4) Extract individual column information from the extracted table according to different column headers, and store the metadata in the MySQL database;

[0086] (1.3.5) Extract cell information from the extracted single-column data, assemble the cells into sentences, perform vector transformation, and store the vectors in the vector database.
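Steps (1.3.4) and (1.3.5) can be illustrated as follows. This is a sketch under stated assumptions: the first row of a table is taken as the column headers, the first column as the item names, and the sentence template is invented for illustration. An in-memory SQLite database stands in for MySQL so the example is self-contained, and the vectorization step is omitted.

```python
import sqlite3


def split_by_column(statement, rows):
    """Yield (statement, column header, item, value) for each non-empty data cell."""
    header, *body = rows
    for col_index, col_name in enumerate(header[1:], start=1):
        for row in body:
            item, value = row[0], row[col_index]
            if value not in ("", None):
                yield statement, col_name, item, value


def to_sentence(statement, col_name, item, value):
    # Hypothetical template; the patent does not give the sentence wording.
    return f"{statement}, {col_name}, {item}: {value}"


rows = [
    ["Item", "2022", "2021"],
    ["Net profit", "120", "100"],
    ["Operating revenue", "900", ""],
]

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute(
    "CREATE TABLE cells (id INTEGER PRIMARY KEY, statement TEXT, col TEXT, item TEXT, value TEXT)"
)
cells = list(split_by_column("consolidated income statement", rows))
conn.executemany(
    "INSERT INTO cells (statement, col, item, value) VALUES (?, ?, ?, ?)", cells
)
sentences = [to_sentence(*cell) for cell in cells]
```

Each sentence names its statement, column, and item explicitly, so a value cannot be attributed to the wrong year, which is the misalignment problem the column-wise storage is meant to avoid.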

[0087] If this solution were to directly extract tables using basic tools, issues such as misaligned rows and empty columns would arise. Therefore, the pdfplumber tool was technically optimized. Based on the extracted tables and text, line breaks were merged, and empty rows and columns were removed. The following algorithm was used, particularly for merging line breaks:

[0088] (1.3.1.1) Extract by line and parse it into one of the following: text, table, header, or footer;

[0089] (1.3.1.2) Iterate over the parsed content to obtain each individual table;

[0090] (1.3.1.3) Remove empty rows and columns from the table (transpose the two-dimensional array and determine if it is an empty row or column).

[0091] (1.3.1.4) Traverse each row of the table and, based on the principle that a row should not be entirely empty, determine whether the current row needs to be merged into the previous row.
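Steps (1.3.1.3) and (1.3.1.4) can be sketched on a plain list-of-lists table. The merge criterion is only loosely specified above, so this sketch assumes that a row containing blank cells is a wrapped continuation of the row above it; section-heading rows would need extra handling in practice. All function names are hypothetical, and the table is assumed to be rectangular.

```python
def is_empty(cell):
    return cell is None or str(cell).strip() == ""


def drop_empty_rows(table):
    return [row for row in table if not all(is_empty(c) for c in row)]


def drop_empty_columns(table):
    # Transpose so columns become rows, drop the empty ones, transpose back.
    columns = drop_empty_rows([list(col) for col in zip(*table)])
    return [list(row) for row in zip(*columns)]


def merge_wrapped_rows(table, sep=" "):
    """Fold rows that contain blank cells into the previous row (assumed heuristic)."""
    merged = []
    for row in table:
        if merged and any(is_empty(c) for c in row):
            merged[-1] = [
                p if is_empty(c) else (c if is_empty(p) else f"{p}{sep}{c}")
                for p, c in zip(merged[-1], row)
            ]
        else:
            merged.append(list(row))
    return merged


def clean_table(table, sep=" "):
    return merge_wrapped_rows(drop_empty_columns(drop_empty_rows(table)), sep)
```

For Chinese text, `sep=""` would join the wrapped fragments without inserting a space.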

[0092] II. Offline Process – p-tuning fine-tuning training of chatglm2-6b (used to parse the context and entities in the question).

[0093] (1) The format for preparing the corpus is as follows:

[0094] {"content": "Please extract five entities from the following question using your knowledge: company abbreviation, stock code, full company name, year, and target question, and return the answer in JSON format. The question is: XXXX", "summary": "{\"Company abbreviation\": \"\", \"Full company name\": \"\", \"Stock code\": \"\", \"Year\": \"\", \"Target question\": \"\"}"}

[0095] (2) Model training: p-tuning fine-tuning training was adopted. The training results can be referred to in Figure 5. The reliability of the model is estimated to be about 99%.
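The corpus record in item (1) nests a JSON object inside a JSON string, which is easy to get wrong by hand. A minimal sketch of generating one record per line follows, assuming the `content` and `summary` keys shown above; the helper name and the sample question are hypothetical.

```python
import json

PROMPT = (
    "Please extract five entities from the following question using your knowledge: "
    "company abbreviation, stock code, full company name, year, and target question, "
    "and return the answer in JSON format. The question is: {question}"
)

FIELDS = (
    "Company abbreviation",
    "Full company name",
    "Stock code",
    "Year",
    "Target question",
)


def make_record(question, entities):
    """Return one training record as a JSON line; missing entities become ""."""
    summary = {field: entities.get(field, "") for field in FIELDS}
    return json.dumps(
        {
            "content": PROMPT.format(question=question),
            # The expected answer is itself JSON, serialized into a string.
            "summary": json.dumps(summary, ensure_ascii=False),
        },
        ensure_ascii=False,
    )


line = make_record(
    "What was ExampleCo's net profit in 2022?",  # hypothetical company
    {"Company abbreviation": "ExampleCo", "Year": "2022", "Target question": "net profit"},
)
```

Letting `json.dumps` handle the inner quoting avoids the unbalanced quotes that hand-written records tend to accumulate.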

[0096] III. Online Process – Online Q&A Process with Financial Experts

[0097] (1) First, use the p-tuned large model to complete the contextual elements of the question and thereby confirm the user's question;

[0098] (2) Based on the user's parameters, determine whether the question is to be answered from the traditional knowledge base or from document-based knowledge;

[0099] (3) Traditional knowledge base question answering calls the old system. Document-based question answering searches the vector database to find the most relevant key indicators;

[0100] (4) Find the related single-column cell information in MySQL based on the key indicators found in the vector database;

[0101] (5) Input the single-column information and the original question into the large model, and combine the prompt ability of the large model to obtain the answer.
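The document-based branch of the online process (steps 3 to 5) can be sketched end to end. Everything concrete here is an assumption made for illustration: a bag-of-words counter stands in for the real embedding model, a dict stands in for the vector database, in-memory SQLite stands in for MySQL, and the prompt wording is invented. The routing to the old system and the call to the large model are omitted.

```python
import math
import re
import sqlite3
from collections import Counter


def toy_embed(text):
    # Stand-in for a real sentence-embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.execute("CREATE TABLE cells (id INTEGER PRIMARY KEY, sentence TEXT)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    [
        (1, "consolidated income statement, 2022, net profit: 120"),
        (2, "consolidated balance sheet, 2022, total assets: 500"),
        (3, "consolidated income statement, 2021, net profit: 100"),
    ],
)
# Toy vector index: cell id -> vector.
index = {cid: toy_embed(s) for cid, s in conn.execute("SELECT id, sentence FROM cells")}


def build_prompt(question, top_k=1):
    """Retrieve the most similar cells and assemble the prompt for the large model."""
    q = toy_embed(question)
    best = sorted(index, key=lambda cid: cosine(q, index[cid]), reverse=True)[:top_k]
    placeholders = ",".join("?" * len(best))
    rows = conn.execute(
        f"SELECT sentence FROM cells WHERE id IN ({placeholders})", best
    ).fetchall()
    facts = "\n".join(r[0] for r in rows)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"


prompt = build_prompt("What was the net profit in 2022?")
```

The vector index only returns identifiers; the text handed to the model always comes from the relational store, which mirrors the two-database split described above.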

[0102] An example of the model training results of the present invention is shown below:

[0103] "epoch": 68.38,

[0104] "predict_bleu-4": 99.71520943396227,

[0105] "predict_rouge-1": 99.91419245283018,

[0106] "predict_rouge-2": 99.7677245283019,

[0107] "predict_rouge-l": 99.88678113207548,

[0108] "predict_runtime": 86.3025,

[0109] "predict_samples": 53,

[0110] "predict_samples_per_second": 0.614,

[0111] "predict_steps_per_second": 0.614,

[0112] "train_loss": 0.0040636288324991865,

[0113] "train_runtime": 8567.1811,

[0114] "train_samples": 702,

[0115] "train_samples_per_second": 5.603,

[0116] "train_steps_per_second": 0.35

For the specific implementation scheme of this embodiment, please refer to the relevant description in the above embodiments, which will not be repeated here.
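The throughput figures in the training results above can be cross-checked against each other with simple arithmetic. The sketch below assumes the metrics have their usual meaning (samples divided by runtime in seconds); the effective batch size it derives is an inference from the reported rates, not a value stated in the patent.

```python
metrics = {
    "epoch": 68.38,
    "predict_runtime": 86.3025,
    "predict_samples": 53,
    "predict_samples_per_second": 0.614,
    "train_runtime": 8567.1811,
    "train_samples": 702,
    "train_samples_per_second": 5.603,
    "train_steps_per_second": 0.35,
}

# Prediction throughput: samples divided by runtime.
predict_rate = metrics["predict_samples"] / metrics["predict_runtime"]

# Total samples processed during training, and the number of passes over
# the 702-sample training set that this implies.
samples_seen = metrics["train_samples_per_second"] * metrics["train_runtime"]
implied_epochs = samples_seen / metrics["train_samples"]

# Samples per optimizer step, i.e. the implied effective batch size.
implied_batch = metrics["train_samples_per_second"] / metrics["train_steps_per_second"]
```

Under these assumptions the reported values are mutually consistent: the prediction rate matches the reported 0.614, the implied epoch count matches the reported 68.38, and the implied effective batch size is close to 16.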

[0117] It is understood that the same or similar parts in the above embodiments can be referred to each other, and the contents not described in detail in some embodiments can be referred to the same or similar contents in other embodiments.

[0118] It should be noted that in the description of this invention, the terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Furthermore, in the description of this invention, unless otherwise stated, "a plurality of" means at least two.

[0119] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of the preferred embodiments of the invention includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of the invention pertain.

[0120] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution device. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0121] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The corresponding program can be stored in a computer-readable storage medium. When the program is executed, it includes one or a combination of the steps of the method embodiments.

[0122] Furthermore, the functional units in the various embodiments of the present invention can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

[0123] The storage media mentioned above can be read-only memory, disk, or optical disk, etc.

[0124] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0125] With the method, apparatus, processor, and computer-readable storage medium of the present invention for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, a new approach to processing knowledge documents has been developed that analyzes, extracts, and records data more effectively while supporting direct question answering by chatbots. The method ensures data consistency and security and improves the credibility and reliability of the answers.

[0126] In this specification, the invention has been described with reference to specific embodiments thereof. However, it will be apparent that various modifications and variations can be made without departing from the spirit and scope of the invention. Therefore, the specification and drawings should be considered illustrative rather than restrictive.

Claims

1. A method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, characterized in that the method includes the following steps:
(1) During offline processing, classify the user-input files by format and parse them;
(2) During offline processing, fine-tune the AI model by p-tuning so that it parses the context and entities in the question;
(3) Financial Q&A processing is performed online using the model;
Step (1) specifically includes the following steps:
(1.1) Determine the format of the file entered by the user. If it is a TXT file, continue to step (1.2); if it is a Word file, continue to step (1.3); if it is a PDF file, continue to step (1.4);
(1.2) Segment the paragraphs according to punctuation marks, and call the embedding interface of the large model to vectorize each sentence and store it in the database;
(1.3) Parse the Word file using a Java project, then process the parsed text as a TXT document and store it in the database;
(1.4) Using a PDF processing tool, a self-developed algorithm is used to separate and extract the text and tables from the PDF, and key indicators are extracted from the consolidated cash flow statement, consolidated balance sheet and consolidated income statement and stored in the database;
The step (1.4) of separating and extracting from the PDF specifically includes the following steps:
(1.4.1) Traverse the rows of the current table to determine the actual number of columns in the table, traverse row by row, merge blank rows, and use a PDF processing tool to parse the text into JSON;
(1.4.2) For the extracted JSON, judge each line, remove the header and footer lines, and extract each Excel-type table and its table name;
(1.4.3) Extract the consolidated cash flow statement, consolidated balance sheet, and consolidated income statement according to the table names;
(1.4.4) Extract individual column information from the extracted table according to different column headers, and store the metadata in the MySQL database;
(1.4.5) Extract cell information from the extracted single-column cells, assemble them into sentences, perform vector transformation, and store them in a vector database;
Step (3) specifically includes the following steps:
(3.1) Use the p-tuned large model to complete the contextual elements of the question and thereby confirm the user's question;
(3.2) Based on the user's parameters, determine whether the question and answer is based on a traditional knowledge base or on document-based knowledge. If it is based on a traditional knowledge base, call the old system; if it is based on document-based knowledge, search the vector database and find the most relevant key indicators from the vector database;
(3.3) Find the related single-column cell information in MySQL based on the key indicators in the vector database;
(3.4) Input the single-column information and the original question into the large model, and obtain the answer by combining the prompt ability of the large model.

2. The method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as described in claim 1, characterized in that step (1.2) further includes the following step: the table is split into columns and stored separately, and the text portion is processed as a TXT document and stored in the database.

3. The method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as described in claim 1, characterized in that step (2) specifically includes the following steps: (2.1) Prepare the corpus format; (2.2) Train the model by p-tuning fine-tuning.

4. An apparatus for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data, characterized in that the apparatus comprises: a processor configured to execute computer-executable instructions; and a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as described in any one of claims 1 to 3.

5. A processor for implementing online question-and-answer processing by automatically parsing financial reports and converting them into structured data, characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as described in any one of claims 1 to 3.

6. A computer-readable storage medium, characterized in that it stores a computer program that can be executed by a processor to implement the steps of the method for online question-and-answer processing by automatically parsing financial reports and converting them into structured data, as described in any one of claims 1 to 3.