Document image processing system
The integration of local OCR and LLM-OCR processing in the document image system addresses errors in extracting attribute pairs from irregular layouts by validating text data, enhancing accuracy in character recognition.
Patent Information
- Authority / Receiving Office: JP
- Patent Type: Applications
- Current Assignee / Owner: NET SMILE INC
- Filing Date: 2025-05-27
- Publication Date: 2026-06-15
AI Technical Summary
Existing document image processing systems face errors in extracting attribute label and value pairs from documents with irregular layouts, leading to non-extraction of pairs that should be extracted.
The document image processing system combines local OCR processing with OCR by a large-scale language model (LLM-OCR). It derives a similarity between the text data from the two sources, uses that similarity to determine whether the LLM-OCR text data is valid, and outputs the valid data from an integrated processing unit.
Using the local OCR result to validate the LLM-OCR text data reduces errors in extracting character strings from document images, including errors caused by irregular layouts.
Figure 2026096910000001_ABST
Abstract
Description
【Technical Field】 【0001】 The present invention relates to a document image processing system. 【Background Art】 【0002】 A certain system extracts a set of character images in a document image, generates text data (character string) of the set of character images, generates a feature vector corresponding to the text data of the set of character images, detects attribute label candidates and attribute value candidates for a specific attribute from the text data, and sets those pairs as pair candidates. At that time, attribute label candidates are detected based on the feature vector of the text data of the set of character images, and a pair of an attribute label and an attribute value is specified based on, for example, the degree of association between the two (see, for example, Patent Document 1). As a result, the attribute value of a certain attribute is accurately specified without using template data. 【Prior Art Documents】 【Patent Documents】 【0003】 【Patent Document 1】 Japanese Patent Application Laid-Open No. 2022-178723 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0004】 In the above system, based on the degree of association, a pair of an attribute label and an attribute value corresponding to the attribute label is extracted from a plurality of character strings in the document image. Therefore, if the degree of association is accurately derived, inappropriate pairs will not be extracted. However, when pairs of an attribute label and an attribute value are extracted from document images with various irregular (arbitrary) layouts, pair extraction errors (such as non-extraction of pairs that should be extracted) may occur. 【0005】 The present invention has been made in view of the above problems, and an object thereof is to obtain a document image processing system that suppresses extraction errors of character strings in a document image. 
[Means for solving the problem] 【0006】 The document image processing system according to the present invention comprises: a local OCR processing unit that performs character recognition processing on a document image and generates text data of the strings described in the document image; an LLM-OCR processing unit that inputs the document image along with prompts to a large-scale language model and obtains text data of the strings identified by the large-scale language model according to those prompts; and an integrated processing unit that derives a predetermined similarity between the text data generated by the local OCR processing unit and the text data obtained by the LLM-OCR processing unit, determines the validity of the text data obtained by the LLM-OCR processing unit based on that similarity, and outputs the text data obtained by the LLM-OCR processing unit if it is determined that the text data obtained by the LLM-OCR processing unit is valid. 【0007】 The document image processing method according to the present invention comprises: a local OCR processing step in which a computer performs character recognition processing on a document image and generates text data of the strings described in the document image; an LLM-OCR processing step in which a computer inputs the document image along with prompts to a large-scale language model and obtains text data of the strings identified by the large-scale language model from the large-scale language model according to those prompts; and an integrated processing step in which a computer derives a predetermined similarity between the text data generated in the local OCR processing step and the text data obtained in the LLM-OCR processing step, determines the validity of the text data obtained in the LLM-OCR processing step based on that similarity, and if it is determined that the text data obtained in the LLM-OCR processing step is valid, the computer outputs the text data obtained in the LLM-OCR processing step. 
【0008】 The document image processing program according to the present invention causes a computer to function as the local OCR processing unit, the LLM-OCR processing unit, and the integrated processing unit described above. [Effects of the Invention] 【0009】 According to the present invention, a document image processing system that suppresses errors in extracting text characters within a document image can be obtained. 【0010】 The above or other objects, features, and advantages of the present invention will become even more apparent from the following detailed description in conjunction with the accompanying drawings. [Brief explanation of the drawing] 【0011】 [Figure 1] Figure 1 is a block diagram showing the configuration of a document image processing system according to an embodiment of the present invention. [Figure 2] Figure 2 shows an example of a prompt. [Figure 3] Figure 3 is a flowchart illustrating the operation of the document image processing system according to Embodiment 1. [Figure 4] Figure 4 illustrates the prompt data in Embodiment 2. [Figure 5] Figure 5 is a block diagram showing the configuration of a document image processing system according to Embodiment 3 of the present invention. [Modes for carrying out the invention] 【0012】 Embodiments of the present invention will be described below with reference to the figures. 【0013】 Embodiment 1. 【0014】 FIG. 1 is a block diagram showing the configuration of a document image processing system according to an embodiment of the present invention. The document image processing system 1 shown in FIG. 1 is composed of one information processing device (such as a personal computer or a server), but the processing unit described later may be distributed among a plurality of information processing devices capable of data communication with each other. 
Further, such a plurality of information processing devices may include a GPU (Graphics Processing Unit) that performs parallel processing of specific operations. 【0015】 The document image processing system 1 shown in FIG. 1 includes a storage device 11, a communication device 12, and an arithmetic processing unit 13. 【0016】 The storage device 11 is a non-volatile storage device such as a flash memory or a hard disk, and stores various data and programs. Here, a document image processing program 11a and prompt data 11b are stored in the storage device 11. Further, system setting data (such as coefficient setting values of a learning device such as a neural network used in the local OCR processing unit 22 described later) is stored in the storage device 11 as needed. 【0017】 The document image processing program 11a may be stored in a portable computer-readable recording medium such as a CD (Compact Disc). In that case, for example, the document image processing program 11a is installed from the recording medium into the storage device 11. Further, the document image processing program 11a may be a single program or an aggregate of a plurality of programs. 【0018】 The prompt data 11b includes one or more prompts for transmission to the LLM server 3. 【0019】 The communication device 12 is a device capable of data communication such as a network interface or a modem, and performs data communication with other devices (such as the LLM server 3 and the user terminal device 4) via the network 2. 【0020】 The LLM server 3 is a server that receives input data including a prompt, executes the processing specified by the prompt in the input data using a large language model (LLM), and transmits the processing result as output data. For example, GPT-4 or Claude 3.5 Sonnet is used as the LLM. Note that the LLM server 3 may be included in the document image processing system 1. 【0021】 The user terminal device 4 is a device such as a personal computer or a smartphone. 
According to user operations, it transmits a document image specified by the user to the document image processing system 1 and receives from the document image processing system 1 the specific text data (that is, the result of optical character recognition (OCR) processing) that the system obtained from the document image. The document image to be transmitted may be raster image data obtained by a scanner or a camera, or PDF data. The received text data is displayed or stored by the user terminal device 4. 【0022】 The arithmetic processing unit 13 is a computer equipped with a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), etc. By loading a program from the ROM, the storage device 11, etc. into the RAM and executing it with the CPU, it operates as various processing units. 【0023】 Here, by executing the document image processing program 11a, the arithmetic processing unit 13 operates as the front-end processing unit 21, the local OCR processing unit 22, the LLM-OCR processing unit 23, and the integrated processing unit 24. 【0024】 The front-end processing unit 21 uses the communication device 12 to perform data communication with the user terminal device 4, (a) receives document data from the user terminal device 4, converts the document data into a document image (raster image) as necessary, and (b) transmits the result of the optical character recognition processing on the document image to the user terminal device 4. 【0025】 The local OCR processing unit 22 performs character recognition processing on the document image described above and generates text data of the strings written in the document image. 【0026】 Here, the local OCR processing unit 22 performs character recognition processing without using template data and generates text data of the strings written in the document image. 
【0027】 The local OCR processing unit 22 may generate text data of strings described in the document image by associating text data of strings as labels for specific items and text data of strings as values for those items. In that case, for example, the local OCR processing unit 22 may generate text data of strings described in the document image by the method described in Japanese Patent Application Laid-Open No. 2022-178723. Specifically, in that case, it (a) extracts a set of character images in the document image and generates text data of the set of character images, (b) generates feature vectors corresponding to the text data of the set of character images by embedding, (c) detects attribute label candidates and attribute value candidates for specific attributes from the text data and sets these pairs as pair candidates, with the attribute label candidates being detected based on the feature vectors of the text data of the set of character images, and (d) identifies pairs of attribute labels (text data of labels for specific items described later) and attribute values (text data of values for those specific items) based on the feature vectors of the attribute label candidates and the feature vectors of the attribute value candidates. 【0028】 The local OCR processing unit 22 may also perform character recognition processing using template data. The template data is data that indicates the location in the document image where the value of each item is written. 【0029】 The LLM-OCR processing unit 23 uses the communication device 12 to input a document image along with a prompt to the large-scale language model (LLM server 3), and retrieves text data of the strings identified by the large-scale language model (LLM server 3) according to that prompt. 
【0030】 In this embodiment, the LLM-OCR processing unit 23 uses the communication device 12 to input a document image to the LLM server 3 along with a prompt specifying the above-mentioned specific items, and obtains from the LLM server 3 text data of the strings identified by the LLM server 3 for the specific items specified by the prompt. 【0031】 Figure 2 shows an example of a prompt. The prompt shown in Figure 2 is an example of a prompt for extracting values (i.e., strings that indicate specific values for items, not the strings of the items themselves) for the items "Quotation Number", "Quotation Date", "Supplier Name", "Quotation Expiration Date", and "Details" from a document image of a "Quotation". The prompt may also carry additional settings, such as the output data format and per-item extraction rules (for example, those for "Quotation Date" in Figure 2). 【0032】 For example, the output data format (JSON, CSV, XML, etc.) of the LLM server 3 can be specified in the prompt. In that case, the output of the LLM server 3 includes, for each item, a pair of the item name (label) and its value. For example, in JSON format, the output of the LLM server 3 is written as {"Quotation Number":"11111", "Quotation Date":"20241010", "Supplier Name":"XXX Corporation", ...}. 【0033】 In this embodiment, the LLM-OCR processing unit 23 selects a prompt corresponding to the document type of the document image from among multiple document type prompts included in the prompt data 11b, inputs the selected prompt and document image to the LLM server 3, and obtains from the LLM server 3 text data of the strings identified by the LLM server 3 according to the selected prompt. 【0034】 The integrated processing unit 24 determines the validity of the text data acquired by the LLM-OCR processing unit 23 based on a predetermined similarity between the text data generated by the local OCR processing unit 22 and the text data acquired by the LLM-OCR processing unit 23. 
If it determines that the text data acquired by the LLM-OCR processing unit 23 is valid, it outputs the text data acquired by the LLM-OCR processing unit 23. Specifically, if the similarity exceeds a predetermined threshold, the text data is determined to be valid; otherwise, the text data is determined to be invalid. 【0035】 Here, for example, the similarity described above is a word match ratio between the two results: the number of words in the text data generated by the local OCR processing unit 22 that match words in the text data acquired by the LLM-OCR processing unit 23, divided by the total number of words in the text data generated by the local OCR processing unit 22. 【0036】 Alternatively, the similarity mentioned above may be the cosine similarity of the feature vectors of the text data obtained by embedding. 【0037】 Furthermore, when the local OCR processing unit 22 generates text data for each string through character recognition processing as described above, it may identify the attribute of the string (item (label), value for the item, etc.) and associate it with the text data. In this case, the similarity may be computed only over the text data of strings with a specific attribute (e.g., the label of the item), with strings of other attributes excluded. In other words, for example, the integrated processing unit 24 determines the validity of the text data acquired by the LLM-OCR processing unit 23 based on the similarity between the text data generated by the local OCR processing unit 22 and the text data acquired by the LLM-OCR processing unit 23 for the specific item described above. 
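The two similarity measures of paragraphs 【0035】 and 【0036】 and the threshold test of paragraph 【0034】 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the word-based measure is read here as matched words divided by the total number of words in the local OCR text, words are split on whitespace (which would not suit unsegmented Japanese text), and the function names and the threshold value 0.8 are hypothetical.

```python
import math
from typing import Sequence


def word_match_ratio(local_text: str, llm_text: str) -> float:
    """Fraction of local-OCR words that also occur in the LLM-OCR text.

    Assumption: words are whitespace-separated tokens.
    """
    local_words = local_text.split()
    if not local_words:
        return 0.0
    llm_words = set(llm_text.split())
    matched = sum(1 for word in local_words if word in llm_words)
    return matched / len(local_words)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_valid(similarity: float, threshold: float = 0.8) -> bool:
    """Valid only if the similarity exceeds the threshold (0.8 is a placeholder)."""
    return similarity > threshold
```

For the attribute-restricted variant of paragraph 【0037】, the same functions could be applied after filtering both results down to strings of one attribute, such as item labels.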
【0038】 Furthermore, in this embodiment, if the integrated processing unit 24 determines that the text data acquired by the LLM-OCR processing unit 23 is invalid, it adds the text data generated by the local OCR processing unit 22 (in this case, pairs of item labels and item values) to the prompt. Subsequently, the LLM-OCR processing unit 23 inputs the augmented prompt and the document image to the large-scale language model (LLM server 3), and acquires from it text data of the strings identified by the large-scale language model according to that prompt. 【0039】 Furthermore, in this embodiment, the local OCR processing unit 22 identifies the position of the string corresponding to the text data in the document image (specifically, the bounding box indicating the position and size of the string) along with the text data through character recognition processing, and the integrated processing unit 24 identifies the position of the string in the document image corresponding to the text data acquired by the LLM-OCR processing unit 23 based on the text data and string position identified by the local OCR processing unit 22. 【0040】 Next, the operation of the document image processing system according to Embodiment 1 will be described. Figure 3 is a flowchart illustrating the operation of the document image processing system according to Embodiment 1. 【0041】 The front-end processing unit 21 receives the OCR request and document data transmitted from the user terminal device 4 (step S1). If the document data is PDF data, the front-end processing unit 21 converts the document data into raster image data for each page. If the PDF data contains images of multiple pages, the front-end processing unit 21 identifies, among those pages, the range of images belonging to a specific document and uses the images in that range (which may themselves span multiple pages) as the document image to be processed by OCR. 
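The fallback of paragraph 【0038】, in which an invalid LLM-OCR result triggers a second query with the local OCR pairs appended to the prompt, might look like the sketch below. Everything named here is an assumption for illustration: `query_llm` stands in for whatever client talks to the LLM server 3, `similarity` for one of the measures described above, and the wording added to the prompt is invented.

```python
from typing import Callable, Dict

Pairs = Dict[str, str]  # item label -> item value


def augment_prompt(prompt: str, local_pairs: Pairs) -> str:
    """Append the local OCR label/value pairs to the prompt (wording is hypothetical)."""
    lines = [f"- {label}: {value}" for label, value in local_pairs.items()]
    return (
        prompt
        + "\n\nReference OCR result from a local engine (may contain errors):\n"
        + "\n".join(lines)
    )


def llm_ocr_with_fallback(
    image: bytes,
    prompt: str,
    local_pairs: Pairs,
    query_llm: Callable[[str, bytes], Pairs],
    similarity: Callable[[Pairs, Pairs], float],
    threshold: float = 0.8,  # placeholder value
) -> Pairs:
    """Query once; if the result is judged invalid, query again with an augmented prompt."""
    result = query_llm(prompt, image)
    if similarity(local_pairs, result) > threshold:
        return result
    return query_llm(augment_prompt(prompt, local_pairs), image)
```

Passing the LLM client and the similarity function in as callables keeps the sketch independent of any particular LLM service or embedding model.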
【0042】 Next, the local OCR processing unit 22 performs character recognition processing on the document image and generates text data of the strings written in the document image (step S2). 【0043】 Furthermore, the LLM-OCR processing unit 23 selects a prompt in the prompt data 11b (step S3), and uses the communication device 12 to send that prompt and the document image described above to the LLM server 3 (step S4). In response, it receives the processing result (i.e., the text data which is the OCR result) from the LLM server 3 (step S5). 【0044】 Note that the processing in step S2 and the processing in steps S3 to S5 may be performed in parallel. 【0045】 Then, the integrated processing unit 24 derives the similarity between the text data obtained by the local OCR processing unit 22 and the text data obtained by the LLM-OCR processing unit 23, and determines whether the text data obtained by the LLM-OCR processing unit 23 is valid or not based on the derived similarity (step S6). 【0046】 If the text data obtained by the LLM-OCR processing unit 23 is determined to be valid, the integrated processing unit 24 transmits the text data as an OCR result to the user terminal device 4 (step S7). 【0047】 In this case, since the OCR result from the LLM-OCR processing unit 23 does not include the position information of the string (the coordinate values where the string in the text data is described within the document image), the integrated processing unit 24 may use the position information included in the OCR result from the local OCR processing unit 22 to add the position information to the OCR result from the LLM-OCR processing unit 23. Specifically, the integrated processing unit 24 adds the position information of the string in the OCR result from the local OCR processing unit 22 to the string in the OCR result from the LLM-OCR processing unit 23 that matches the string in the OCR result from the local OCR processing unit 22. 
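The position transfer of paragraph 【0047】 can be illustrated with a short sketch. It assumes exact string equality as the matching rule and an (x, y, width, height) bounding-box layout. The source fixes neither of these, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def attach_positions(
    llm_values: Dict[str, str],
    local_strings: List[Tuple[str, BBox]],
) -> Dict[str, Dict[str, Optional[object]]]:
    """Copy the bounding box of each matching local-OCR string onto the LLM-OCR value.

    Values with no exactly matching local-OCR string get a bbox of None.
    """
    lookup: Dict[str, BBox] = {}
    for text, bbox in local_strings:
        lookup.setdefault(text, bbox)  # keep the first occurrence of a repeated string
    return {
        item: {"text": value, "bbox": lookup.get(value)}
        for item, value in llm_values.items()
    }
```

A fuzzier match (for example, after normalizing whitespace or character width) may be needed in practice, since the two OCR results rarely agree character for character.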
【0048】 Based on the OCR results to which position information has been added in this manner, the integrated processing unit 24 may, for example, superimpose each string onto the document image at the position indicated by its position information, and send the result as the OCR result to the user terminal device 4 for display. This makes it easier for the user to check the OCR results. 【0049】 On the other hand, if the text data obtained by the LLM-OCR processing unit 23 is determined to be invalid, the integrated processing unit 24 adds the text data that is the OCR result from the local OCR processing unit 22 to the prompt described above (step S8), sends the updated prompt and the document image described above to the LLM server 3 (step S9), and receives the processing result (i.e., the text data which is the OCR result) from the LLM server 3 in response (step S10). Then, the integrated processing unit 24 sends that text data as the OCR result to the user terminal device 4 (step S7). 【0050】 Furthermore, the OCR result returned by the LLM server 3 for a prompt to which the OCR result from the local OCR processing unit 22 has been added may also be checked for validity as described above. If valid, the text data is sent to the user terminal device 4 as the OCR result; otherwise, a warning, such as a notice that an error was detected, is sent to the user terminal device 4. 【0051】 As described above, according to Embodiment 1, the local OCR processing unit 22 performs character recognition processing on the document image and generates text data of the strings described in the document image. The LLM-OCR processing unit 23 inputs the document image along with a prompt to the large-scale language model and obtains text data of the strings identified by the large-scale language model for the items specified by the prompt. 
The integrated processing unit 24 determines the validity of the text data obtained by the LLM-OCR processing unit 23 based on a predetermined similarity between the text data generated by the local OCR processing unit 22 and the text data obtained by the LLM-OCR processing unit 23. If it determines that the text data obtained by the LLM-OCR processing unit 23 is valid, it outputs the text data obtained by the LLM-OCR processing unit 23. 【0052】 This reduces errors in extracting strings within document images, which can occur due to factors such as LLM hallucination. In particular, it reduces errors in extracting string sets that have specific relationships among multiple strings within a document image. 【0053】 Embodiment 2. 【0054】 Figure 4 illustrates the prompt data in Embodiment 2. In Embodiment 2, for example, as shown in Figure 4, the prompt data 11b includes a set of prompts 41-i (i=1,···,N) for each document type, and each set of prompts 41-i includes a default prompt 51, and further includes individual prompts 52 for each specific document format, as needed. 【0055】 Each individual prompt 52 is associated with a document vector 52a of a specific document. The document vector 52a is a vector obtained by embedding a document image of a specific document format. Note that individual prompts 52 are prompts that are individually generated to obtain valid OCR results when the default prompt 51 fails to obtain valid OCR results for a document image of a specific document format, and are added along with the document vector 52a as needed. 
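The prompt data of Embodiment 2 can be pictured as a small data structure: one prompt set per document type, holding a default prompt 51 and optional individual prompts 52, each tied to a document vector 52a. The sketch below also shows the threshold-based selection that the following paragraphs spell out. The class names, the choice of cosine similarity for comparing document vectors, and the threshold value 0.9 are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class IndividualPrompt:
    text: str
    document_vector: Sequence[float]  # embedding of one specific document format


@dataclass
class PromptSet:
    default_prompt: str
    individual_prompts: List[IndividualPrompt] = field(default_factory=list)


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompt(
    prompt_set: PromptSet,
    doc_vector: Sequence[float],
    threshold: float = 0.9,  # placeholder value
) -> str:
    """Return the individual prompt whose vector is most similar to doc_vector,
    provided that similarity exceeds the threshold; otherwise the default prompt."""
    best_text, best_similarity = prompt_set.default_prompt, threshold
    for candidate in prompt_set.individual_prompts:
        similarity = _cosine(candidate.document_vector, doc_vector)
        if similarity > best_similarity:
            best_text, best_similarity = candidate.text, similarity
    return best_text
```

Seeding the running best with the threshold means the default prompt is returned whenever no individual prompt clears it, without a separate branch.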
【0056】 In Embodiment 2, the LLM-OCR processing unit 23 generates document vectors of the document image by embedding, selects an individual prompt 52 corresponding to the generated document vector from the individual prompts 52 associated with a plurality of document vectors 52a, inputs the selected individual prompt 52 and the document image to the LLM server 3, and obtains text data of the string identified by the LLM server 3 for the item specified by the selected individual prompt 52 from the LLM server 3. 【0057】 Specifically, the LLM-OCR processing unit 23 selects an individual prompt 52 corresponding to the generated document vector from the individual prompts 52 associated with multiple document vectors 52a in the document type prompt set 41-i of the document image. For example, the prompt shown in Figure 2 is the prompt for the document type "Quotation". 【0058】 Specifically, the individual prompt 52 for the document vector 52a that has the highest similarity to the generated document vector and exceeds a predetermined threshold is selected. If there are no document vectors 52a whose similarity to the generated document vector exceeds the predetermined threshold, the default prompt 51 is selected, and the default prompt 51 and the document image are input to the LLM server 3. 【0059】 The other configurations and operations of the document image processing system according to Embodiment 2 are the same as those of Embodiment 1, so their description will be omitted. 【0060】 Embodiment 3. 【0061】 Figure 5 is a block diagram showing the configuration of a document image processing system according to Embodiment 3 of the present invention. 【0062】 In Embodiment 3, the LLM-OCR processing unit 23 inputs a document image along with the above prompts to multiple different large-scale language models (here, LLM servers 3-1 to 3-N), and obtains text data of the strings identified by each of those large-scale language models according to the prompts. 
【0063】 Furthermore, in Embodiment 3, the integrated processing unit 24 (a) obtains text data of the strings in the document image by synthesizing the text data acquired from each of the multiple large-scale language models, and (b) determines the validity of the text data acquired by the LLM-OCR processing unit 23 based on a predetermined similarity between the text data generated by the local OCR processing unit 22 and the synthesized text data. If it determines that the text data acquired by the LLM-OCR processing unit 23 is valid, it outputs the text data acquired by the LLM-OCR processing unit 23. 【0064】 In particular, in Embodiment 3, the LLM-OCR processing unit 23 inputs the same prompt and document image to multiple different large-scale language models (here, LLM servers 3-1 to 3-N), and obtains from each of them text data of the strings that it identified for the items specified by the prompt. 【0065】 In Embodiment 3, if the text data obtained from one of the multiple large-scale language models (here, LLM server 3-i) contains no string corresponding to a certain item, the integrated processing unit 24 supplements it with the string corresponding to that item in the text data obtained from another large-scale language model (here, LLM server 3-j), thereby synthesizing the text data obtained from each of the multiple large-scale language models. 
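A minimal sketch of this synthesis step: an item missing from one model's output is filled from another model's output, and, as paragraph 【0066】 goes on to describe, an item with several candidate strings is settled by majority vote. The function name is hypothetical, and breaking ties in favor of the earliest model in the list is an assumption, since the source does not say how ties are resolved.

```python
from collections import Counter
from typing import Dict, List


def synthesize(results: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-model OCR results (item -> string) into one result.

    Missing or empty values are ignored, so an item absent from one model is
    supplied by the others. Conflicts are resolved by majority vote. On a tie,
    the value seen first (from the earliest model) wins.
    """
    items: List[str] = []
    for result in results:
        for item in result:
            if item not in items:
                items.append(item)

    merged: Dict[str, str] = {}
    for item in items:
        candidates = [result[item] for result in results if result.get(item)]
        if candidates:
            merged[item] = Counter(candidates).most_common(1)[0][0]
    return merged
```

For example, with three models where only two return a "Supplier Name" and one misreads the "Quotation Number", the merged result carries the supplier name and the number on which two models agree.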
【0066】 Furthermore, in Embodiment 3, if the integrated processing unit 24 obtains multiple strings corresponding to a certain item in the text data obtained from each of the multiple large-scale language models (here, LLM servers 3-1 to 3-N), it determines the string corresponding to that item by majority vote, thereby synthesizing the text data obtained from each of the multiple large-scale language models. 【0067】 The other configurations and operations of the document image processing system according to Embodiment 3 are the same as those of Embodiments 1 or 2, so their description will be omitted. 【0068】 As described above, according to Embodiment 3, since text data obtained by synthesizing text data obtained from multiple large-scale language models that are different from each other is used, it becomes easier to extract strings of characters within document images more accurately. 【0069】 Furthermore, various changes and modifications to the embodiments described above will be obvious to those skilled in the art. Such changes and modifications may be made without deviating from the spirit and scope of the subject matter and without diminishing the intended advantages. In other words, such changes and modifications are intended to be included in the claims. 【0070】 For example, in embodiments 1 and 2 described above, the text data as an OCR result generated by the local OCR processing unit 22 may be added to the prompt initially input to the LLM server 3, so that a prompt including the text data as an OCR result generated by the local OCR processing unit 22 is input to the LLM server 3. 【0071】 Furthermore, in the above embodiments 1 and 2, the document image processing system 1 may generate template data for the document type of the document image based on the OCR results (extracted text data) obtained from the LLM server 3, after the location information has been added to the OCR results. 
【0072】 Furthermore, in embodiments 1 and 2 described above, the document image processing system 1 may generate training data for a character recognition learning device (such as a deep neural network) in the local OCR processing unit 22, etc., based on the OCR results (extracted text data) obtained from the LLM server 3 after the location information has been added (i.e., annotation may be performed automatically). 【0073】 Furthermore, in embodiments 1 and 2 described above, since the local OCR processing unit 22 identifies the position of the identified string, the integrated processing unit 24 may, if it determines that the document image contains a table, identify each row of the table based on its position, and for each row, determine whether the text data in the OCR results of both systems match for each string contained in that row, as described above, in order to determine the validity of the OCR results from the LLM server 3. In other words, in this case, for multiple rows of the table, the determination of whether the text data in the OCR results of both systems match is made repeatedly for each row (along the sub-scanning direction of the document image). In addition, in this case, the position information described above may be individually added to the text data of the OCR results from the LLM server 3, line by line. 【0074】 Furthermore, in embodiments 1 to 3 described above, the local OCR processing unit 22 may identify an area (left half, lower half, etc.) in the text image where the value of a specific item is described, and then extract the item from the identified area. 【0075】 Furthermore, in the second embodiment described above, the prompt data 11b may include a default prompt set 41-j, and if the prompt set 41-k for the document type of the document image to be processed by OCR is not present in the prompt data 11b, the prompt to be entered into the LLM server 3 may be selected from the default prompt set 41-j. 
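The row-by-row table check of paragraph 【0073】 could be approximated as below. This is only a sketch under stated assumptions: rows are recovered by grouping local OCR bounding boxes whose vertical centers lie within a tolerance of the first box in the row, boxes use an (x, y, width, height) layout, and a row counts as matching when every one of its strings appears among the LLM-OCR strings. The function names and the tolerance value are hypothetical.

```python
from typing import List, Set, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def group_rows(strings: List[Tuple[str, BBox]], tolerance: float = 10.0) -> List[List[str]]:
    """Group local-OCR strings into table rows by vertical center, top to bottom.

    Within a row, strings are ordered left to right by their x coordinate.
    """
    ordered = sorted(strings, key=lambda s: s[1][1] + s[1][3] / 2)
    rows: List[List[str]] = []
    current: List[Tuple[int, str]] = []
    current_y = 0.0
    for text, (x, y, _w, h) in ordered:
        center_y = y + h / 2
        if current and abs(center_y - current_y) > tolerance:
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            current_y = center_y  # the first box of a row anchors the row
        current.append((x, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


def check_rows(rows: List[List[str]], llm_strings: Set[str]) -> List[bool]:
    """For each row, report whether all of its strings occur in the LLM-OCR result."""
    return [all(text in llm_strings for text in row) for row in rows]
```

The per-row flags make it possible to reject or re-query only when a specific row disagrees, rather than judging the whole table with one similarity score.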
【0076】 Furthermore, in embodiments 1 to 3 described above, the local OCR processing unit 22 may perform the above-mentioned character recognition processing using, for each document type, a learner trained on training data of that document type. In this case, the local OCR processing unit 22 may store document vectors in a storage device or the like in association with their document types in advance, automatically identify the document type of a document image based on the similarity between the document vector of that image and the stored document vectors, and then perform the above-mentioned character recognition processing using the learner trained for the identified document type. [Industrial applicability] 【0077】 The present invention can be applied, for example, to the recognition processing of document images with diverse layouts. [Explanation of Symbols] 【0078】 1: Document image processing system; 3, 3-1 to 3-N: LLM server; 11a: Document image processing program; 13: Arithmetic processing unit (an example of a computer); 22: Local OCR processing unit; 23: LLM-OCR processing unit; 24: Integrated processing unit
Claims
[Claim 1] A document image processing system comprising: a local OCR processing unit that performs character recognition processing on a document image and generates text data of the strings described in the document image; an LLM-OCR processing unit that inputs the document image along with a prompt to a large-scale language model and obtains text data of the strings identified by the large-scale language model according to the prompt; and an integrated processing unit that derives a predetermined similarity between the text data generated by the local OCR processing unit and the text data acquired by the LLM-OCR processing unit, determines the validity of the text data acquired by the LLM-OCR processing unit based on the similarity, and, if it determines that the text data acquired by the LLM-OCR processing unit is valid, outputs the text data acquired by the LLM-OCR processing unit.
[Claim 2] A document image processing method comprising: a local OCR processing step in which a computer performs character recognition processing on a document image and generates text data of the strings described in the document image; an LLM-OCR processing step in which the computer inputs the document image along with a prompt to a large-scale language model and obtains, from the large-scale language model, text data of the strings identified by the large-scale language model according to the prompt; and an integrated processing step in which the computer derives a predetermined similarity between the text data generated in the local OCR processing step and the text data acquired in the LLM-OCR processing step, determines the validity of the text data acquired in the LLM-OCR processing step based on the similarity, and, if it determines that the text data acquired in the LLM-OCR processing step is valid, outputs the text data acquired in the LLM-OCR processing step.
[Claim 3] A document image processing program that causes a computer to function as: a local OCR processing unit that performs character recognition processing on a document image and generates text data of the strings described in the document image; an LLM-OCR processing unit that inputs the document image along with a prompt to a large-scale language model and obtains text data of the strings identified by the large-scale language model according to the prompt; and an integrated processing unit that derives a predetermined similarity between the text data generated by the local OCR processing unit and the text data acquired by the LLM-OCR processing unit, determines the validity of the text data acquired by the LLM-OCR processing unit based on the similarity, and, if it determines that the text data acquired by the LLM-OCR processing unit is valid, outputs the text data acquired by the LLM-OCR processing unit.