A bill text recognition method and device, computer equipment and storage medium

CN115984890B · Active · Publication Date: 2026-06-30 · PING AN HEALTH INSURANCE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: PING AN HEALTH INSURANCE CO LTD
Filing Date: 2022-12-06
Publication Date: 2026-06-30

Application Information

IPC: G06V30/42; G06V30/413; G06V30/412; G06F40/295; G06N3/045

AI Tagging

Technology Topics

Text recognition; Document recognition

AI Technical Summary

Technical Problem

Existing technologies for recognizing medical invoices are prone to errors due to folding and line breaks in some fields, requiring manual intervention and increasing the cost of information technology for claims and reimbursement.

Method used

The system employs a pre-defined recognition and localization model and a multimodal transformer model to perform text recognition and named entity extraction on the ticket images. It combines layout information to construct a candidate set of entity pairs and merges related entities through an association judgment model to perform text merging.

Benefits of technology

It improves the text recognition accuracy of invoices when they are folded or have line breaks in some fields, reduces the need for manual review, and lowers the cost of information technology.

✦ Generated by Eureka AI based on patent content.

Abstract

This application belongs to the field of document text recognition technology in artificial intelligence, and relates to a document text recognition method, including acquiring a document image to be recognized; performing text recognition on the document image using a preset recognition and localization model; extracting named entities from the text information using a preset multimodal transformer model; constructing a candidate set of entity pairs based on preset pairing rules, combining multiple named entities and layout information; judging whether each entity pair is related using a preset association judgment model; and merging entity pairs that are judged to be related. This application also provides a document recognition device, computer equipment, and storage medium. Furthermore, this application relates to blockchain technology, allowing users to store document images and text information in the blockchain. This application improves the accuracy of document text recognition in cases of document folding or partial line breaks.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, computer equipment, and storage medium for recognizing invoice text.

Background Technology

[0002] Medical bills are essential documents for medical insurance reimbursement. A medical bill includes key fields such as the patient's name, invoice number, total amount, expense item information, payment from the pooled fund, and the date of the visit. Currently, various formats of medical bills exist across the country, and the location and format of these key fields are inconsistent. Even with the national promotion of electronic invoicing, a significant proportion of hospitals have not yet adopted electronic invoicing, and the printed information in the "Other Information" section of electronic invoices varies among hospitals. These circumstances mean that data entry personnel for medical insurance reimbursement must pay attention to the different information on different invoice formats based on their understanding of the procedures.

[0003] Structured recognition in medical billing scenarios typically employs the following solutions: using OCR recognition models to perform text recognition on medical invoices and extracting the full text based on NLP technology; extracting the required key field information based on fixed field slicing or fixed regions; performing recognition and matching in blocks based on multiple detection and segmentation models; and customizing a large number of parsing templates to divert different types of invoices to the corresponding parsing processes.

[0004] However, in practical applications, the following problems often occur: the paper is thin and easily folded and bent, causing multiple pieces of information for the same expense item on the bill to be on different horizontal lines, and information such as name and amount cannot be matched one by one; some fields have line breaks, and only the information of the first line can be extracted.

[0005] Existing technologies have not been specifically optimized for the characteristics of medical invoices, such as easy folding and line breaks in some fields, which often leads to errors in recognition results. This results in more manual intervention steps, prolongs the process, and increases the information cost of claims reimbursement.

Summary of the Invention

[0006] The purpose of this application is to provide a method, apparatus, computer device, and storage medium for recognizing invoice text, so as to solve the problem that text recognition in the prior art is prone to errors when invoices are folded or some fields have line breaks.

[0007] To address the aforementioned technical problems, this application provides a method, apparatus, computer device, and storage medium for recognizing invoice text, employing the following technical solutions:

[0008] A method for recognizing invoice text includes the following steps:

[0009] Obtain the image of the ticket to be identified;

[0010] The document image is subjected to text recognition using a preset recognition and positioning model to obtain text information and corresponding layout information;

[0011] The text information is extracted using a pre-defined multimodal transformer model to obtain multiple corresponding named entities;

[0012] Based on preset pairing rules, a candidate set of entity pairs is constructed by combining the multiple named entities and the layout information;

[0013] The existence of an association between each entity pair is determined by a preset association judgment model;

[0014] The entity pairs that are determined to be related are merged to obtain the merged text.

[0015] Furthermore, prior to the step of performing text recognition on the ticket image using a preset recognition and positioning model, the method further includes:

[0016] Determine whether the ticket image is deflected;

[0017] If there is a deflection, the ticket image is rotated to obtain an upright ticket image.

[0018] Furthermore, prior to the step of performing text recognition on the ticket image using a preset recognition and positioning model, the method further includes:

[0019] The ticket image is input into a preset semantic segmentation model to obtain the corresponding mask image;

[0020] Extract the boundaries of the connected components of the mask image, and define the minimum bounding rectangle region of the boundaries;

[0021] The area outside the rectangular region of the ticket image is filled with white.

[0022] Furthermore, before the step of extracting named entities from the text information using a preset multimodal transformer model, the method further includes:

[0023] The system determines whether the text information contains synonyms by querying a pre-defined thesaurus.

[0024] When it is determined that a synonym exists, the text information is replaced with the synonym.

[0025] Furthermore, the step of extracting named entities from the text information using a preset multimodal transformer model specifically includes:

[0026] The ticket image, the text information, and the layout information are input into the preset multimodal transformer model to extract named entities, thereby obtaining the multiple named entities.

[0027] Furthermore, after the step of merging the entity pairs that are determined to be related, the method further includes:

[0028] Based on preset judgment rules, a target judgment result for the credibility of the merged text is generated.

[0029] Furthermore, after the step of merging the entity pairs that are determined to be related to obtain the merged text, the method further includes:

[0030] The ticket image is input into the recognition and localization model for text recognition to obtain the text confidence rate.

[0031] The text information is input into the multimodal transformer model for named entity extraction to obtain the naming confidence rate;

[0032] The entity pairs are input into the association judgment model to determine whether an association exists, and the judgment confidence rate is obtained.

[0033] The target judgment result for the credibility of the merged text is generated by averaging or weighting at least two of the text confidence rate, the naming confidence rate, and the judgment confidence rate.

[0034] When the target judgment result exceeds the preset threshold, the merged text is output directly.

[0035] To address the aforementioned technical problems, this application also provides a document text recognition device, which employs the following technical solution:

[0036] The acquisition module is used to acquire the image of the ticket to be identified;

[0037] The recognition module is used to perform text recognition on the ticket image using a preset recognition and positioning model to obtain text information and corresponding layout information;

[0038] The extraction module is used to extract named entities from the text information using a preset multimodal transformer model to obtain multiple corresponding named entities;

[0039] The construction module is used to construct a candidate set of entity pairs based on preset pairing rules, combining the multiple named entities and the layout information;

[0040] The judgment module is used to determine whether there is a relationship between each entity pair using a preset association judgment model;

[0041] The merge module is used to merge entity pairs that are determined to be related, resulting in merged text.

[0042] To address the aforementioned technical problems, this application also provides a computer device that employs the following technical solution:

[0043] A computer device includes a memory and a processor, wherein the memory stores computer-readable instructions, and the processor executes the computer-readable instructions to implement the steps of the invoice text recognition method described above.

[0044] To address the aforementioned technical problems, this application also provides a computer-readable storage medium, employing the technical solution described below:

[0045] A computer-readable storage medium storing computer-readable instructions, which, when executed by a processor, implement the steps of the invoice text recognition method described above.

[0046] Compared with the prior art, the embodiments of this application have the following advantages: This application performs text recognition on the ticket image, then extracts entities from the recognized text information, and constructs a candidate set of entity pairs by combining multiple extracted named entities and layout information. By judging whether each entity pair is related, the entity pairs that are related are merged, thereby effectively merging the text information with high correlation and improving the accuracy of text recognition on the ticket in cases of folding, partial field line breaks, etc.

Attached Figure Description

[0047] To more clearly illustrate the solutions in this application, the accompanying drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 is an exemplary system architecture diagram to which this application can be applied;

[0049] Figure 2 is a flowchart of an embodiment of the invoice text recognition method according to this application;

[0050] Figure 3 is a flowchart of another embodiment of the invoice text recognition method according to this application;

[0051] Figure 4 is a flowchart of another embodiment of the invoice text recognition method according to this application;

[0052] Figure 5 is a schematic diagram of the structure of one embodiment of the invoice text recognition device according to this application;

[0053] Figure 6 is a schematic diagram of the structure of one embodiment of the computer device according to this application.

Detailed Implementation

[0054] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein in the specification of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having," and any variations thereof, in the specification, claims, and foregoing drawings of this application, are intended to cover non-exclusive inclusion. The terms "first," "second," etc., in the specification, claims, or foregoing drawings of this application are used to distinguish different objects, not to describe a particular order.

[0055] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0056] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

[0057] As shown in Figure 1, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc.

[0058] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social media platform software, etc.

[0059] Terminal devices 101, 102, and 103 can be various electronic devices with displays and support web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptops, and desktop computers, etc.

[0060] Server 105 can be a server that provides various services, such as a backend server that supports the pages displayed on terminal devices 101, 102, and 103.

[0061] It should be noted that the document text recognition method provided in this application is generally executed by a server / terminal device, and correspondingly, the document text recognition device is generally set in the server / terminal device.

[0062] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0063] With continued reference to Figure 2, a flowchart of an embodiment of the invoice text recognition method according to this application is shown. The invoice text recognition method includes the following steps:

[0064] Step S201: Obtain the image of the ticket to be identified.

[0065] In this embodiment, the electronic device on which the document text recognition method runs (e.g., the server / terminal device shown in Figure 1) can respond to the client's ticket image upload request and receive the ticket image uploaded by the client via a wired or wireless connection. It should be noted that the aforementioned wireless connection methods may include, but are not limited to, 3G / 4G / 5G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra-wideband) connections, and other currently known or future wireless connection methods.

[0066] Step S202: The document image is subjected to text recognition using a preset recognition and positioning model to obtain text information and corresponding layout information.

[0067] In this embodiment, the recognition and positioning model can be any of various existing models that both recognize and locate text, such as an Optical Character Recognition (OCR) model.

[0068] Specifically, the ticket image can be used as the input to the OCR model, and the output can be the corresponding text information (such as multiple text information slices) in the ticket image and the layout information (such as coordinate information) of each text information.

[0069] Step S203: Named entities are extracted from the text information using a preset multimodal transformer model to obtain multiple corresponding named entities.

[0070] In this embodiment, the overall structure of the multimodal transformer model is an encoder-decoder framework. Each encoder can include multiple encoder sub-modules, and each decoder can include multiple decoder sub-modules. A dedicated vocabulary can be built based on words in fields such as medical bills and insurance bills (e.g., words appearing in the field of medical bills: name, invoice number, expense item name, expense item amount) to pre-train the model.

[0071] In one embodiment, text information can be directly used as input to a preset multimodal transformer model, and multiple named entities can be output (each named entity includes the entity type and the corresponding multimodal information).

[0072] In another embodiment, the ticket image, text information, and layout information can be input into a preset multimodal transformer model, which outputs multiple named entities. By inputting the ticket image, text information, and layout information into the preset multimodal transformer model, multi-dimensional references of image, text, and layout can be provided for entity extraction, thereby improving the accuracy of entity extraction.

[0073] Multimodal information refers to information in multiple modalities, including text, images, video, and audio. Based on the above examples, when the input to the transformer model is a single text modality, the output is also a text modality; when the input to the model consists of a ticket image, text, and layout, the output can be a fused multimodal representation of the ticket image, text, and layout.

[0074] In one embodiment, the different text information slices can be connected using the [SEP] flag; the layout information can be the coordinates of each text information slice obtained based on the recognition and positioning model.
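The input described in [0074] can be sketched as follows. This is a minimal illustration only: the `TextSlice` type, the `build_model_input` helper, the `(x_min, y_min, x_max, y_max)` box format, and the top-to-bottom, left-to-right ordering are all assumptions made for the example and are not specified by the application.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class TextSlice:
    """One text information slice with its layout information (hypothetical type)."""
    text: str
    box: Box


def build_model_input(slices: List[TextSlice], sep: str = "[SEP]") -> Tuple[str, List[Box]]:
    """Join slice texts with the separator flag and return the boxes in the same order."""
    # Assumed reading order: sort by top edge first, then by left edge.
    ordered = sorted(slices, key=lambda s: (s.box[1], s.box[0]))
    joined = f" {sep} ".join(s.text for s in ordered)
    return joined, [s.box for s in ordered]
```

The joined string and the aligned list of boxes would then be passed, together with the ticket image where applicable, to whatever multimodal model is used.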

[0075] Taking a medical bill as an example, by extracting named entities from multiple text information slices obtained by text recognition of the medical bill image in step S202, entity types such as name, invoice number, expense item name, expense item amount, and multimodal information corresponding to each entity can be extracted.

[0076] Step S204: Based on preset pairing rules, construct a candidate set of entity pairs by combining multiple named entities and their corresponding layout information.

[0077] In this embodiment, different pairing rules can be preset based on different application scenarios. Taking medical bills as an example, entity pairs that share the same entity type (such as name, quantity, unit, or amount) and overlap in the X-axis direction can be added to the candidate set; entity pairs that have different expense item entity types but are close in the Y-axis direction can also be added to the candidate set.
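The two example pairing rules in [0077] can be sketched as below. The `Entity` type, the entity type labels, the box format, and the `max_y_gap` threshold of 20 pixels are hypothetical choices for illustration; the application leaves the concrete rules and thresholds to the implementer.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """A named entity with its layout box (hypothetical type)."""
    text: str
    etype: str                      # e.g. "item_name", "item_amount" (assumed labels)
    box: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max)


def x_overlap(a: Entity, b: Entity) -> bool:
    """True if the horizontal extents of the two boxes intersect."""
    return min(a.box[2], b.box[2]) > max(a.box[0], b.box[0])


def y_distance(a: Entity, b: Entity) -> float:
    """Absolute distance between the vertical centers of the two boxes."""
    return abs((a.box[1] + a.box[3]) / 2 - (b.box[1] + b.box[3]) / 2)


def build_candidates(entities: List[Entity], max_y_gap: float = 20.0) -> List[Tuple[Entity, Entity]]:
    """Apply the two example rules to every unordered pair of entities."""
    pairs = []
    for a, b in combinations(entities, 2):
        if a.etype == b.etype and x_overlap(a, b):
            pairs.append((a, b))   # same type, overlapping in X: possible line break
        elif a.etype != b.etype and y_distance(a, b) <= max_y_gap:
            pairs.append((a, b))   # different types, close in Y: possibly the same row
    return pairs
```

Restricting the candidate set this way keeps the number of pairs passed to the association judgment model small compared with scoring every possible pair.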

[0078] Step S205: Determine whether there is a relationship between each entity pair using a preset association judgment model.

[0079] In this embodiment, the association judgment model can be a classification model using a biaffine attention mechanism. Multiple named entities (each named entity may include an entity type and its corresponding multimodal information) output by the multimodal transformer model can be used as input to the association judgment model to obtain a classification judgment result indicating whether entity pairs are associated. Specifically, the multimodal information of the head node, the entity category information of the head node, the multimodal information of the tail node, and the entity category information of the tail node of each entity pair can be concatenated and input into the association judgment model. For example, take any two named entities output by the multimodal transformer model as an entity pair: the multimodal information of one named entity is "text + image + layout" and its entity type is "fee item name," while the multimodal information of the other named entity is "text + image + layout" and its entity type is "fee item amount." After the information of the two named entities is concatenated, the multimodal information of the head node is "text + image + layout" and the entity category information of the head node is "fee item name"; the multimodal information of the tail node is "text + image + layout" and the entity category information of the tail node is "fee item amount". The concatenated information is input into the association judgment model, which outputs the judgment result on whether the entity pair is related.
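For orientation, a biaffine scorer over a head vector and a tail vector can be sketched as follows. This is a toy illustration, not the model of the application: the vectors stand in for the concatenated multimodal and entity category features, and the parameters `U`, `w`, `b` and the 0.5 threshold are hypothetical values that a real system would learn or tune.

```python
import math
from typing import Sequence


def biaffine_score(head: Sequence[float], tail: Sequence[float],
                   U: Sequence[Sequence[float]], w: Sequence[float], b: float) -> float:
    """Return sigmoid(head^T U tail + w . [head; tail] + b) as an association probability."""
    # Bilinear term: interaction between every head feature and every tail feature.
    bilinear = sum(head[i] * U[i][j] * tail[j]
                   for i in range(len(head)) for j in range(len(tail)))
    # Linear term over the concatenation of head and tail features.
    linear = sum(wi * xi for wi, xi in zip(w, list(head) + list(tail)))
    return 1.0 / (1.0 + math.exp(-(bilinear + linear + b)))


def is_related(head, tail, U, w, b, threshold: float = 0.5) -> bool:
    """Binary judgment: the pair is treated as related when the score reaches the threshold."""
    return biaffine_score(head, tail, U, w, b) >= threshold
```

The score itself could also serve as the judgment confidence rate when a confidence value for the association step is needed.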

[0080] Step S206: Merge the entity pairs that are determined to be related to obtain the merged text.

[0081] In this embodiment, entity pairs that are determined in step S205 to be related are merged, while entity pairs that are determined to be unrelated are not merged. For example, text split by a line break can be merged, and the information belonging to the same expense item (such as name, quantity, unit, and amount) can be integrated.
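One possible way to turn pairwise judgments into merged text is to group entities that are transitively related. The sketch below uses a union-find structure for that grouping; this is an assumed technique, since the application does not prescribe how the merge is carried out, and the `merge_related` name and space joiner are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_related(texts: List[str], related: List[Tuple[int, int]], joiner: str = " ") -> List[str]:
    """Group entity texts connected by related pairs and join each group in index order."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in related:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smallest index as representative

    groups: Dict[int, List[int]] = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return [joiner.join(texts[i] for i in members) for members in groups.values()]
```

Entities that appear in no related pair are returned unchanged, which matches the statement that unrelated pairs are not merged.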

[0082] This application performs text recognition on the invoice image, then extracts entities from the recognized text information, and constructs a candidate set of entity pairs by combining multiple extracted named entities and layout information. By judging whether each entity pair is related, the related entity pairs are merged, thereby effectively merging highly related text information and improving the accuracy of text recognition on invoices in cases of folding, partial field wrapping, etc.
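Steps S202 to S206 can be wired together as in the skeleton below. It only fixes the order of the steps; every stage is passed in as a callable, and the function name `recognize_ticket` and the callable signatures are assumptions for illustration rather than interfaces defined by the application.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[Any, Any]


def recognize_ticket(
    image: Any,
    ocr: Callable[[Any], Tuple[List[str], List[Any]]],
    extract_entities: Callable[[Any, List[str], List[Any]], List[Any]],
    build_pairs: Callable[[List[Any], List[Any]], List[Pair]],
    is_related: Callable[[Pair], bool],
    merge: Callable[[List[Any], List[Pair]], List[str]],
) -> List[str]:
    """Run the recognition flow on an already acquired ticket image (S201)."""
    texts, layout = ocr(image)                                   # S202: text + layout
    entities = extract_entities(image, texts, layout)            # S203: named entities
    candidates = build_pairs(entities, layout)                   # S204: candidate pairs
    related = [pair for pair in candidates if is_related(pair)]  # S205: association judgment
    return merge(entities, related)                              # S206: merged text
```

Passing the stages as parameters makes it straightforward to substitute a different OCR engine or association model without changing the flow.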

[0083] With continued reference to Figure 3, a flowchart of another embodiment of the document text recognition method according to this application is shown. In some optional implementations of this embodiment, after acquiring the document image to be recognized in step S201 and before performing text recognition on the document image using a preset recognition and positioning model in step S202, the electronic device may further perform the following steps:

[0084] Step S207: Determine whether the ticket image is deflected.

[0085] Step S208: If there is a deflection, the ticket image is rotated to obtain an upright ticket image.

[0086] In this embodiment, the rotation operation on the ticket image can be performed using various existing image processing methods or neural network models for orientation classification, such as residual neural network models. Specifically, the ticket image is input to the neural network model, which outputs a four-class result for the orientation of the ticket image. For ticket images whose classification result is not upright, the image is rotated to obtain an upright ticket image.
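The correction after the four-class orientation result can be sketched with plain nested lists standing in for an image. The class encoding is an assumption made here: class k (0 to 3) is taken to mean the ticket appears rotated k × 90 degrees clockwise, so it is corrected by k counter-clockwise quarter turns. The classifier itself is not shown.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values, a stand-in for a real image array


def rotate_cw(img: Image) -> Image:
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img: Image) -> Image:
    """Rotate 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def to_upright(img: Image, orientation_class: int) -> Image:
    """Undo a rotation of orientation_class * 90 degrees clockwise (assumed encoding)."""
    if orientation_class not in (0, 1, 2, 3):
        raise ValueError("orientation_class must be 0, 1, 2, or 3")
    out = [list(row) for row in img]
    for _ in range(orientation_class):
        out = rotate_ccw(out)
    return out
```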

[0087] This application reduces the requirement for the orientation of the acquired ticket image by rotating the ticket image, and avoids text recognition errors caused by the orientation of the ticket, thereby improving the applicability of the ticket text recognition method and the overall accuracy of text recognition.

[0088] In some optional implementations of this embodiment, after obtaining the ticket image to be recognized in step S201 and before performing text recognition on the ticket image using a preset recognition and positioning model in step S202, the electronic device may also perform the following steps:

[0089] The ticket image is input into a preset semantic segmentation model to obtain the corresponding mask image.

[0090] Extract the boundaries of the connected components of the mask image and define the minimum bounding rectangle region of the boundary.

[0091] The area outside the rectangular region of the ticket image is filled with white.
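The last two preprocessing steps can be sketched as follows, given a mask that has already been produced by the segmentation model. The sketch is deliberately simplified and rests on stated assumptions: it treats all foreground pixels as one ticket, uses an axis-aligned bounding rectangle rather than a possibly rotated minimum rectangle, and represents the image and mask as nested lists with 255 as white.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]


def bounding_rect(mask: Grid) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) of the nonzero mask pixels, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


def fill_outside_white(image: Grid, mask: Grid, white: int = 255) -> Grid:
    """Keep pixels inside the bounding rectangle and paint everything else white."""
    rect = bounding_rect(mask)
    if rect is None:
        return [[white] * len(row) for row in image]
    x0, y0, x1, y1 = rect
    return [
        [px if (x0 <= x <= x1 and y0 <= y <= y1) else white for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

To split several bills contained in one image, the same fill would be applied once per connected component, each with its own rectangle.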

[0092] In this embodiment, the accuracy of subsequent text recognition and merging of the ticket image is often affected by image noise, border background, or the presence of multiple bills in the same ticket image (e.g., multiple bills with consecutive numbers in the same ticket image). Therefore, the ticket image can be preprocessed using the above steps to obtain a ticket image after removing background information and splitting it.

[0093] This application preprocesses the invoice image by inputting it into a preset semantic segmentation model, which removes background information from the invoice image and splits the case where there are multiple invoices in the same invoice image, thereby facilitating subsequent text recognition and improving the overall text recognition accuracy.

[0094] With continued reference to Figure 4, a flowchart of another embodiment of the document text recognition method according to this application is shown. In some optional implementations of this embodiment, after text recognition of the document image is performed using a preset recognition and positioning model in step S202, and before named entity extraction is performed on the text information using a preset multimodal transformer model in step S203, the electronic device may further perform the following steps:

[0095] Step S209: Determine whether there are synonyms in the text information by querying a preset thesaurus.

[0096] Step S210: When it is determined that there are synonyms, the text information is replaced with synonyms.

[0097] In this embodiment, a pre-defined thesaurus can be consulted to determine whether any of the textual information of the entity to be extracted contains pre-defined synonyms. If synonyms exist, the textual information is replaced with synonyms, such as replacing "payment date," "settlement date," and "bill date" with "charge date." By replacing these textual information with synonyms, named entities can be extracted more accurately, reducing the impact of the long-tail effect. Specifically, a replacement vocabulary template library can be pre-established to determine different expressions for the same replacement field. In practice, the template library can be continuously expanded based on the online entry of various types of invoices to improve the adaptability of the template for replacing various types of textual information. In addition, a pre-defined neural network model can also be used to replace textual information with synonyms.
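A thesaurus lookup of this kind can be sketched with a plain dictionary. The three variants and the canonical term "charge date" come from the example above; the `SYNONYMS` table name, the `normalize_field` helper, and the case-sensitive substring matching are assumptions for illustration.

```python
from typing import Dict

# Variant expression -> canonical field name (entries taken from the example in the text).
SYNONYMS: Dict[str, str] = {
    "payment date": "charge date",
    "settlement date": "charge date",
    "bill date": "charge date",
}


def normalize_field(text: str, thesaurus: Dict[str, str] = SYNONYMS) -> str:
    """Replace every known variant in the text with its canonical term."""
    result = text
    # Try longer variants first so a short variant cannot split a longer one.
    for variant in sorted(thesaurus, key=len, reverse=True):
        result = result.replace(variant, thesaurus[variant])
    return result
```

Expanding the template library then amounts to adding entries to the dictionary as new invoice formats are encountered.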

[0098] In some optional implementations of this embodiment, after step S206, where the entity pairs determined to be related are merged, the electronic device may further perform the following steps:

[0099] Based on preset judgment rules, a target judgment result for the credibility of merged text is generated.

[0100] In this embodiment, the above-mentioned preset judgment rules can be set in a targeted manner according to different application scenarios. For example, the rules may check whether the patient's name is consistent with the name of the insured person, whether multiple identical dates are extracted, or whether the total amount of expenses is consistent with the sum of the itemized amounts; a confidence level of 1 can be assigned when the check result is true.

[0101] This embodiment generates a target judgment result for the credibility of the merged text, so that high-confidence fields can be skipped during manual verification, reducing the cost of manual review.
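The three example rules can be sketched as below. The field names and the `check_rules` signature are hypothetical. The text only states that a confidence of 1 is assigned when a check is true; returning `None` when a check is not met (meaning no rule-based confidence is available) is an assumption of this sketch.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def check_rules(patient_name: str, insured_name: str, dates: List[str],
                item_amounts: List[Decimal], total_amount: Decimal) -> Dict[str, Optional[float]]:
    """Assign confidence 1.0 to each field whose consistency rule holds, else None."""
    name_ok = patient_name == insured_name
    # "Multiple identical dates": at least two dates were extracted and they all agree.
    date_ok = len(dates) > 1 and len(set(dates)) == 1
    # Decimal avoids floating point rounding when summing currency amounts.
    total_ok = sum(item_amounts, Decimal("0")) == total_amount
    return {
        "patient_name": 1.0 if name_ok else None,
        "date": 1.0 if date_ok else None,
        "total_amount": 1.0 if total_ok else None,
    }
```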

[0102] In some optional implementations of this embodiment, after step S206, where the entity pairs determined to be related are merged, the electronic device may further perform the following steps:

[0103] The ticket image is input into the recognition and localization model for text recognition, and the text confidence rate is obtained.

[0104] The text information is input into a multimodal transformer model for named entity extraction, and the naming confidence rate is obtained.

[0105] The entity pairs are input into the association judgment model to determine whether an association exists, and the judgment confidence rate is obtained.

[0106] The target judgment result for the credibility of the merged text is generated by averaging or weighting at least two of the text confidence rate, the naming confidence rate, and the judgment confidence rate.

[0107] When the target judgment result exceeds the preset threshold, the merged text is output directly.

[0108] In this embodiment, the confidence rates of multiple models can be combined to generate a target judgment result for the combined text credibility. In one embodiment, in addition to combining at least two of the text confidence rate, naming confidence rate, and judgment confidence rate to generate a target judgment result for the combined text credibility, the confidence rates output by other models used in the document text recognition process of this application, such as the semantic segmentation model and the directional classification model mentioned in the above embodiment, can also be added, and a confidence score can be assigned by combining the various confidence rates.

[0109] This embodiment generates a target judgment result for the credibility of merged text. When the target judgment result (confidence level) exceeds a preset threshold, the merged text is directly output. When the target judgment result does not exceed the preset threshold, a verification interface can subsequently be displayed to the user, allowing the user to check the merged text and select whether verification passes or fails. If a user-triggered verification pass signal is received, the merged text is output. If a user-triggered verification fail signal is received, the merged text is deleted, and the two named entities before merging are output separately. This method allows high-confidence fields to be skipped during subsequent manual verification, reducing manual review costs. In addition, for fields that cannot be judged based on the above-mentioned preset judgment rules, a confidence level can be assigned by taking the average or weighted average of the text confidence rate, naming confidence rate, and judgment confidence rate output by the above-mentioned recognition and positioning model, multimodal transformer model, and association judgment model, respectively.
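The averaging and threshold logic can be sketched as follows. The function names, the default threshold of 0.9, and the `"output"` / `"manual_review"` labels are hypothetical; the application does not give concrete weights or threshold values.

```python
from typing import List, Optional, Tuple


def fuse_confidence(rates: List[float], weights: Optional[List[float]] = None) -> float:
    """Average (or weighted-average) at least two model confidence rates."""
    if len(rates) < 2:
        raise ValueError("at least two confidence rates are required")
    if weights is None:
        weights = [1.0] * len(rates)  # plain average
    if len(weights) != len(rates) or sum(weights) <= 0:
        raise ValueError("weights must match rates and have a positive sum")
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


def route(merged_text: str, confidence: float, threshold: float = 0.9) -> Tuple[str, str]:
    """Output directly when the confidence exceeds the threshold, else send to manual review."""
    if confidence > threshold:
        return ("output", merged_text)
    return ("manual_review", merged_text)
```

For example, rates of 0.9, 0.8, and 1.0 from the three models average to 0.9, which does not exceed a threshold of 0.9 and would therefore be routed to manual review.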

[0110] It should be emphasized that, in order to further ensure the privacy and security of the aforementioned ticket images and text information, the aforementioned ticket images and text information can also be stored in a blockchain node.

[0111] The blockchain referred to in this application is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

[0112] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0113] Artificial intelligence (AI) foundational technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning. Those skilled in the art will understand that all or part of the processes in the above method embodiments can be implemented by instructing related hardware with computer-readable instructions, which can be stored in a computer-readable storage medium. When executed, the computer-readable instructions can carry out the processes of the method embodiments described above. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, optical disk, or read-only memory (ROM), or it can be a random access memory (RAM).

[0114] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0115] With further reference to Figure 5, as an implementation of the method shown in Figure 2 above, this application provides an embodiment of a document text recognition device. This device embodiment corresponds to the method embodiment shown in Figure 2, and the device can be specifically applied to various electronic devices.

[0116] As shown in Figure 5, the document text recognition device 400 described in this embodiment includes: an acquisition module 401, a recognition module 402, an extraction module 403, a construction module 404, a judgment module 405, and a merging module 406. Wherein:

[0117] The acquisition module 401 is used to acquire the image of the ticket to be identified.

[0118] The recognition module 402 is used to perform text recognition on the ticket image through a preset recognition and positioning model to obtain text information and corresponding layout information.

[0119] The extraction module 403 is used to extract named entities from text information using a preset multimodal transformer model to obtain multiple corresponding named entities.

[0120] The construction module 404 is used to construct a candidate set of entity pairs based on preset pairing rules, combining multiple named entities and layout information.

[0121] The judgment module 405 is used to judge whether there is a relationship between each entity pair by using a preset association judgment model.

[0122] The merging module 406 is used to merge entity pairs that are determined to be related, and obtain merged text.
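
For the judgment module 405, the claims describe the association judgment model as a classification model based on a biaffine (dual affine) attention mechanism, whose input concatenates the multimodal information and entity type information of the head and tail nodes. The following is only a sketch of one common biaffine scoring form, under the assumption that the head and tail features are each concatenated into a vector and scored as `head^T U tail + w^T [head; tail] + b`. The parameters `U`, `w`, and `b`, the sigmoid output, and the 0.5 decision threshold are illustrative assumptions, not details given by this application.

```python
import math
from typing import List, Sequence, Tuple


def concat(*parts: Sequence[float]) -> List[float]:
    """Concatenate feature vectors into one flat list."""
    out: List[float] = []
    for part in parts:
        out.extend(part)
    return out


def biaffine_score(head: Sequence[float], tail: Sequence[float],
                   U: Sequence[Sequence[float]], w: Sequence[float],
                   b: float) -> float:
    """Bilinear term head^T U tail plus a linear term over [head; tail] plus a bias."""
    bilinear = sum(head[i] * U[i][j] * tail[j]
                   for i in range(len(head)) for j in range(len(tail)))
    linear = sum(wi * xi for wi, xi in zip(w, list(head) + list(tail)))
    return bilinear + linear + b


def is_associated(head_mm, head_type, tail_mm, tail_type,
                  U, w, b, threshold: float = 0.5) -> Tuple[bool, float]:
    """Binary decision and probability for one entity pair."""
    head = concat(head_mm, head_type)  # head multimodal info + head entity type
    tail = concat(tail_mm, tail_type)  # tail multimodal info + tail entity type
    prob = 1.0 / (1.0 + math.exp(-biaffine_score(head, tail, U, w, b)))
    return prob > threshold, prob


# Hypothetical toy parameters: identity bilinear map, zero linear weights.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = [0.0] * 6
print(is_associated([1.0], [0.0, 1.0], [1.0], [0.0, 1.0], U, w, b=-1.0))
```

In a trained model the parameters would be learned, and the probability returned here could serve as the judgment confidence rate.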

[0123] In this embodiment, text information can be directly used as input to a preset multimodal transformer model; alternatively, ticket image, text information, and layout information can be input into the preset multimodal transformer model to output multiple named entities. By inputting ticket image, text information, and layout information into the preset multimodal transformer model, more dimensions of reference information can be provided for entity extraction, thereby improving the accuracy of entity extraction.

[0124] The invoice text recognition device of this application can preset different pairing rules for different application scenarios. Taking medical bills as an example, entity pairs of the same entity type (such as name, quantity, unit, or amount) that overlap in the X-axis direction can be added to the candidate set, and entity pairs of different expense item entity types that are close in the Y-axis direction can also be added to the candidate set.
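
The two medical-bill pairing rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Entity` structure, the `EXPENSE_TYPES` label set, the use of box centres to measure closeness on the Y axis, and the `max_y_gap` value are all hypothetical choices, since this application does not specify them.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

# Hypothetical set of expense item entity types (illustrative labels only).
EXPENSE_TYPES = {"name", "quantity", "unit", "amount"}


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def overlaps_on_x(a: Entity, b: Entity) -> bool:
    """True when the horizontal extents of the two boxes intersect."""
    return min(a.x1, b.x1) > max(a.x0, b.x0)


def close_on_y(a: Entity, b: Entity, max_gap: float) -> bool:
    """True when the vertical centres are within max_gap of each other."""
    return abs((a.y0 + a.y1) / 2 - (b.y0 + b.y1) / 2) <= max_gap


def build_candidate_pairs(entities: List[Entity],
                          max_y_gap: float = 10.0) -> List[Tuple[Entity, Entity]]:
    candidates = []
    for a, b in combinations(entities, 2):
        same_type = a.etype == b.etype
        both_expense = a.etype in EXPENSE_TYPES and b.etype in EXPENSE_TYPES
        if same_type and overlaps_on_x(a, b):
            # Rule 1: same type, overlapping on X (e.g. a field wrapped onto two lines).
            candidates.append((a, b))
        elif not same_type and both_expense and close_on_y(a, b, max_y_gap):
            # Rule 2: different expense item types lying on roughly the same row.
            candidates.append((a, b))
    return candidates
```

Restricting candidates in this way keeps the number of pairs passed to the association judgment model well below the full set of all entity pairs.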

[0125] This embodiment performs text recognition on the ticket image, then extracts entities from the recognized text information, and constructs a candidate set of entity pairs by combining multiple extracted named entities and layout information. By determining whether each entity pair is related, related entity pairs are merged, thereby effectively merging highly related text information and improving the accuracy of text recognition on tickets in cases of folding, partial field wrapping, etc.
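
The merging step summarized above could look like the following sketch. It assumes that merging means concatenating the texts of associated entities in reading order (top to bottom, then left to right), which suits the case of a field wrapped across lines. The data layout, the `merge_related` function, and the separator are hypothetical; this application does not prescribe how the merged text is assembled.

```python
from typing import Dict, Iterable, List, Tuple

# Each entity is stored as (text, x, y), where (x, y) is the top-left corner.
EntityMap = Dict[int, Tuple[str, float, float]]


def merge_related(entities: EntityMap,
                  related_pairs: Iterable[Tuple[int, int]],
                  sep: str = "") -> List[str]:
    """Group entities linked by related pairs and join each group's text."""
    parent = {key: key for key in entities}

    def find(key: int) -> int:
        # Union-find lookup with path halving.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = {}
    for key in entities:
        groups.setdefault(find(key), []).append(key)

    merged = []
    for members in groups.values():
        # Reading order: sort by y first, then by x.
        members.sort(key=lambda k: (entities[k][2], entities[k][1]))
        merged.append(sep.join(entities[k][0] for k in members))
    return merged


# Hypothetical example: a drug name wrapped onto a second line.
entities = {1: ("Amoxicillin", 10, 100), 2: ("capsules", 10, 122), 3: ("2", 200, 100)}
print(merge_related(entities, [(1, 2)], sep=" "))
```

Grouping with union-find lets a field that wraps over more than two lines be merged from several pairwise decisions.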

[0126] In some optional implementations of this embodiment, the ticket text recognition device further includes a deflection judgment module and a rotation module.

[0127] The deflection judgment module is used to determine whether there is deflection in the ticket image.

[0128] The rotation module is used to rotate the ticket image if there is any deflection, so as to obtain an upright ticket image.

[0129] This embodiment reduces the requirement for the orientation of the acquired ticket image by rotating the ticket image, and avoids text recognition errors caused by the orientation of the ticket, thereby improving the applicability of the ticket text recognition method of this application and the overall accuracy of text recognition.
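
As a rough illustration of the deflection judgment and rotation modules, the sketch below assumes that an orientation classifier (not shown) reports the deflection as a clockwise angle of 0, 90, 180, or 270 degrees, and that the image is a nested list of pixel rows. Both the function names and the restriction to multiples of 90 degrees are assumptions made for this example.

```python
from typing import List

Image = List[List[int]]


def rotate90_cw(img: Image) -> Image:
    """Rotate a row-major image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def correct_orientation(img: Image, deflection_cw: int) -> Image:
    """Undo a clockwise deflection so that the ticket image is upright."""
    if deflection_cw not in (0, 90, 180, 270):
        raise ValueError("this sketch only handles multiples of 90 degrees")
    # Rotating clockwise by the remaining angle completes a full turn.
    turns = ((360 - deflection_cw) % 360) // 90
    for _ in range(turns):
        img = rotate90_cw(img)
    return img
```

When the reported deflection is 0, the image is returned unchanged, which corresponds to the case where no deflection is detected.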

[0130] In some optional implementations of this embodiment, the invoice text recognition device further includes a preprocessing module, a setting module, and a filling module.

[0131] The preprocessing module is used to input the ticket image into a preset semantic segmentation model to obtain the corresponding mask image.

[0132] The setting module is used to extract the boundaries of connected components in the mask image and to set the minimum bounding rectangle region of the boundaries.

[0133] The fill module is used to fill the area outside the rectangular region of the ticket image with white.

[0134] This embodiment preprocesses the invoice image using a preset semantic segmentation model. This can remove background information from the invoice image and separate multiple invoices that appear in a single image, thereby facilitating subsequent text recognition and improving the overall accuracy of text recognition.
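
The preprocessing steps (mask image, connected components, bounding rectangle, white fill) can be sketched in plain Python. The mask is assumed to come from the semantic segmentation model as a binary grid. For simplicity this sketch uses an axis-aligned bounding box and 4-connectivity; a real implementation might instead compute a rotated minimum-area rectangle, for example with OpenCV's `findContours` and `minAreaRect`. The function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

WHITE = 255
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def component_boxes(mask: List[List[int]]) -> List[Box]:
    """Bounding boxes of the 4-connected foreground regions of a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def whiten_outside(image: List[List[int]], box: Box) -> List[List[int]]:
    """Copy of a grayscale image with every pixel outside the box set to white."""
    top, left, bottom, right = box
    return [[px if top <= y <= bottom and left <= x <= right else WHITE
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]
```

Calling `whiten_outside` once per box yields one cleaned image per ticket, which is one way to handle several invoices photographed together.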

[0135] In some optional implementations of this embodiment, the invoice text recognition device further includes a replacement judgment module and a replacement module.

[0136] The replacement judgment module is used to determine whether there are synonyms in the text information by querying a preset thesaurus.

[0137] The replacement module is used to replace text information with synonyms when synonyms are detected.

[0138] In this embodiment, a thesaurus can be pre-established to record the different ways of expressing the same field, so that each variant can be replaced with a standard form; alternatively, a preset neural network model can be used to perform synonym replacement on the text information.

[0139] This embodiment standardizes and replaces keywords, enabling more accurate extraction of named entities and thus reducing the impact of the long-tail effect.
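
A thesaurus-based replacement of this kind might be sketched as a simple lookup table that maps each variant expression to a standard keyword. The entries in `THESAURUS` and the function name are hypothetical examples, and plain substring replacement is an assumption of this sketch; a production system would likely match at the field or token level.

```python
from typing import Dict

# Hypothetical thesaurus: variant expression -> standard keyword.
THESAURUS: Dict[str, str] = {
    "Unit Pr.": "Unit price",
    "Qty.": "Quantity",
    "Amt": "Amount",
}


def replace_synonyms(text: str, thesaurus: Dict[str, str]) -> str:
    """Replace every known variant in the text with its standard keyword."""
    # Longer variants are handled first so that a short variant cannot
    # break up a longer one that contains it.
    for variant in sorted(thesaurus, key=len, reverse=True):
        if variant in text:
            text = text.replace(variant, thesaurus[variant])
    return text


print(replace_synonyms("Qty. 2  Amt 30.00", THESAURUS))
```

Normalizing headers in this way means the entity extraction model sees fewer rare spellings of the same field.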

[0140] In some optional implementations of this embodiment, the invoice text recognition device further includes a first calculation module, which is used to generate a target judgment result of the credibility of the merged text based on preset judgment rules.

[0141] This embodiment generates a target judgment result of the credibility of merged text, which can ignore high confidence fields during subsequent manual verification, thereby reducing the cost of manual review.

[0142] In some optional implementations of this embodiment, the invoice text recognition device further includes a text acquisition module, a naming acquisition module, a judgment acquisition module, a second calculation module, and a text output module.

[0143] The text acquisition module is used to input the ticket image into the recognition and localization model for text recognition and to obtain the text confidence rate.

[0144] The naming acquisition module is used to input text information into the multimodal transformer model for named entity extraction and to obtain the naming confidence rate.

[0145] The judgment acquisition module is used to input entity pairs into the association judgment model to judge whether an association exists and obtain the judgment confidence rate.

[0146] The second calculation module is used to compute the average or weighted average of at least two of the text confidence rate, naming confidence rate, and judgment confidence rate to generate the target judgment result for the credibility of the merged text.

[0147] The text output module is used to directly output the merged text when the target judgment result exceeds a preset threshold.

[0148] In this embodiment, for fields that cannot be judged based on preset judgment rules, a confidence level can be comprehensively assigned by averaging or weighted averaging the confidence rates output by the aforementioned recognition and positioning model, multimodal transformer model, and association judgment model. In one embodiment, the aforementioned multiple models may also include the semantic segmentation model, orientation classification model, etc., mentioned in the above embodiments.
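
The averaging or weighted averaging of confidence rates can be written out as a short sketch. The dictionary keys, the function name, and the example weights are hypothetical; this application does not state which weights are used.

```python
from typing import Dict, Optional


def combine_confidences(confidences: Dict[str, float],
                        weights: Optional[Dict[str, float]] = None) -> float:
    """Plain or weighted average of at least two model confidence rates."""
    if len(confidences) < 2:
        raise ValueError("at least two confidence rates are required")
    if weights is None:
        # Equal weights reduce the weighted average to a plain average.
        weights = {name: 1.0 for name in confidences}
    total = sum(weights[name] for name in confidences)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(confidences[name] * weights[name] for name in confidences) / total


rates = {"text": 0.9, "naming": 0.8, "judgment": 0.7}
print(combine_confidences(rates))
print(combine_confidences(rates, {"text": 2.0, "naming": 1.0, "judgment": 1.0}))
```

Confidence rates from further models, such as the semantic segmentation or orientation classification model, could be added as extra entries in the same dictionary.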

[0149] To address the aforementioned technical problems, embodiments of this application also provide a computer device. Please refer to Figure 6, which is a basic structural block diagram of the computer device in this embodiment.

[0150] The computer device 6 includes a memory 61, a processor 62, and a network interface 63 that are interconnected via a system bus. It should be noted that only the computer device 6 with components 61-63 is shown in the figure; however, it should be understood that it is not required to implement all the shown components, and more or fewer components can be implemented alternatively. Those skilled in the art will understand that the computer device described here is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0151] The computer device can be a desktop computer, laptop, handheld computer, or cloud server, etc. The computer device can interact with the user via a keyboard, mouse, remote control, touchpad, or voice control.

[0152] The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as the hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device 6. Of course, the memory 61 may also include both the internal storage unit and its external storage device of the computer device 6. In this embodiment, the memory 61 is typically used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions for a ticket recognition method. In addition, the memory 61 can also be used to temporarily store various types of data that have been output or will be output.

[0153] In some embodiments, the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is used to execute computer-readable instructions stored in the memory 61 or to process data, for example, to execute computer-readable instructions for the ticket recognition method.

[0154] The network interface 63 may include a wireless network interface or a wired network interface, which is typically used to establish communication connections between the computer device 6 and other electronic devices.

[0155] This application performs text recognition on the invoice image, then extracts entities from the recognized text information, and constructs a candidate set of entity pairs by combining multiple extracted named entities and layout information. By judging whether each entity pair is related, the related entity pairs are merged, thereby effectively merging highly related text information and improving the accuracy of text recognition on invoices in cases of folding, partial field wrapping, etc.

[0156] This application also provides another embodiment, namely, providing a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to cause the at least one processor to perform the steps of the ticket recognition method described above.

[0157] This application performs text recognition on the invoice image, then extracts entities from the recognized text information, and constructs a candidate set of entity pairs by combining multiple extracted named entities and layout information. By judging whether each entity pair is related, the related entity pairs are merged, thereby effectively merging highly related text information and improving the accuracy of text recognition on invoices in cases of folding, partial field wrapping, etc.

[0158] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0159] Obviously, the embodiments described above are only some embodiments of this application, not all embodiments. The accompanying drawings show preferred embodiments of this application, but do not limit the patent scope of this application. This application can be implemented in many different forms; these embodiments are provided so that the disclosure of this application can be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent substitutions for some of the technical features. Any equivalent structures made using the content of this application's specification and drawings, directly or indirectly applied to other related technical fields, are similarly within the scope of patent protection of this application.

Claims

1. An invoice text recognition method, characterized in that it comprises the following steps: obtaining a ticket image to be recognized; performing text recognition on the ticket image using a preset recognition and positioning model to obtain text information and corresponding layout information; performing named entity extraction on the text information using a preset multimodal transformer model to obtain multiple corresponding named entities; constructing, based on preset pairing rules, a candidate set of entity pairs by combining the multiple named entities and the layout information, wherein the preset pairing rules include: adding entity pairs of the same entity type that overlap in the X-axis direction to the candidate set, and adding entity pairs of different expense item entity types that are close in the Y-axis direction to the candidate set; determining, using a preset association judgment model, whether each entity pair is associated, which specifically includes: selecting any two named entities to form an entity pair; concatenating the head node multimodal information, head node entity type information, tail node multimodal information, and tail node entity type information of the entity pair to obtain input features; inputting the input features into the preset association judgment model; and outputting a classification judgment result of whether the entity pair is associated, wherein the association judgment model is a classification model based on a biaffine attention mechanism; and merging the entity pairs that are determined to be related to obtain merged text.

2. The invoice text recognition method according to claim 1, characterized in that, before the step of performing text recognition on the ticket image using a preset recognition and positioning model, the method further includes: determining whether the ticket image is deflected; and, if there is a deflection, rotating the ticket image to obtain an upright ticket image.

3. The invoice text recognition method according to claim 1 or 2, characterized in that, before the step of performing text recognition on the ticket image using a preset recognition and positioning model, the method further includes: inputting the ticket image into a preset semantic segmentation model to obtain the corresponding mask image; extracting the boundaries of the connected components of the mask image, and defining the minimum bounding rectangle region of the boundaries; and filling the area of the ticket image outside the rectangular region with white.

4. The invoice text recognition method according to claim 1 or 2, characterized in that, before the step of extracting named entities from the text information using a preset multimodal transformer model, the method further includes: determining whether the text information contains synonyms by querying a preset thesaurus; and, when it is determined that a synonym exists, replacing the text information with the synonym.

5. The invoice text recognition method according to claim 1 or 2, characterized in that the step of extracting named entities from the text information using a preset multimodal transformer model specifically includes: inputting the ticket image, the text information, and the layout information into the preset multimodal transformer model to extract named entities, thereby obtaining the multiple named entities.

6. The invoice text recognition method according to claim 1 or 2, characterized in that, after the step of merging the entity pairs that are determined to be related to obtain the merged text, the method further includes: generating, based on preset judgment rules, a target judgment result for the credibility of the merged text.

7. The invoice text recognition method according to claim 1 or 2, characterized in that, after the step of merging the entity pairs that are determined to be related, the method further includes: inputting the ticket image into the recognition and positioning model for text recognition to obtain the text confidence rate; inputting the text information into the multimodal transformer model for named entity extraction to obtain the naming confidence rate; inputting the entity pairs into the association judgment model to determine whether an association exists, and obtaining the judgment confidence rate; generating the target judgment result for the credibility of the merged text by averaging or weighted averaging at least two of the text confidence rate, the naming confidence rate, and the judgment confidence rate; and, when the target judgment result exceeds the preset threshold, outputting the merged text directly.

8. A ticket recognition device, characterized in that it includes: an acquisition module, used to acquire a ticket image to be recognized; a recognition module, used to perform text recognition on the ticket image using a preset recognition and positioning model to obtain text information and corresponding layout information; an extraction module, used to extract named entities from the text information using a preset multimodal transformer model to obtain multiple corresponding named entities; a construction module, used to construct a candidate set of entity pairs based on preset pairing rules, combining the multiple named entities and the layout information, wherein the preset pairing rules include: adding entity pairs of the same entity type that overlap in the X-axis direction to the candidate set, and adding entity pairs of different expense item entity types that are close in the Y-axis direction to the candidate set; a judgment module, used to judge whether each entity pair is associated using a preset association judgment model, which specifically includes: selecting any two named entities to form an entity pair; concatenating the head node multimodal information, head node entity type information, tail node multimodal information, and tail node entity type information of the entity pair to obtain input features; inputting the input features into the preset association judgment model; and outputting a classification judgment result of whether the entity pair is associated, wherein the association judgment model is a classification model based on a biaffine attention mechanism; and a merging module, used to merge the entity pairs that are determined to be related to obtain merged text.

9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of the invoice text recognition method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the invoice text recognition method as described in any one of claims 1 to 7.