A method for training a multi-modal information extraction model and an information extraction method

By pre-training and fine-tuning the multimodal information extraction model, and combining multimodal pre-training data and entity annotation data, the problem of poor generalization ability in document information extraction is solved, and the accuracy and speed of information extraction are improved, especially the ability to identify cross-line entities in the target domain.

CN115687643B · Active · Publication Date: 2026-06-12 · SHANGHAI HONGJI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI HONGJI INFORMATION TECH CO LTD
Filing Date: 2022-10-21
Publication Date: 2026-06-12

Abstract

Embodiments of the present application provide a method for training a multi-modal information extraction model and an information extraction method. The method comprises: pre-training a first multi-modal information extraction model according to multi-modal pre-training data of a target field to obtain a second multi-modal information extraction model, wherein the multi-modal pre-training data is obtained by labeling pre-training data, and the pre-training data is obtained by performing text extraction and text box recognition on each document in a first document set of the target field; fine-tuning the second multi-modal information extraction model according to entity labeling data of the target field to obtain a target multi-modal entity information extraction model, wherein the entity labeling data is obtained by labeling fine-tuning data, and the fine-tuning data is obtained by performing text extraction and text box recognition on each document in a second document set of the target field. The multi-modal information extraction model of some embodiments of the present application has better generalization ability.

Description

Technical Field

[0001] This application relates to the field of information extraction; specifically, the embodiments of this application relate to a method for training a multimodal information extraction model and an information extraction method.

Background Technology

[0002] In recent years, multimodal information extraction has become a research hotspot in academia. Early information extraction methods (such as information extraction in the field of credit reporting) were rule-based. These methods heavily relied on business or engineering personnel pre-determining rules for the information to be extracted, resulting in significant time and manpower costs. While this approach may have high accuracy in practical applications, its poor generalization ability makes it difficult to achieve true application-level performance.

[0003] With the development of deep learning, people have gradually begun to use natural language processing (NLP) and computer vision technologies to extract information from documents (e.g., credit reports or household registration pages). One method is anchor point detection based on fixed templates. This method has achieved good extraction results in business areas such as documents and invoices. However, this method requires the data to have the same or nearly identical layout, and it cannot achieve ideal extraction results for data with large layout variations or page distortion. Another method based on NLP uses OCR (Optical Character Recognition) or document parsing tools to extract text from the document (e.g., credit reports or household registration pages), and then uses traditional entity recognition models for information extraction. This method has good generalization for extracting some fields, such as name, address, and occupation, but its extraction effect is poor for numeric fields that do not have semantic information. This is because these fields rely on the context information of numeric information to determine their corresponding tags.

[0004] Therefore, improving the accuracy and speed of information extraction from documents (e.g., documents that include both text and tables) has become an urgent technical problem to be solved.

Summary of the Invention

[0005] The purpose of this application is to provide a method for training a multimodal information extraction model and an information extraction method. The embodiments of this application integrate multimodal features such as image features, layout features, and text features during the information extraction process, and extract information from documents (e.g., credit reports) in the target domain based on a multimodal pre-trained information model, which has better generalization ability than traditional multimodal information extraction models.

[0006] In a first aspect, embodiments of this application provide a method for training a multimodal information extraction model, the method comprising: pre-training a first multimodal information extraction model based on multimodal pre-training data of a target domain to obtain a second multimodal information extraction model, wherein the multimodal pre-training data is obtained by annotating pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in a first document set of the target domain; and fine-tuning the second multimodal information extraction model based on entity annotation data of the target domain to obtain a target multimodal entity information extraction model, wherein the entity annotation data is obtained by annotating fine-tuning data, and the fine-tuning data is obtained by extracting text and recognizing text boxes from each document in a second document set of the target domain.

[0007] Some embodiments of this application further pre-train an already pre-trained model using training data from the target domain, and then fine-tune the resulting model on fine-tuning data to obtain the entity information extraction model for the target domain, thereby improving the generalization ability of the obtained entity information extraction model.

[0008] In some embodiments, before pre-training the first multimodal information extraction model based on multimodal pre-training data of the target domain, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set; masking the image regions corresponding to any one or more segments of text in any one document, and labeling the image regions corresponding to the masked text segments as masked and the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data.

[0009] Some embodiments of this application can enable the trained model to possess text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and second modality pre-training data.

[0010] In some embodiments, before pre-training the first multimodal information extraction model based on multimodal pre-training data of the target domain, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by text extraction from any one document in the first document set, the first document set includes N documents, where N is an integer greater than 1; replacing the text files in some combinations of N pairs of text files and images with different text files or replacing the images with different images, and labeling the replaced combinations with labels indicating that the text and images are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and images are consistent, to obtain third modality pre-training data, wherein the N pairs of text files and images include N text files and images corresponding to each text file, and the text file is obtained by text extraction from one document in the first document set.

[0011] Some embodiments of this application can enable the trained model to have text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and third modality pre-training data.

[0012] In some embodiments, before pre-training the first multimodal information extraction model based on multimodal pre-training data of the target domain, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set, the first document set including N documents, where N is an integer greater than 1; masking the image regions corresponding to any one or more segments of text in any one document, and labeling the image regions corresponding to the masked text segments as masked and the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data; replacing the text files in some combinations of N pairs of text files and images with different text files or replacing the images with different images, and labeling the replaced combinations with labels indicating that the text and images are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and images are consistent, to obtain third modality pre-training data, wherein the N pairs of text files and images include N text files and images corresponding to each text file, and the text file is obtained by extracting text from a document in the first document set.

[0013] Some embodiments of this application can enable the trained model to have text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and third modality pre-training data.

[0014] In some embodiments, pre-training the first multimodal information extraction model based on multimodal pre-training data of the target domain includes: determining whether the training of the first multimodal information extraction model can be terminated based on a target loss value, wherein the target loss value is related to a first loss value obtained through the first modality pre-training data, a second loss value obtained through the second modality pre-training data, and a third loss value obtained through the third modality pre-training data.

[0015] In some embodiments of this application, the loss function value depends on the loss values of multiple tasks, which improves the generalization ability of the obtained model.

[0016] In some embodiments, the target loss value is a weighted sum of the first loss value, the second loss value, and the third loss value.
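
Written as a formula, the weighted sum described above could take the following form. The symbols are illustrative notation and do not appear in the application: $L_1$, $L_2$, and $L_3$ stand for the first, second, and third loss values, and $w_1$, $w_2$, $w_3$ for their weights, whose values the application does not specify.

```latex
L_{\text{target}} = w_1 L_1 + w_2 L_2 + w_3 L_3
```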

[0017] Some embodiments of this application provide a method for quantifying multi-task loss values, making the calculation of target loss values more objective and accurate.

[0018] In some embodiments, before fine-tuning the second multimodal information extraction model based on entity annotation data of the target domain, the method further includes: acquiring an image of any document in the second document set to obtain a target image; identifying all text from the target image to obtain a target text file, and acquiring text boxes containing each segment of text from the target image; annotating entity boxes containing entities on the target image and obtaining entity labels corresponding to the entity boxes; obtaining entity annotation data based on the entity boxes and the text boxes, wherein the entity annotation data is entity labels assigned to at least the text boxes.

[0019] In some embodiments, obtaining the entity annotation data based on the entity box and the text box includes: if the proportion of the overlapping area between the first text box and the first entity box on the corresponding image is greater than a first threshold, then the entity label corresponding to the first entity box is used as the label corresponding to the first text box.
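
The overlap rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the overlap proportion is taken to be the intersection area divided by the text box area, the threshold value 0.5 is arbitrary, and text boxes that match no entity box receive `None`. The function names and box layout are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def overlap_ratio(text_box: Box, entity_box: Box) -> float:
    """Share of the text box area covered by the entity box (assumed definition)."""
    inter_w = max(0.0, min(text_box[2], entity_box[2]) - max(text_box[0], entity_box[0]))
    inter_h = max(0.0, min(text_box[3], entity_box[3]) - max(text_box[1], entity_box[1]))
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0


def assign_labels(
    text_boxes: List[Box],
    entity_boxes: List[Tuple[Box, str]],
    threshold: float = 0.5,  # the "first threshold"; 0.5 is an arbitrary example value
) -> List[Optional[str]]:
    """Give each text box the label of the entity box it overlaps most, if above threshold."""
    labels: List[Optional[str]] = []
    for text_box in text_boxes:
        best_ratio, best_label = threshold, None
        for entity_box, entity_label in entity_boxes:
            ratio = overlap_ratio(text_box, entity_box)
            if ratio > best_ratio:  # strictly greater than the threshold
                best_ratio, best_label = ratio, entity_label
        labels.append(best_label)
    return labels
```

When several entity boxes exceed the threshold for the same text box, this sketch keeps the one with the largest overlap; the application does not say how such ties are resolved.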

[0020] Some embodiments of this application obtain entity training data by annotating entity boxes and entity labels corresponding to each text box to complete the annotation of fine-tuning data.

[0021] In some embodiments, the step of marking the entity bounding box of the entity on the target image and obtaining the entity label corresponding to the entity bounding box includes: marking one entity bounding box for an entity that spans multiple rows and assigning it an entity label.

[0022] Some embodiments of this application can improve the recognition capability for cross-line entities by annotating each entity that spans multiple rows with one entity bounding box and one entity label.

[0023] In a second aspect, some embodiments of this application provide a method for entity information extraction, the method comprising: performing entity information extraction based on an image to be extracted, a text file to be extracted, a text box to be extracted, and the target multimodal entity information extraction model, and obtaining a predicted entity information extraction result, wherein the image to be extracted is an image corresponding to the document to be extracted, the text file to be extracted includes a text sequence obtained by extracting text from the image to be extracted, the text box to be extracted is a location box on the image to be extracted where each segment of text is located, and the predicted entity information extraction result includes all target entity fragments extracted from the document to be extracted, entity labels corresponding to the target entity fragments, and entity positions.

[0024] Some embodiments of this application can extract entity information from input documents using a multimodal information extraction model obtained through training.

[0025] In some embodiments, before the entity information extraction is completed based on the image to be extracted, the text file to be extracted, the text box to be extracted, and the target multimodal entity information extraction model, the method further includes: converting the document to be extracted into an image to obtain the image to be extracted; extracting text from the image to be extracted to obtain the text file to be extracted; and identifying the regions covered by each segment of text on the image to be extracted to obtain the text box to be extracted.

[0026] Some embodiments of this application preprocess the document from which the content to be extracted is obtained, and use the text file, the text box containing the text, and the image corresponding to the document as data for the input model, thereby improving the accuracy of the obtained entity extraction results.

[0027] In some embodiments, the method further includes merging multiple target entity fragments belonging to a cross-line structure to obtain an entity object.

[0028] Some embodiments of this application improve the accuracy of entity extraction results by merging identified cross-line entities.

[0029] In some embodiments, merging multiple entity fragments belonging to different rows to obtain an entity object includes: determining that the multiple target entity fragments belong to an entity object to be extracted based at least on the entity tags and entity positions of the multiple target entity fragments.

[0030] Some embodiments of this application determine whether multiple entity fragments correspond to a single entity object by using entity tags and entity locations of multiple entity fragments.

[0031] In some embodiments, determining that the plurality of target entity fragments belong to a single entity object to be extracted based at least on the entity labels and entity positions of the plurality of target entity fragments includes: if it is confirmed that the entity labels of the plurality of target entity fragments are all the same, the entity positions of the plurality of target entity fragments are adjacent, and the merging of all entity labels corresponding to the plurality of target entity fragments satisfies a predetermined annotation specification, then it is confirmed that the plurality of target entity fragments belong to a single entity object.

[0032] Some embodiments of this application treat multiple entity fragments as one entity when they have the same entity label, occupy adjacent entity positions, and have entity labels that, once merged, satisfy a predetermined annotation specification, thereby improving the extraction capability for various cross-line entities.
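
The three merge conditions (same label, adjacent positions, merged labels satisfying the annotation specification) can be sketched as below. This is only an illustration under assumptions: BIOES is taken as the annotation specification, entity position is reduced to a hypothetical `line_index`, and adjacency means consecutive line indices. The `Fragment` class and function names are not from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    text: str
    entity_type: str   # entity label, e.g. "ADDRESS"
    tags: List[str]    # BIOES prefixes predicted inside this fragment, e.g. ["B", "I"]
    line_index: int    # row of the fragment on the page (assumed position encoding)


def is_valid_bioes(tags: List[str]) -> bool:
    """True if the tags describe exactly one entity: a lone S, or B, I..., E."""
    if tags == ["S"]:
        return True
    return (
        len(tags) >= 2
        and tags[0] == "B"
        and tags[-1] == "E"
        and all(tag == "I" for tag in tags[1:-1])
    )


def belong_to_one_entity(fragments: List[Fragment]) -> bool:
    """Check the three conditions: same label, adjacent rows, valid merged tag sequence."""
    if not fragments:
        return False
    same_label = len({f.entity_type for f in fragments}) == 1
    adjacent = all(b.line_index - a.line_index == 1 for a, b in zip(fragments, fragments[1:]))
    merged_tags = [tag for f in fragments for tag in f.tags]
    return same_label and adjacent and is_valid_bioes(merged_tags)


def merge_fragments(fragments: List[Fragment], sep: str = "") -> str:
    """Join the fragment texts into one entity object, or fail if the conditions do not hold."""
    if not belong_to_one_entity(fragments):
        raise ValueError("fragments do not form a single cross-line entity")
    return sep.join(f.text for f in fragments)
```

For example, an address whose first row ends with an `I` tag and whose second row ends with an `E` tag would merge, whereas two fragments with different labels would be left separate.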

[0033] In a third aspect, some embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can implement the methods described in any of the embodiments of the first or second aspect above.

[0034] In a fourth aspect, some embodiments of this application provide an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can implement the method as described in any embodiment of the first or second aspect.

[0035] In a fifth aspect, some embodiments of this application provide an apparatus for training a multimodal information extraction model, the apparatus comprising: a pre-training module configured to pre-train a first multimodal information extraction model based on multimodal pre-training data of a target domain to obtain a second multimodal information extraction model, wherein the multimodal pre-training data is obtained by annotating pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in a first document set of the target domain; and a fine-tuning module configured to fine-tune the second multimodal information extraction model based on entity annotation data of the target domain to obtain a target multimodal entity information extraction model, wherein the entity annotation data is obtained by annotating fine-tuning data, and the fine-tuning data is obtained by extracting text and recognizing text boxes from each document in a second document set of the target domain.

Description of the Drawings

[0036] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0037] Figure 1 is the first flowchart of the method for training a multimodal information extraction model provided in the embodiments of this application;

[0038] Figure 2 is a schematic diagram of the pre-training of the first multimodal information extraction model provided in an embodiment of this application;

[0039] Figure 3 is a schematic diagram of the fine-tuning of the second multimodal information extraction model provided in an embodiment of this application;

[0040] Figure 4 is the second flowchart of the method for training a multimodal information extraction model provided in the embodiments of this application;

[0041] Figure 5 is a schematic diagram of the process of using the target multimodal entity information extraction model to perform actual entity information extraction, as provided in the embodiments of this application;

[0042] Figure 6 is a block diagram of the components of an apparatus for training a multimodal information extraction model, as provided in an embodiment of this application;

[0043] Figure 7 is a schematic diagram of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0044] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.

[0045] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0046] With the advent of BERT, pre-trained models and fine-tuning paradigms can achieve excellent results even with limited data. The inventors of this application have discovered that how to utilize widely available documents for model pre-training and then fine-tuning the pre-trained model on target domain documents (e.g., credit reports) to further improve information extraction performance in the target domain is a problem worthy of consideration. For example, taking credit report information extraction as an example, these documents possess rich visual information; incorporating visual features based on computer vision technology will be more beneficial for information extraction from credit reports.

[0047] Some embodiments of this application propose a method for extracting credit report information that integrates visual features, layout features, and text features, based on a pre-trained multimodal information extraction model (i.e., a first multimodal information extraction model). This pre-trained multimodal information model is obtained by pre-training on a large amount of visually rich document data, thus enabling it to learn general text semantic representations and text-image alignment capabilities from documents. Then, credit report data (as document data in the target domain) is used to perform vertical domain pre-training and fine-tuning of the pre-trained multimodal information extraction model, resulting in a second multimodal information extraction model. Some embodiments of this application also propose an optimization method for extracting cross-line entity information. For example, using BIOES (a sequence labeling method in NLP) annotation, cross-line entity fragments are merged during the inference stage, thereby achieving accurate extraction of cross-line entities.

[0048] The following section first illustrates, by way of example, the process of re-pre-training and fine-tuning the pre-trained multimodal information extraction model to obtain the target multimodal entity information extraction model.

[0049] Referring to Figure 1, embodiments of this application provide a method for training a multimodal information extraction model, the method comprising:

[0050] S101, the first multimodal information extraction model is pre-trained based on the multimodal pre-training data of the target domain to obtain the second multimodal information extraction model. The multimodal pre-training data is obtained by annotating the pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in the first document set of the target domain.

[0051] It should be noted that the first multimodal information extraction model is obtained by pre-training a multimodal information extraction model using pre-training data from a general domain. For example, the first multimodal information extraction model is obtained by pre-training the multimodal information extraction model with a large amount of visually rich document data. This first multimodal information extraction model has the ability to learn general text semantic representations and text-image alignment from documents.

[0052] As shown in Figure 2, S101 includes, for example, inputting multimodal pre-training data into the first multimodal information extraction model to train the model, and obtaining the second multimodal information extraction model after training.

[0053] In some embodiments of this application, the multimodal pre-training data described in S101 includes: first modal pre-training data for mining text features and second modal pre-training data for mining text distribution features. In some embodiments of this application, the multimodal pre-training data described in S101 includes: first modal pre-training data for mining text features and third modal pre-training data for mining text distribution features. In some embodiments of this application, the multimodal pre-training data described in S101 includes: first modal pre-training data for mining text features, second modal pre-training data for mining text distribution features, and third modal pre-training data.

[0054] The following example illustrates a method for obtaining multimodal pre-training data.

[0055] For example, in some embodiments of this application, before S101, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set; masking the image regions corresponding to any one or more segments of text in any one document, and labeling the image regions corresponding to the masked text segments as masked and the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data.

[0056] Some embodiments of this application can enable the trained model to possess text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and second modality pre-training data.
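
Construction of the first modality pre-training data (masked text with the original text as the label) might look like the sketch below. The `[MASK]` placeholder, the per-segment masking probability, and the function name are assumptions made for illustration; the application only states that one or more text segments are masked and that the masked text serves as the label.

```python
import random
from typing import Dict, List, Optional

MASK = "[MASK]"  # placeholder token; the application does not name one


def build_text_mask_sample(
    segments: List[str],
    mask_prob: float,
    rng: random.Random,
) -> Dict[str, list]:
    """Mask text segments at random and keep the original text of each masked segment as its label."""
    masked_input: List[str] = []
    labels: List[Optional[str]] = []
    for segment in segments:
        if rng.random() < mask_prob:
            masked_input.append(MASK)
            labels.append(segment)   # the masked text becomes the label
        else:
            masked_input.append(segment)
            labels.append(None)      # nothing to predict for unmasked segments
    return {"input": masked_input, "labels": labels}
```

Passing an explicit `random.Random` instance keeps the masking reproducible across runs, which is a convenience of this sketch rather than a requirement of the method.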

[0057] For example, in some embodiments of this application, before S101, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set, the first document set includes N documents, where N is an integer greater than 1; replacing the text files in some combinations of N pairs of text files and images with different text files or replacing the images with different images, and labeling the replaced combinations with labels indicating that the text and images are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and images are consistent, to obtain third modality pre-training data, wherein the N pairs of text files and images include N text files and images corresponding to each text file, and the text file is obtained by extracting text from one document in the first document set.

[0058] Some embodiments of this application can enable the trained model to have text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and third modality pre-training data.
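
The third modality pre-training data (text-image consistency labels) could be built roughly as follows. This is a hedged sketch: the replacement probability, the even choice between swapping the text and swapping the image, and the string labels "consistent" and "inconsistent" are all assumptions, as is the function name.

```python
import random
from typing import List, Tuple


def build_consistency_pairs(
    texts: List[str],
    images: List[str],
    replace_prob: float,
    rng: random.Random,
) -> List[Tuple[str, str, str]]:
    """Return (text, image, label) triples; replaced pairs are labeled 'inconsistent'."""
    n = len(texts)
    if n < 2 or n != len(images):
        raise ValueError("need N > 1 aligned text files and images")
    samples = []
    for i in range(n):
        text, image, label = texts[i], images[i], "consistent"
        if rng.random() < replace_prob:
            j = rng.choice([k for k in range(n) if k != i])  # a different document
            if rng.random() < 0.5:
                text = texts[j]    # replace the text file
            else:
                image = images[j]  # replace the image
            label = "inconsistent"
        samples.append((text, image, label))
    return samples
```

Here images are represented by identifiers such as file paths so the example stays self-contained; a real pipeline would carry pixel data or image features instead.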

[0059] For example, in some embodiments of this application, before S101, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set, the first document set includes N documents, where N is an integer greater than 1; masking the image regions corresponding to any one or more segments of text in any one document, and labeling the image regions corresponding to the masked text segments as masked, and labeling the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data; replacing the text files in some combinations of N pairs of text file and image combinations with different text files or replacing the images with different images, and labeling the replaced combinations with labels indicating that the text and images are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and images are consistent, to obtain third modality pre-training data, wherein the N pairs of text file and image combinations include N text files and images corresponding to each text file, and the text file is obtained by extracting text from a document in the first document set.

[0060] Some embodiments of this application can enable the trained model to have text learning capabilities and document layout learning capabilities in the target domain by constructing first modality pre-training data and third modality pre-training data.

[0061] It is easy to understand that if the multimodal pre-training data includes first modality pre-training data, second modality pre-training data, and third modality pre-training data, the pre-training effect of the first multimodal information extraction model is better than that of other embodiments.

[0062] It should be noted that, in order to obtain the aforementioned multimodal pre-training data, each document in the first document set needs to be converted into an image first, and text extraction is performed from the converted images to obtain the text file corresponding to the document (the extracted text sequence). Then, the position of each text segment is identified from the converted images to obtain the corresponding text box. The first modality pre-training data involves masking a certain text segment in the text sequence and using the masked text segment as the label of the masked part. The second modality pre-training data involves masking the image region corresponding to a certain text segment (this region is determined based on the text box).

[0063] To determine whether the pre-training process of the first multimodal information extraction model can be terminated, multiple loss values need to be obtained using the labeled pre-training data for each modality.

[0064] In some embodiments of this application, step S101 includes: determining whether training of the first multimodal information extraction model can be terminated based on a target loss value, wherein the target loss value is related to a first loss value obtained through the first modality pre-training data, a second loss value obtained through the second modality pre-training data, and a third loss value obtained through the third modality pre-training data. In some embodiments of this application, the target loss value thus depends on the loss values of multiple pre-training tasks, which improves the generalization ability of the obtained model.

[0065] For example, in some embodiments of this application, the target loss value is a weighted sum of the first loss value, the second loss value, and the third loss value. Some embodiments of this application provide a method for quantifying multi-task loss values, making the calculation of the target loss value more objective and accurate.
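
The weighted sum described above can be sketched as follows. This is a minimal illustration only: the weight values and the threshold-based stopping rule are assumptions, since the source does not specify either.

```python
def target_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-modality pre-training loss values."""
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss


def should_stop(loss_value, threshold=0.05):
    """Hypothetical termination rule: stop once the target loss is below a threshold."""
    return loss_value < threshold


# Illustrative loss values and weights (not taken from the source).
loss = target_loss(0.9, 0.4, 0.2, weights=(0.5, 0.3, 0.2))
stop = should_stop(loss)
```

With all three weights equal to 1, the target loss reduces to a plain sum of the three loss values.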

[0066] The training process above is illustrated below with reference to Figure 4, taking a credit report as an example of a document in the target domain.

[0067] S111: Collect an appropriate amount of credit report documents and obtain pre-training data and fine-tuning data based on the credit report documents.

[0068] Collect a suitable amount of credit report documents, convert the documents into images, and use OCR or document parsing tools to extract the text from the credit report documents (to obtain text files corresponding to the text sequence) and the rectangles corresponding to the text (i.e., the text boxes corresponding to the recognized text). Then divide them into vertical domain pre-training data and fine-tuning data with a division ratio of 95:5, denoted as data A and data B, respectively.
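
A minimal sketch of the 95:5 division into data A and data B follows. The random shuffle, the fixed seed, and the function name `split_documents` are assumptions for illustration; the source only states the division ratio.

```python
import random


def split_documents(doc_ids, pretrain_ratio=0.95, seed=0):
    """Shuffle document ids and split them into data A (pre-training) and data B (fine-tuning)."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # assumption: the split is random
    cut = int(round(len(ids) * pretrain_ratio))
    return ids[:cut], ids[cut:]


# 200 hypothetical document ids -> 190 for pre-training, 10 for fine-tuning.
data_a, data_b = split_documents(range(200))
```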

[0069] S112: Label the pre-training data to obtain multimodal pre-training data.

[0070] The following three annotation operations are performed on data A to obtain the multimodal pre-training data. First, the text in each text file included in data A (i.e., the text sequence obtained by converting a credit report document into an image and then extracting text from the image) is randomly masked, and the masked text is used as the label, yielding the first modality pre-training data. Second, a segment of text is randomly selected from each text file, and the corresponding image region (determined by the coordinates of the text box) is masked; specifically, the pixel values of the region are set to 0 and the label of that text is set to "masked", while text whose image region is not masked is labeled "unmasked". Third, the text corresponding to each credit report document and the image converted from that document are paired one-to-one; then, with a certain probability, the image in a text-image pair is replaced with another arbitrary image (or the text file is replaced with another arbitrary text file), and the consistency of the text-image pair is used as the label. The data constructed in this way is denoted as data C (i.e., the multimodal pre-training data).
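
The three annotation operations can be sketched as below. This is an illustration under stated assumptions, not the applicant's implementation: images are represented as nested lists of pixel values, boxes as (x0, y0, x1, y1) tuples, and the names `mask_text`, `mask_image_region`, `make_pairs`, the `[MASK]` token, and the replacement probability are all hypothetical.

```python
import random


def mask_text(segments, rng, mask_token="[MASK]"):
    """Operation 1: mask one randomly chosen text segment; the original text is the label."""
    idx = rng.randrange(len(segments))
    masked = list(segments)
    label = masked[idx]
    masked[idx] = mask_token
    return masked, idx, label


def mask_image_region(image, boxes, rng):
    """Operation 2: set the pixels under one text box to 0 and label every segment."""
    idx = rng.randrange(len(boxes))
    x0, y0, x1, y1 = boxes[idx]
    covered = [row[:] for row in image]  # copy so the input image is unchanged
    for y in range(y0, y1):
        for x in range(x0, x1):
            covered[y][x] = 0
    labels = ["masked" if i == idx else "unmasked" for i in range(len(boxes))]
    return covered, labels


def make_pairs(texts, images, rng, replace_prob=0.5):
    """Operation 3: with some probability, pair a text with another document's image."""
    pairs = []
    for i, text in enumerate(texts):
        if len(images) > 1 and rng.random() < replace_prob:
            j = rng.choice([k for k in range(len(images)) if k != i])
            pairs.append((text, images[j], "inconsistent"))
        else:
            pairs.append((text, images[i], "consistent"))
    return pairs


rng = random.Random(0)
segments = ["Name", "Gender", "Address"]
boxes = [(0, 0, 2, 1), (2, 0, 4, 1), (0, 1, 4, 2)]
image = [[255] * 4 for _ in range(2)]  # toy 4x2 image

masked, idx, label = mask_text(segments, rng)
covered, labels = mask_image_region(image, boxes, rng)
pairs = make_pairs(["t0", "t1", "t2"], ["img0", "img1", "img2"], rng)
```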

[0071] S113: Starting from the open-source LayoutXLM pre-trained language model (a specific example of the first multimodal information extraction model; it is understood that this model can also be replaced by DiT or StrucTexT), data C is used to perform vertical-domain pre-training of LayoutXLM, and model A is output (a specific example of the second multimodal information extraction model).

[0072] S102: The second multimodal information extraction model is fine-tuned based on the entity annotation data of the target domain to obtain the target multimodal entity information extraction model. The entity annotation data is obtained by annotating the fine-tuning data, and the fine-tuning data is obtained by extracting text and recognizing text boxes from each document in the second document set of the target domain.

[0073] As shown in Figure 3, S102 includes inputting entity annotation data into the second multimodal information extraction model, fine-tuning the model, and obtaining the target multimodal entity information extraction model after fine-tuning.

[0074] It is understandable that in order to fine-tune the model, it is necessary to first obtain entity annotation data.

[0075] In some embodiments of this application, prior to S102, the method further includes: acquiring an image of any document in the second document set to obtain a target image; identifying all text from the target image to obtain a target text file, and acquiring text boxes containing each segment of text from the target image; annotating entity boxes containing entities on the target image and obtaining entity labels corresponding to the entity boxes; and obtaining entity annotation data based on the entity boxes and the text boxes, wherein the entity annotation data is entity labels assigned to at least the text boxes. For example, obtaining the entity annotation data based on the entity boxes and the text boxes includes: if the proportion of the overlapping area between the first text box and the first entity box on the corresponding image is greater than a first threshold, then the entity label corresponding to the first entity box is used as the label corresponding to the first text box.

[0076] In other words, in some embodiments of this application, the process of obtaining entity annotation data includes, for example: obtaining text boxes containing each segment of text from a text file corresponding to any document in the second document set, wherein the text file is obtained by extracting text from an image corresponding to any document in the second document set; annotating the entity boxes containing entities and the entity labels corresponding to the entity boxes in the text file; if the proportion of the overlapping area between the first text box and the first entity box on the corresponding image is greater than a first threshold, then the entity label corresponding to the first entity box is used as the label corresponding to the first text box; repeating the above process to annotate entity boxes for each document in the second document set and confirming the annotation labels of each text box to obtain the entity annotation data. Some embodiments of this application obtain entity training data by annotating entity boxes and the entity labels corresponding to each text box to complete the annotation of fine-tuning data.

[0077] It should be noted that, in order to improve the ability to recognize entities spanning multiple rows, in some embodiments of this application, the step of marking the entity bounding box of the entity on the target image and obtaining the entity label corresponding to the entity bounding box includes: marking an entity bounding box and assigning an entity label for one entity spanning multiple rows. Some embodiments of this application can improve the ability to recognize entities spanning multiple rows by marking an entity bounding box and assigning an entity label.

[0078] The process above is illustrated below with reference to Figure 4, taking model fine-tuning for credit reports as an example.

[0079] As shown in Figure 4, S102 includes, for example:

[0080] S114: Construct the fine-tuning training data required for model fine-tuning (as a specific example of entity annotation data).

[0081] To build the training data needed for model fine-tuning, data B is labeled with entities using manual annotation (or machine annotation, etc.), and the annotation results are exported. This data is denoted as data D (as an example of entity annotation data).

[0082] In step S115, the fine-tuning training data is divided into a training set and a validation set. The training set is used to further fine-tune the model, and the validation set is used to verify the model's performance. In other words, the data D from S114 is divided into training and validation sets. Based on model A from S113, the training set is used to further fine-tune the model, and the validation set is used to verify the model's performance. The optimal model is selected as the final credit report information extraction model, resulting in the target multimodal entity information extraction model.

[0083] It is understandable that S114 above is a step following S113.

[0084] Some embodiments of this application pre-train the pre-trained model again using training data from the target domain, and then fine-tune the model obtained from this re-pre-training based on fine-tuning data to obtain the entity information extraction model for the target domain, thereby improving the generalization ability of the obtained entity information extraction model.

[0085] The following example illustrates how the above-mentioned target multimodal entity information extraction model can complete the actual entity information extraction.

[0086] Some embodiments of this application provide a method for entity information extraction. The method includes: extracting entity information based on an image to be extracted, a text file to be extracted, text boxes to be extracted, and a target multimodal entity information extraction model to obtain a predicted entity information extraction result. The image to be extracted is an image corresponding to the document to be extracted; the text file to be extracted includes a text sequence obtained by extracting text from the image to be extracted; the text boxes to be extracted are the location boxes of each text segment determined on the image to be extracted; and the predicted entity information extraction result includes all target entity fragments extracted from the document to be extracted, entity labels corresponding to the target entity fragments, and entity positions. Some embodiments of this application can extract entity information from an input document using a trained multimodal information extraction model.

[0087] As shown in Figure 5, the document to be extracted (e.g., a credit report) is obtained; the document to be extracted is input into the preprocessing module for preprocessing to obtain a text sequence (corresponding to the text file to be extracted), a text box containing the text (corresponding to the text box to be extracted), and an image (corresponding to the image to be extracted); then the text sequence, the text box containing the text, and the image are input into the target multimodal entity information extraction model to obtain the predicted entity information extraction result.

[0088] The following example illustrates how to preprocess the document to be extracted.

[0089] In some embodiments of this application, before completing entity information extraction based on the image to be extracted, the text file to be extracted, the text box to be extracted, and the target multimodal entity information extraction model, the method further includes: converting the document to be extracted into an image to obtain the image to be extracted; extracting text from the image to obtain the text file to be extracted; and identifying the regions covered by each segment of text on the image to obtain the text box to be extracted. Some embodiments of this application preprocess the document from which content needs to be extracted to obtain the text file, the text box containing the text, and the image corresponding to the document as input data to the model, thereby improving the accuracy of the obtained entity extraction results.

[0090] To improve the ability to extract cross-line entities, some embodiments of this application also include a step of cross-line entity processing on the prediction results obtained by the target multimodal entity information extraction model.

[0091] As shown in Figure 5, the cross-line entity processing module performs cross-line merging on the prediction results obtained by the target multimodal entity information extraction model to obtain all extracted entities of interest.

[0092] In other words, in some embodiments of this application, the entity information extraction method further includes merging multiple target entity fragments belonging to different rows to obtain an entity object. Some embodiments of this application improve the accuracy of the entity extraction results by merging identified cross-row entities.

[0093] For example, in some embodiments of this application, merging multiple target entity fragments belonging to different rows to obtain a single entity object includes: determining, at least based on the entity tags and entity positions of the multiple target entity fragments, that the multiple target entity fragments belong to a single entity object to be extracted. Some embodiments of this application determine whether multiple entity fragments correspond to a single entity object by using the entity tags and entity positions of the multiple entity fragments.

[0094] For example, in some embodiments of this application, determining that the multiple target entity fragments belong to a single entity object to be extracted based at least on their entity labels and entity positions includes: if it is confirmed that the entity labels of the multiple target entity fragments are all the same, the entity positions of the multiple target entity fragments (i.e., the positions of the boxes corresponding to the determined entity fragments) are adjacent, and merging all entity labels corresponding to the multiple target entity fragments satisfies a predetermined annotation specification, then it is confirmed that the multiple target entity fragments belong to a single entity object. Some embodiments of this application classify multiple entities with the same labels, adjacent entity positions, and entity labels merged to meet a certain annotation specification as a single entity, thus improving the extraction capability for various cross-line entities.

[0095] The following example uses a document from the target domain, namely a credit report, to illustrate the entire process of training and entity extraction.

[0096] In the field of credit reporting, data presents rich semantic, layout, and visual information. Therefore, using a pre-trained model that integrates these modalities can better model the information extraction task in credit reporting. The LayoutXLM model models the semantic, layout, and visual information of the data, making it very suitable for application in the information extraction task of credit reporting. LayoutXLM is a multilingual, multimodal pre-trained language model that has been pre-trained on a large number of publicly available visually rich document datasets and has achieved good results in information extraction from tabular or visually rich document data. However, since the dataset used for pre-training this model differs significantly from credit reporting data, directly using the model for fine-tuning as in related technologies does not yield ideal results. Therefore, some embodiments of this application require vertical domain pre-training on the pre-trained model using credit reporting data before fine-tuning. The following exemplarily illustrates the process of re-pre-training and fine-tuning in some embodiments of this application.

[0097] First, it is necessary to collect as much credit report data as possible. In view of the accuracy of model evaluation during the fine-tuning stage, the collected dataset needs to be divided into vertical domain pre-training data and fine-tuning data in a ratio of 95:5, which are denoted as data A and data B respectively (refer to step S111 above).

[0098] Secondly, data A is further processed by randomly masking the text in data A and using the masked text as the label for the pre-training target. Text is randomly selected, and the corresponding image region is covered by setting the pixel values of the region to 0. The label of that text is then set to "covered". For text whose image region is not covered, the label is set to "uncovered". The images of some text-image pairs are replaced with other images, and the consistency between the text and the image is used as the label. The data constructed in this way is denoted as data C.

[0099] Next, the LayoutXLM model is pre-trained in the vertical domain using data C. The model loss is the sum of the losses of each subtask in the pre-training (that is, the sum of the loss values obtained from the pre-training data of the three modalities respectively). The total number of training steps is 50,000.

[0100] Next, data B is manually labeled, and some fields in the credit report are labeled according to business needs. For example, labelme (a labeling tool) is used to label entity regions (obtain entity boxes) on the image data converted from the credit report. For entities that span multiple rows, only one label box is used (i.e., label one entity box) to label the entities that span multiple rows. This labeled data is denoted as data D.

[0101] To obtain the bounding box (text box) information of the text in the data, it is necessary to extract the text from the credit report document. For parsable credit report documents, document parsing tools are used directly to obtain the text and its bounding box information. For unparsable credit report documents, OCR is used to recognize the unparsable document and extract the text and its bounding box information. Since the width and height are proportionally expanded during the document-to-image conversion process, it is necessary to convert the bounding box (entity box) information of data D and the extracted text bounding boxes to the same scaling ratio according to the expansion ratio.
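
The conversion of boxes to a common scale can be sketched as below. The page and image sizes are hypothetical, and the function name `scale_box` is an assumption; the source only states that boxes must be converted according to the expansion ratio.

```python
def scale_box(box, scale_x, scale_y):
    """Scale an (x0, y0, x1, y1) box from document coordinates to image coordinates."""
    x0, y0, x1, y1 = box
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# Hypothetical sizes: a 595 x 842 page rendered as a 1190 x 1684 pixel image.
page_w, page_h = 595, 842
img_w, img_h = 1190, 1684
sx, sy = img_w / page_w, img_h / page_h

# A text box from the document parser, expressed in page coordinates.
text_box_px = scale_box((100, 200, 300, 220), sx, sy)
```

After this step, text boxes and entity boxes annotated on the image share one coordinate system and can be compared directly.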

[0102] The extracted text bounding boxes (i.e., text boxes) are matched with the bounding box information (i.e., entity boxes) of data D. Specifically, the overlap factor between the text bounding box and the bounding box of data D is calculated. If the overlap factor is greater than 0.5, the entity label of data D is used as the text label. For entities that span multiple lines, the overlap factor between the text bounding box and the bounding box of data D may be less than 0.5, but if the text bounding box is completely enclosed within the bounding box of data D, the label of data D is still used as the text label. Based on the above method, a label is assigned to each extracted text (for text that corresponds to an entity, this is an entity label). The type of entity label can be: address, name, gender, etc.
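
The matching rule above can be sketched as follows. The source does not define the overlap factor precisely; this sketch assumes intersection over union, which is consistent with an enclosed text box scoring below 0.5 against a large multi-line entity box. All function names and the "O" fallback label are assumptions.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_factor(text_box, entity_box):
    """Assumed definition: intersection over union of the two boxes."""
    inter_box = (max(text_box[0], entity_box[0]), max(text_box[1], entity_box[1]),
                 min(text_box[2], entity_box[2]), min(text_box[3], entity_box[3]))
    inter = area(inter_box)
    union = area(text_box) + area(entity_box) - inter
    return inter / union if union else 0.0


def is_enclosed(text_box, entity_box):
    """True if the text box lies completely inside the entity box."""
    return (text_box[0] >= entity_box[0] and text_box[1] >= entity_box[1]
            and text_box[2] <= entity_box[2] and text_box[3] <= entity_box[3])


def assign_label(text_box, entity_boxes, threshold=0.5):
    """Return the label of the first matching entity box, or "O" if none matches."""
    for entity_box, label in entity_boxes:
        if overlap_factor(text_box, entity_box) > threshold or is_enclosed(text_box, entity_box):
            return label
    return "O"


entities = [((0, 0, 100, 60), "ADDRESS"), ((120, 0, 160, 20), "GENDER")]
label_enclosed = assign_label((0, 0, 40, 20), entities)       # enclosed, low overlap
label_overlap = assign_label((118, 0, 160, 20), entities)     # not enclosed, high overlap
label_none = assign_label((200, 200, 240, 220), entities)     # no match
```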

[0103] Here, the BIOES scheme is used to tag entities. For an entity that spans multiple lines, such as "Shanghai Gender\nPudong Male\nSanlin Town", the entity spans three lines in the document, and its tag sequence is "B-ADDRESS I-ADDRESS O O\n I-ADDRESS I-ADDRESS O\n I-ADDRESS I-ADDRESS E-ADDRESS". This constructed data is denoted as data D. It should be noted that this tag sequence is the entity tag corresponding to the multi-line entity "Shanghai Pudong Sanlin Town". In this tag sequence, O represents non-address information (e.g., text such as "Male" or "Gender").
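
The tagging of a multi-line entity can be sketched as below, assuming one tag per token and assuming the per-token entity membership flags are already known (here they are chosen to reproduce the tag sequence in the example above). The function name `bioes_tags` is hypothetical.

```python
def bioes_tags(in_entity, label):
    """Tag a token sequence given a per-token flag of entity membership.

    All flagged tokens are treated as one entity, even when they are
    interrupted by unrelated tokens, as happens when an entity wraps lines.
    """
    positions = [i for i, flag in enumerate(in_entity) if flag]
    tags = ["O"] * len(in_entity)
    if not positions:
        return tags
    if len(positions) == 1:
        tags[positions[0]] = "S-" + label  # single-token entity
        return tags
    for i in positions:
        tags[i] = "I-" + label
    tags[positions[0]] = "B-" + label      # first token of the entity
    tags[positions[-1]] = "E-" + label     # last token of the entity
    return tags


# Three lines of tokens; 1 marks address tokens, 0 marks other text.
lines = [[1, 1, 0, 0], [1, 1, 0], [1, 1, 1]]
flags = [flag for line in lines for flag in line]
tags = bioes_tags(flags, "ADDRESS")
```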

[0104] Then, data D is divided into a training set and a validation set. The pre-trained model is fine-tuned based on the training set, specifically by predicting the label for each position in the input text sequence. After fine-tuning, the model checkpoint with the highest F1 score on the validation set is selected and saved as the final application model.

[0105] It should be noted that during the model prediction phase, since entities spanning multiple lines are distributed across different segments of the input text sequence, the system determines whether entity segments belong to the same entity based on the predicted entity's label and similar positions. Specifically, if they are located at similar y-coordinates (as an example of whether entity positions are adjacent), have the same entity label, and the merged entity segments meet the BIOES annotation specifications (as an example of a predefined annotation specification), then these entity segments can be considered to belong to the same entity and need to be extracted and merged for output.
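
The merging rule described above can be sketched as follows. This is an illustration under assumptions: fragments arrive in reading order, adjacency is approximated by the vertical gap between consecutive fragment boxes with a hypothetical threshold `max_gap`, and groups whose merged tags do not form a valid BIOES entity are simply dropped.

```python
def valid_bioes(tags):
    """True if the tags form exactly one entity: S alone, or B, then I..., then E."""
    if not tags:
        return False
    if len(tags) == 1:
        return tags[0] == "S"
    return tags[0] == "B" and tags[-1] == "E" and all(t == "I" for t in tags[1:-1])


def merge_fragments(fragments, max_gap=10):
    """Group consecutive fragments with the same label and nearby boxes, then merge them."""
    groups, current = [], []
    for frag in fragments:
        if current:
            prev = current[-1]
            same_label = frag["label"] == prev["label"]
            # Gap between the top of this box and the bottom of the previous one.
            adjacent = abs(frag["box"][1] - prev["box"][3]) <= max_gap
            if not (same_label and adjacent):
                groups.append(current)
                current = []
        current.append(frag)
    if current:
        groups.append(current)

    merged = []
    for group in groups:
        tags = [t for frag in group for t in frag["tags"]]
        if valid_bioes(tags):  # predefined annotation specification
            merged.append({"label": group[0]["label"],
                           "text": " ".join(frag["text"] for frag in group)})
    return merged


fragments = [
    {"text": "Shanghai", "label": "ADDRESS", "tags": ["B", "I"], "box": (0, 0, 40, 20)},
    {"text": "Pudong", "label": "ADDRESS", "tags": ["I", "I"], "box": (0, 22, 40, 42)},
    {"text": "Sanlin Town", "label": "ADDRESS", "tags": ["I", "I", "E"], "box": (0, 44, 60, 64)},
    {"text": "Male", "label": "GENDER", "tags": ["S"], "box": (80, 22, 100, 42)},
]
merged = merge_fragments(fragments)
```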

[0106] Please refer to Figure 6, which shows the apparatus for training a multimodal information extraction model provided in the embodiments of this application. It should be understood that this apparatus corresponds to the method embodiment of Figure 1 described above and can execute the various steps involved in that method embodiment. The specific functions of the apparatus can be found in the description above; to avoid repetition, detailed descriptions are appropriately omitted here. The apparatus includes at least one software function module that can be stored in the memory or embedded in the apparatus's operating system in the form of software or firmware. The apparatus for training a multimodal information extraction model includes: a pre-training module 601 and a fine-tuning module 602.

[0107] The pre-training module 601 is configured to pre-train a first multimodal information extraction model based on multimodal pre-training data of the target domain to obtain a second multimodal information extraction model. The multimodal pre-training data is obtained by annotating pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in the first document set of the target domain.

[0108] The fine-tuning module 602 is configured to fine-tune the second multimodal information extraction model based on the entity annotation data of the target domain to obtain a target multimodal entity information extraction model. The entity annotation data is obtained by annotating the fine-tuning data, and the fine-tuning data is obtained by extracting text and recognizing text boxes from each document in the second document set of the target domain.

[0109] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working process of the device described above can be referred to the corresponding process in the aforementioned method, and will not be elaborated further here.

[0110] Some embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can implement the method described in any of the embodiments of the above-described method for training a multimodal information extraction model or the above-described method for entity information extraction.

[0111] As shown in Figure 7, some embodiments of this application provide an electronic device 700, including a memory 710, a processor 720, and a computer program stored in the memory 710 and executable on the processor 720. When the processor 720 reads the program from the memory 710 via a bus 730 and executes the program, it can implement the method described in any embodiment of the above-described method for training a multimodal information extraction model or the above-described method for entity information extraction.

[0112] Processor 720 can process digital signals and can include various computing architectures. For example, it can be a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture that implements multiple instruction set combinations. In some examples, processor 720 can be a microprocessor.

[0113] The memory 710 can be used to store instructions executed by the processor 720 or data related to the execution of instructions. These instructions and/or data may include code used to implement some or all of the functions of one or more modules described in the embodiments of this application. The processor 720 of the embodiments of this disclosure can be used to execute the instructions in the memory 710 to implement the method shown in Figure 1. Memory 710 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory well known to those skilled in the art.

[0114] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0115] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0116] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0117] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application. It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.

[0118] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

[0119] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

Claims

1. A method for training a multimodal information extraction model, characterized in that the method includes: before pre-training a first multimodal information extraction model based on multimodal pre-training data of a target domain, masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one document in a first document set, and the first document set includes N documents, where N is an integer greater than 1; replacing the text files in some of N text file and image combinations with different text files, or replacing the images with different images, labeling the replaced combinations with labels indicating that the text and image are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and image are consistent, to obtain third modality pre-training data, wherein the N text file and image combinations include N text files and the image corresponding to each text file, and the text files are obtained by extracting text from documents in the first document set; pre-training the first multimodal information extraction model based on the multimodal pre-training data of the target domain to obtain a second multimodal information extraction model, wherein the multimodal pre-training data is obtained by annotating pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in the first document set of the target domain; and fine-tuning the second multimodal information extraction model based on entity annotation data of the target domain to obtain a target multimodal entity information extraction model, wherein the entity annotation data is obtained by annotating fine-tuning data, and the fine-tuning data is obtained by extracting text and recognizing text boxes from each document in a second document set of the target domain.

2. The method as described in claim 1, characterized in that, before pre-training the first multimodal information extraction model based on the multimodal pre-training data of the target domain, the method further includes: masking any one or more segments of text in a text file corresponding to any one document, and using the masked text as a label to construct the first modality pre-training data, wherein the text file is obtained by extracting text from any one document in the first document set; and masking the image regions corresponding to any one or more text segments in any document, labeling the image regions corresponding to the masked text segments as masked, and labeling the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data.

3. The method as described in claim 1, characterized in that, before pre-training the first multimodal information extraction model based on the multimodal pre-training data of the target domain, the method further includes: masking any one or more segments of text in a first text file corresponding to any one document, and using the masked text as a label to construct the first modality pre-training data, wherein the first text file is obtained by extracting text from any one document in the first document set, and the first document set includes N documents, where N is an integer greater than 1; masking the image regions corresponding to any one or more text segments in any document, labeling the image regions corresponding to the masked text segments as masked, and labeling the image regions corresponding to the unmasked text segments as unmasked, to obtain second modality pre-training data; and replacing the text files in some of N text file and image combinations with different text files, or replacing the images with different images, labeling the replaced combinations with labels indicating that the text and image are inconsistent, and labeling the unreplaced combinations with labels indicating that the text and image are consistent, to obtain the third modality pre-training data, wherein the N text file and image combinations include N text files and the image corresponding to each text file, and the text files are obtained by extracting text from documents in the first document set.

4. The method as described in claim 3, characterized in that the step of pre-training the first multimodal information extraction model based on the multimodal pre-training data of the target domain includes: determining, based on a target loss value, that the training of the first multimodal information extraction model can be terminated, wherein the target loss value is related to a first loss value obtained through the first modality pre-training data, a second loss value obtained through the second modality pre-training data, and a third loss value obtained through the third modality pre-training data.

5. The method as described in claim 4, characterized in that the target loss value is a weighted sum of the first loss value, the second loss value, and the third loss value.
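
The weighted sum of claim 5 and the loss-based termination of claim 4 can be sketched as below. The weight values, the threshold-based stopping rule, and the names `target_loss` and `should_stop` are hypothetical; the claims state only that the target loss is a weighted sum and that termination is decided based on it.

```python
def target_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-task pre-training loss values."""
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss


def should_stop(loss_value, threshold=0.05):
    """Assumed stopping rule: stop once the target loss is at or below a threshold."""
    return loss_value <= threshold
```

For example, with losses 0.5, 0.25, and 0.25 and weights (1.0, 2.0, 4.0), the target loss is 0.5 + 0.5 + 1.0 = 2.0, so training under this assumed rule would continue.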

6. The method according to any one of claims 1-5, characterized in that, before fine-tuning the second multimodal information extraction model based on the entity annotation data of the target domain, the method further includes: obtaining an image of any document in the second document set as a target image; recognizing all text in the target image to obtain a target text file, and obtaining from the target image the text box containing each segment of text; marking an entity box for an entity on the target image and obtaining the entity label corresponding to the entity box; and obtaining the entity annotation data based on the entity box and the text box, wherein the entity annotation data comprises at least the entity label assigned to the text box.

7. The method as described in claim 6, characterized in that the step of obtaining the entity annotation data based on the entity box and the text box includes: if the percentage of the overlapping area between a first text box and a first entity box on the corresponding image is greater than a first threshold, using the entity label corresponding to the first entity box as the label of the first text box.
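
The overlap test of this claim can be sketched as follows. Several details are assumptions, since the claim does not fix them: boxes are `(x0, y0, x1, y1)` tuples, the overlap percentage is measured relative to the text box area, the first threshold is 0.5, and text boxes that match no entity box receive a hypothetical default label `"O"`.

```python
def overlap_ratio(text_box, entity_box):
    """Share of the text box area that lies inside the entity box."""
    tx0, ty0, tx1, ty1 = text_box
    ex0, ey0, ex1, ey1 = entity_box
    width = max(0, min(tx1, ex1) - max(tx0, ex0))
    height = max(0, min(ty1, ey1) - max(ty0, ey0))
    area = (tx1 - tx0) * (ty1 - ty0)
    return (width * height) / area if area > 0 else 0.0


def assign_labels(text_boxes, entity_boxes, threshold=0.5, default="O"):
    """Give each text box the label of the first entity box it overlaps enough.

    entity_boxes: list of (box, entity_label) pairs.
    """
    labels = []
    for text_box in text_boxes:
        label = default
        for entity_box, entity_label in entity_boxes:
            if overlap_ratio(text_box, entity_box) > threshold:
                label = entity_label
                break
        labels.append(label)
    return labels
```

Under these assumptions, one entity box drawn over an address that occupies two lines labels both of the underlying text boxes, which is how a multi-line entity box propagates its label to line-level text boxes.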

8. The method as described in claim 6, characterized in that the step of marking the entity box of the entity on the target image and obtaining the entity label corresponding to the entity box includes: for an entity that spans multiple lines, marking the entity with an entity box and assigning it an entity label.

9. A method for entity information extraction, characterized in that the method includes: performing entity information extraction based on an image to be extracted, a text file to be extracted, text boxes to be extracted, and a target multimodal entity information extraction model to obtain predicted entity information extraction results, wherein the image to be extracted is the image corresponding to a document to be extracted; the text file to be extracted includes a text sequence obtained by extracting text from the image to be extracted; the text boxes to be extracted are the location boxes of each text segment determined on the image to be extracted; and the predicted entity information extraction results include all target entity fragments extracted from the document to be extracted, the entity labels corresponding to the target entity fragments, and the entity positions. Any one or more segments of text in a text file corresponding to any one document are covered, and the covered text is used as a label to construct first modality pre-training data, wherein the text file is obtained by extracting text from any one of the documents in a first document set, and the first document set includes N documents, where N is an integer greater than 1. In some of N pairs of text files and images, the text file is replaced with a different text file or the image is replaced with a different image; the replaced pairs are labeled as having inconsistent text and image, and the unreplaced pairs are labeled as having consistent text and image, thereby obtaining third modality pre-training data, wherein the N pairs of text files and images include N text files and the image corresponding to each text file, and the text files are obtained by extracting text from documents in the first document set.
A first multimodal information extraction model is pre-trained based on multimodal pre-training data of a target domain to obtain a second multimodal information extraction model, wherein the multimodal pre-training data is obtained by annotating pre-training data, and the pre-training data is obtained by extracting text and recognizing text boxes from each document in the first document set of the target domain. The second multimodal information extraction model is fine-tuned based on entity annotation data of the target domain to obtain the target multimodal entity information extraction model.

10. The method as described in claim 9, characterized in that, before the entity information extraction is performed based on the image to be extracted, the text file to be extracted, the text boxes to be extracted, and the target multimodal entity information extraction model, the method further includes: converting the document to be extracted into an image to obtain the image to be extracted; extracting text from the image to be extracted to obtain the text file to be extracted; and identifying the area covered by each segment of text on the image to be extracted to obtain the text boxes to be extracted.
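
The three preparation steps of this claim can be sketched as a small pipeline. The claim names no rendering or OCR tool, so the sketch takes both as caller-supplied functions; `render`, `ocr`, `ExtractionInputs`, and `prepare_inputs` are hypothetical names, and the assumption that OCR returns `(text, box)` pairs is made only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) format


@dataclass
class ExtractionInputs:
    image: Any           # the image to be extracted
    text: List[str]      # text sequence read from the image
    boxes: List[Box]     # one location box per text segment


def prepare_inputs(
    document: Any,
    render: Callable[[Any], Any],
    ocr: Callable[[Any], List[Tuple[str, Box]]],
) -> ExtractionInputs:
    """Convert a document to an image, then read text segments and their boxes."""
    image = render(document)
    segments = ocr(image)
    return ExtractionInputs(
        image=image,
        text=[text for text, _ in segments],
        boxes=[box for _, box in segments],
    )
```

Keeping the text sequence and the boxes index-aligned means the i-th text segment always corresponds to the i-th box, which a model consuming both inputs would rely on.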

11. The method according to any one of claims 9-10, characterized in that the method further includes: merging multiple target entity fragments that span multiple lines to obtain a single entity object.

12. The method as described in claim 11, characterized in that the step of merging multiple target entity fragments that span multiple lines to obtain a single entity object includes: determining, based at least on the entity labels and entity positions of the multiple target entity fragments, that the multiple target entity fragments belong to a single entity object to be extracted.

13. The method as described in claim 12, characterized in that the step of determining, based at least on the entity labels and entity positions of the multiple target entity fragments, that the multiple target entity fragments belong to a single entity object to be extracted includes: if it is confirmed that the entity labels of the multiple target entity fragments are all the same, that the entity positions of the multiple target entity fragments are adjacent, and that the merging of all entity labels corresponding to the multiple target entity fragments satisfies the predetermined annotation specifications, confirming that the multiple target entity fragments belong to one entity object.
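
The three merging conditions of this claim can be sketched as below. The sketch assumes each fragment is a dictionary with `text`, `label`, and `line` keys, treats "adjacent" as consecutive line indices, and leaves the annotation specification check to a caller-supplied predicate `spec_ok`, because the claim does not define the specification. All of these names are hypothetical.

```python
def merge_fragments(fragments, spec_ok=lambda group: True):
    """Merge fragments that share a label, sit on adjacent lines, and pass spec_ok.

    fragments: list of {"text": str, "label": str, "line": int}.
    spec_ok: predicate over a candidate group, standing in for the
    predetermined annotation specification.
    """
    groups = []
    for fragment in sorted(fragments, key=lambda f: f["line"]):
        if groups:
            group = groups[-1]
            last = group[-1]
            if (
                fragment["label"] == last["label"]
                and fragment["line"] - last["line"] == 1
                and spec_ok(group + [fragment])
            ):
                group.append(fragment)
                continue
        groups.append([fragment])
    return [
        {
            "label": group[0]["label"],
            "text": " ".join(f["text"] for f in group),
            "lines": [f["line"] for f in group],
        }
        for group in groups
    ]
```

A fragment that fails any one of the three conditions starts a new entity object, so two fragments with the same label that are separated by another line stay separate.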

14. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the method described in any one of claims 1-13 is implemented.

15. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the program, the method described in any one of claims 1-13 is implemented.