An information extraction method and device, electronic equipment and storage medium

By processing document information through entity classification and entity association models, the entity categories and key-value pair relationships of document objects are identified and extracted, solving the problem of the lack of universality in the extraction of structured information in existing technologies and achieving efficient extraction from various documents.

CN115630166B · Active · Publication Date: 2026-06-16 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2022-11-09
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies are not applicable to extracting structured information from various documents and lack versatility.

Method used

By employing entity classification and entity association models, and using feature representation networks and entity classification networks to process document information, we can identify the entity categories of document objects and determine key-value pair relationships, thereby extracting structured information.

Benefits of technology

It achieves universality in effectively extracting structured information from various documents, improving the accuracy and efficiency of information extraction.



Abstract

Embodiments of the present application disclose an information extraction method and device, electronic equipment and storage medium. The method comprises: obtaining document information of a target document, and an entity classification model and an entity association model which have been trained; the entity classification model comprises a feature representation network and an entity classification network; inputting the document information into the feature representation network to obtain document features, and inputting the document features into the entity classification network to obtain an entity category of each document object in the target document; inputting key position features corresponding to key position entities and value position features corresponding to value position entities in the document features into the entity association model to obtain key-value pair relationships between each document object under the key position entities and each document object under the value position entities; and extracting at least two document objects with the key-value pair relationships as structured information from the target document, so as to effectively extract structured information from various documents.

Description

Technical Field

[0001] The present invention relates to the field of computer technology, and in particular to an information extraction method, apparatus, electronic device and storage medium.

Background Technology

[0002] Documents (such as various certificates, tickets, forms, and reports) contain a large amount of text, layout, and formatting information, making them a very common and important carrier of information in people's daily work and life.

[0003] To make better use of documents, current methods mainly rely on manually defined rules or templates to extract structured information from them.

[0004] In the process of realizing this invention, the inventors found the following technical problem in the prior art: such methods can only be applied to extracting structured information from documents with a specific layout or category; that is, the process of extracting structured information is not universal.

Summary of the Invention

[0005] This invention provides an information extraction method, apparatus, electronic device, and storage medium to achieve effective extraction of structured information applicable to various documents.

[0006] In a first aspect, embodiments of the present invention provide an information extraction method, including:

[0007] Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network;

[0008] The document information is input into the feature representation network to obtain document features, and the document features are input into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities;

[0009] The key features corresponding to the key entity and the value features corresponding to the value entity in the document features are input into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0010] At least two document objects having the key-value pair relationship are extracted from the target document as structured information.

[0011] In a second aspect, embodiments of the present invention also provide an information extraction device, comprising:

[0012] The information model acquisition module is used to acquire document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network;

[0013] The entity category determination module is used to input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities;

[0014] The key-value pair relationship determination module is used to input the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0015] The structured information extraction module is used to extract at least two document objects with the key-value pair relationship from the target document as structured information.

[0016] In a third aspect, embodiments of the present invention also provide an electronic device, comprising:

[0017] One or more processors;

[0018] A storage device for storing one or more programs;

[0019] When the one or more programs are executed by the one or more processors, the one or more processors implement the information extraction method as described in any of the embodiments of the present invention.

[0020] In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the information extraction method as described in any of the embodiments of the present invention.

[0021] The technical solution of this invention involves inputting the acquired document information into the feature representation network of an entity classification model to obtain document features, and then inputting these document features into the entity classification network of the entity classification model to obtain the entity category of each document object in the target document. Further, the key features corresponding to key entities and the value features corresponding to value entities in the document features are input into an entity association model to obtain the key-value pair relationships between document objects under key entities and document objects under value entities. At least two document objects with key-value pair relationships are then extracted from the target document as structured information. It is understood that entity classification models used for key and value entity classification can be applied to entity classification of document objects in various documents. Based on this, combined with an entity association model that can be used to perform key-value association between document objects under key entities and document objects under value entities, structured information can be extracted from various documents, thereby ensuring the universality of structured information extraction.

Attached Figure Description

[0022] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale.

[0023] Figure 1 is a flowchart illustrating an information extraction method provided in an embodiment of the present invention;

[0024] Figure 2 is a flowchart illustrating another information extraction method provided in an embodiment of the present invention;

[0025] Figure 3 is a schematic diagram of field-level entity category alignment provided in an embodiment of the present invention;

[0026] Figure 4 is a schematic diagram of field-level position coordinate alignment provided in an embodiment of the present invention;

[0027] Figure 5 is a flowchart illustrating a document structured information extraction method provided in an embodiment of the present invention;

[0028] Figure 6 is a flowchart of a character-level-field-level entity alignment module provided in an embodiment of the present invention;

[0029] Figure 7 is a flowchart illustrating another information extraction method provided in an embodiment of the present invention;

[0030] Figure 8 is a flowchart illustrating another information extraction method provided in an embodiment of the present invention;

[0031] Figure 9 is a flowchart of an upstream pre-training task provided by an embodiment of the present invention;

[0032] Figure 10 is a flowchart of a downstream application task provided in an embodiment of the present invention;

[0033] Figure 11 is a schematic diagram of the structure of an information extraction device provided in an embodiment of the present invention;

[0034] Figure 12 is a schematic diagram of the structure of the electronic device provided in the embodiment of the present invention.

Detailed Implementation

[0035] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0036] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit the steps shown. The scope of this disclosure is not limited in this respect.

[0037] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0038] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0039] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0040] Figure 1 is a flowchart illustrating an information extraction method provided in an embodiment of the present invention. This embodiment is applicable to the extraction of structured information from documents. The method can be executed by an information extraction device provided in the present invention. The information extraction device can be implemented in the form of software and/or hardware, or, optionally, by an electronic device such as a terminal device or a server.

[0041] As shown in Figure 1, the method in this embodiment includes:

[0042] S110. Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network.

[0043] In this embodiment of the invention, the target document refers to the document for which structured information extraction is to be performed, such as various certificates, tickets, forms, reports, etc. The type of the target document may include, but is not limited to, PDF, doc, JPG, etc. Document information refers to information associated with the target document, and may include at least one of image information, text information, and text location information. In some embodiments, document information may also be feature information of image information, text information, and text location information, such as image visual features, text semantic features, and location layout features; alternatively, document information may be a fusion of the aforementioned image visual features, text semantic features, and location layout features. In some embodiments, document information may also be feature vectors obtained by vectorizing feature information, etc., which is not limited in this embodiment.

[0044] Specifically, the target document can be parsed using a parsing tool to extract text and text location information. For example, the parsing tool could be a PDF parsing library available for Python. Alternatively, OCR (Optical Character Recognition) methods can be used to recognize characters in the target document and obtain text information. Furthermore, document information can also be the information obtained after feature extraction from the text, the text location information, etc. This embodiment does not limit the method for obtaining document information. Additionally, the trained entity classification model and entity association model can be retrieved from a preset storage path; the storage path for the entity classification model and entity association model is not limited here.

[0045] S120. Input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities.

[0046] In this embodiment of the invention, the entity classification model refers to a network model for classifying entities in document information. This model includes a pre-trained feature representation network and an entity classification network. The specific network architecture of the feature representation network and the entity classification network is not limited here; for example, the feature representation network can be a transformer encoder network, etc. Document information can be used as input data to the feature representation network, which then outputs document features. Document features refer to information that can characterize the features of document information. For example, document features can be one or more of the following: visual features corresponding to image information, linguistic features corresponding to text information, and spatial location features corresponding to text location information; document features can also be a fusion of multiple extracted features. This embodiment does not limit the categories of document features obtained.

[0047] Furthermore, document features can be used as input data for an entity classification network, which then outputs the entity category of each document object in the target document. Here, a document object refers to the object information to be extracted from the target document, which may include, but is not limited to, characters and fields. Each document object can have a corresponding entity category. In this embodiment of the invention, the entity category refers to the type of the document object, which at least includes key entities or value entities.

[0048] In some embodiments, entity categories may also include header entities and other entities, which can also be identified through an entity classification network.
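The classification step described above can be sketched as follows. This is a minimal illustration only, assuming the document features are already available as plain vectors and that the entity classification network is a single linear layer followed by a softmax; the names `CATEGORIES`, `softmax`, and `classify_objects`, as well as the weights, are hypothetical, and the embodiment does not limit the actual network structure.

```python
import math
from typing import List, Sequence, Tuple

# Category names follow the text; their order here is an arbitrary assumption.
CATEGORIES = ["key", "value", "header", "other"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_objects(
    features: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[Tuple[str, float]]:
    """Hypothetical linear classification head.

    features: one feature vector per document object.
    weights:  one weight row per entry of CATEGORIES.
    Returns (category, confidence) for every document object.
    """
    results = []
    for feature in features:
        logits = [
            sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, biases)
        ]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((CATEGORIES[best], probs[best]))
    return results
```

In a trained model the weights would come from training rather than being set by hand; the sketch only shows how per-object features map to one entity category each.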

[0049] S130. Input the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0050] In this embodiment of the invention, the entity association model can be used to determine the key-value pair relationships between document objects in a target document. The key features corresponding to each document object under the key entity and the value features corresponding to each document object under the value entity can be used as input data for the entity association model. The entity association model then outputs the key-value pair relationships between document objects; in other words, the entity association model can obtain document objects with key-value pair relationships.

[0051] For example, if the key features corresponding to the document object "Name", the key features corresponding to the document object "Phone Number", the value features corresponding to the document object "Zhang San", and the value features corresponding to the document object "10086" are input into the entity association model, then the output will show that there is a key-value pair relationship between "Name" and "Zhang San", and there is a key-value pair relationship between "Phone Number" and "10086".
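The example above can be sketched in code. The embodiment does not specify how the entity association model scores a key against a value, so this sketch assumes a simple dot-product score and links each value to its highest-scoring key; the function names and the two-dimensional feature vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def associate(
    key_features: Dict[str, Sequence[float]],
    value_features: Dict[str, Sequence[float]],
) -> Dict[str, List[str]]:
    """Link every value object to the key object with the highest score.

    The dot-product score is an assumption standing in for the trained
    entity association model.
    """
    pairs: Dict[str, List[str]] = {key: [] for key in key_features}
    for value_name, value_vec in value_features.items():
        best_key = max(
            key_features, key=lambda k: dot(key_features[k], value_vec)
        )
        pairs[best_key].append(value_name)
    return pairs


# Made-up feature vectors for the objects named in the example.
keys = {"Name": [1.0, 0.0], "Phone Number": [0.0, 1.0]}
values = {"Zhang San": [0.9, 0.1], "10086": [0.2, 0.8]}
relations = associate(keys, values)
```

With these made-up vectors, "Zhang San" is linked to "Name" and "10086" to "Phone Number", matching the relationships described in the example.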

[0052] S140. Extract at least two document objects with the key-value pair relationship from the target document as structured information.

[0053] For example, document objects with key-value pair relationships such as "name" and "Zhang San" and "Li Si" can be extracted from the target document as structured information, thereby realizing the extraction of structured information from the document.
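Steps S110 to S140 can be sketched end to end as below. This is only an outline under stated assumptions: the three callables stand in for the trained feature representation network, entity classification network, and entity association model, and the function name and the index-pair output format of the associator are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def extract_structured_info(
    objects: Sequence[str],
    document_info: Sequence[Vector],
    feature_net: Callable[[Sequence[Vector]], List[Vector]],
    classifier: Callable[[List[Vector]], List[str]],
    associator: Callable[[List[Vector], List[Vector]], List[Tuple[int, int]]],
) -> List[Tuple[str, str]]:
    """Return (key text, value text) pairs for one target document."""
    # S120: document information -> document features -> entity categories.
    features = feature_net(document_info)
    categories = classifier(features)

    # Select the features of key entities and of value entities.
    key_idx = [i for i, c in enumerate(categories) if c == "key"]
    value_idx = [i for i, c in enumerate(categories) if c == "value"]

    # S130: the associator returns (key position, value position) links,
    # indexed into the two feature lists passed to it.
    links = associator(
        [features[i] for i in key_idx],
        [features[i] for i in value_idx],
    )

    # S140: emit the linked document objects as structured information.
    return [(objects[key_idx[k]], objects[value_idx[v]]) for k, v in links]
```

Objects classified as neither key nor value (for example header or other entities) are never passed to the associator, so they cannot appear in the extracted pairs.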

[0054] Based on the above embodiments, obtaining document information of a target document includes: obtaining a document image of the target document; performing object detection on the document image to obtain the object position of the document object in the document image; and performing object recognition on the document object located at the object position to obtain the object content of the document object; using at least two modal features from the image visual features of the document image, the positional layout features of the object position, and the content semantic features of the object content as document information of the target document; and inputting the document information into the feature representation network to obtain document features includes: inputting the document information into the feature representation network to fuse the features represented by the document information in the at least two modalities to obtain multimodal fusion features, and using the multimodal fusion features as document features.

[0055] Here, "document image" refers to the image format of the target document. Specifically, the target document can be converted into an image format document image using conversion tools. For example, if the target document is in PDF format, a PDF to image tool can be used to convert the PDF target document into an image format document image; alternatively, the document image of the target document can be obtained by taking a picture or scanning it. "Object position" can be understood as the position coordinates of the text object. "Object content" refers to the specific text content represented by the text object.

[0056] For example, after obtaining the document image of the target document, OCR text recognition can be performed on the document image to obtain the position coordinates and the specific text content of the characters or fields in the document image. Further, the document image can be input into a pre-trained convolutional neural network to obtain the image visual features of the document image. The WordPiece algorithm can be used to encode the object content, after which feature projection is performed to obtain the content semantic features of the object content. The object positions of the multiple document objects are normalized, and feature projection is then performed on the text height, the text width, the position coordinates of the four corner points, and the Euclidean distances between the four corner points and the center point of each pair of text boxes to obtain the position layout features.
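The geometric quantities that feed the position layout features can be sketched as follows. The sketch assumes axis-aligned boxes given as (x0, y0, x1, y1), normalization by page width and height to the range 0 to 1, and one possible reading of the pairwise distances (corner to corresponding corner, center to center); all function names are hypothetical, and the learned feature projection that would follow is not shown.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]


def normalize_box(box: Box, page_w: float, page_h: float) -> Box:
    """Scale pixel coordinates to the 0..1 range (an assumed convention)."""
    x0, y0, x1, y1 = box
    return (x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h)


def anchor_points(box: Box) -> List[Point]:
    """Four corners (clockwise from top-left) followed by the center."""
    x0, y0, x1, y1 = box
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1),
            ((x0 + x1) / 2, (y0 + y1) / 2)]


def box_features(box: Box) -> List[float]:
    """Width, height, and the flattened coordinates of the four corners."""
    x0, y0, x1, y1 = box
    corners = anchor_points(box)[:4]
    return [x1 - x0, y1 - y0] + [coord for point in corners for coord in point]


def pair_distances(a: Box, b: Box) -> List[float]:
    """Euclidean distances between matching anchor points of two boxes."""
    return [math.dist(p, q) for p, q in zip(anchor_points(a), anchor_points(b))]
```

Each text box yields ten per-box values, and each pair of boxes yields five distances; in the described method these raw values would then be projected into the feature space.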

[0057] In this embodiment of the invention, at least two features can be selected from the visual features of the document image, the positional layout features of the object location, and the semantic features of the object content as the document information of the target document, thereby realizing the acquisition of multimodal information.

[0058] Furthermore, multimodal document information can be input into a feature representation network to fuse features from at least two modalities represented by the document information, resulting in multimodal fusion features that reflect the correlation of multimodal document information. This allows entity classification models and entity association models to be trained based on a small amount of standard data, thereby improving the versatility of information extraction methods.

[0059] The technical solution of this invention involves inputting the acquired document information into the feature representation network of an entity classification model to obtain document features, and then inputting these document features into the entity classification network of the entity classification model to obtain the entity category of each document object in the target document. Further, the key features corresponding to key entities and the value features corresponding to value entities in the document features are input into an entity association model to obtain the key-value pair relationships between document objects under key entities and document objects under value entities. At least two document objects with key-value pair relationships are then extracted from the target document as structured information. It is understood that entity classification models used for key and value entity classification can be applied to entity classification of document objects in various documents. Based on this, combined with an entity association model that can be used to perform key-value association between document objects under key entities and document objects under value entities, structured information can be extracted from various documents, thereby ensuring the universality of structured information extraction.

[0060] Figure 2 is a flowchart illustrating another information extraction method provided in an embodiment of the present invention. The method in this embodiment can be combined with the various optional schemes in the information extraction methods provided in the above embodiments, and further refines those methods. Optionally, the document object includes document characters. After obtaining the entity category of each document object in the target document, the method further includes: taking the document characters under the key entity and the document characters under the value entity as target characters, and determining the target field where each target character is located; and, for each target field, determining the entity category of the target field according to the entity category of each target character in the target field, wherein the entity category of the document object includes the key entity or the value entity. Inputting the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity includes: inputting the key features corresponding to each target field under the key entity and the value features corresponding to each target field under the value entity into the entity association model to obtain the key-value pair relationship between each target field under the key entity and each target field under the value entity. Extracting at least two document objects with the key-value pair relationship as structured information from the target document includes: extracting at least two target fields with the key-value pair relationship as structured information from the target document.

[0061] As shown in Figure 2, the method in this embodiment includes:

[0062] S210. Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network.

[0063] S220. Input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document character in the target document, wherein the entity category of the document object includes at least key entities or value entities.

[0064] S230. The document characters under the key entity and the document characters under the value entity are taken as target characters, and the target field where each target character is located is determined respectively.

[0065] In this embodiment of the invention, the entity classification model can achieve character-level entity classification, that is, it can obtain the key entities and value entities of document characters. A document character refers to an individual character in the target document, and the target document may include multiple document characters.

[0066] Specifically, the document characters under the key entity and the document characters under the value entity are used as target characters, and the position information of each target character is obtained. The target field can be determined based on the position information of each target character. The position information of the target characters can be obtained through OCR methods or document parsing tools, and is not limited here.

[0067] For example, an electronic device can pre-obtain character attribution information. This information can be that positions 0-5 belong to field A, and positions 6-10 belong to field B. If the position information of the target character is 3, then the target character belongs to field A; if the position information of the target character is 8, then the target character belongs to field B.
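The lookup in this example can be sketched as follows, assuming the character attribution information is stored as sorted, non-overlapping, inclusive position ranges; the table layout and the function name `field_of` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (start position, end position, field name), sorted by start position.
ATTRIBUTION: List[Tuple[int, int, str]] = [(0, 5, "A"), (6, 10, "B")]


def field_of(
    position: int, table: List[Tuple[int, int, str]] = ATTRIBUTION
) -> Optional[str]:
    """Return the field that contains the given character position."""
    starts = [start for start, _, _ in table]
    # Index of the last range whose start is <= position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position <= table[i][1]:
        return table[i][2]
    return None  # the position is not covered by any field
```

Here `field_of(3)` returns "A" and `field_of(8)` returns "B", as in the example, while a position outside every range returns `None`.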

[0068] S240. For each target field, determine the entity category of the target field according to the entity category of each target character in the target field, wherein the entity category of the document object includes the key entity or the value entity.

[0069] It is understood that a target field can consist of multiple target characters. When the entity category of each target character is known, the entity category of the unknown target field can be determined based on the entity category of the target character. In some embodiments, the entity category with the most occurrences within the target field can be used as the entity category of the target field. In some embodiments, the confidence level of the entity category of each target character within the target field can also be obtained, and the entity category with the highest confidence level can be selected as the entity category of the target field; this is not limited here.
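The two strategies mentioned above can be sketched as follows. The function names and the input shapes are hypothetical: the first takes one category per target character, the second takes a (category, confidence) pair per target character.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def category_by_count(char_categories: Sequence[str]) -> str:
    """Pick the entity category that occurs most often within the field."""
    return Counter(char_categories).most_common(1)[0][0]


def category_by_confidence(char_predictions: List[Tuple[str, float]]) -> str:
    """Pick the category of the single most confident character prediction."""
    category, _confidence = max(char_predictions, key=lambda p: p[1])
    return category
```

The two strategies can disagree: for a field whose characters are predicted as ("key", 0.55), ("key", 0.60), and ("value", 0.95), counting yields "key" while the confidence rule yields "value".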

[0070] Based on the above embodiments, determining the entity category of the target field according to the entity category of each target character in the target field includes: determining a first number of target characters belonging to the key entity and a second number of target characters belonging to the value entity in the target field; determining the entity category of the target field based on the numerical relationship between the first number and the second number; or, classifying the target characters belonging to the key entity in the target field to the key field whose entity category is the key entity, and classifying the target characters belonging to the value entity to the value field whose entity category is the value entity, and updating the target field based on the key field and the value field to obtain the entity category of the target field.

[0071] For example, Figure 3 is a schematic diagram of field-level entity category alignment provided by an embodiment of the present invention. This field-level entity category alignment method can also be referred to as an entity category voting mechanism. Suppose the obtained target field is "contact phone number", which in the original Chinese consists of the four characters romanized as "lian", "xi", "dian", and "hua". The character-level entity classification model determines that the entity category of "lian", "xi", and "dian" is the key entity (key) and that the entity category of "hua" is the value entity (value). Because the number of target characters belonging to the key entity is greater than the number belonging to the value entity, the entity category of the target field is determined to be the key entity. Alternatively, suppose the obtained target field is "Name: Zhang San". "Name" can be classified into a key field whose entity category is the key entity, and "Zhang San" into a value field whose entity category is the value entity, thereby re-dividing the target field. In this way, field alignment can still be achieved in situations where the voting mechanism cannot make a judgment, which improves the accuracy of field alignment.
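The voting mechanism and the field re-division can be sketched together. This sketch assumes that re-division is applied only when the two counts are equal, which is one reading of "cannot make a judgment"; the function name is hypothetical, and romanized tokens stand in for individual characters.

```python
from typing import List, Sequence, Tuple


def align_field(
    chars: Sequence[str], categories: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return the resulting field(s) as (text, entity category) pairs."""
    first_number = sum(1 for c in categories if c == "key")
    second_number = sum(1 for c in categories if c == "value")

    if first_number != second_number:
        # Voting: the whole field takes the majority category.
        winner = "key" if first_number > second_number else "value"
        return [("".join(chars), winner)]

    # Tie (assumed trigger): split the field into a key field and a value field.
    key_text = "".join(ch for ch, c in zip(chars, categories) if c == "key")
    value_text = "".join(ch for ch, c in zip(chars, categories) if c == "value")
    return [(key_text, "key"), (value_text, "value")]


voted = align_field(["lian", "xi", "dian", "hua"],
                    ["key", "key", "key", "value"])
split = align_field(["xing", "ming", "zhang", "san"],
                    ["key", "key", "value", "value"])
```

In the first call three of the four characters are key entities, so the whole field is labeled as a key entity; in the second call the counts tie, so the field is divided into a key field and a value field.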

[0072] S250. Input the key features corresponding to each of the target fields under the key entity and the value features corresponding to each of the target fields under the value entity in the document features into the entity association model, and obtain the key-value pair relationship between each of the target fields under the key entity and each of the target fields under the value entity.

[0073] In an embodiment of the present invention, the key features corresponding to each of the target fields under the key entity and the value features corresponding to each of the target fields under the value entity in the document features can be used as the input data of the entity association model. Furthermore, through the entity association model, the key-value pair relationship between each of the target fields under the key entity and each of the target fields under the value entity is output. In other words, through the entity association model, target fields with a key-value pair relationship can be obtained.

[0074] Exemplarily, if the key features corresponding to the target field "phone number", the value features corresponding to the target field "010 - 12345678", the key features corresponding to the target field "name", and the value features corresponding to the target field "Zhang San" are input into the entity association model, then it can be determined that there is a key-value pair relationship between the two target fields "phone number" and "010 - 12345678", and there is a key-value pair relationship between the two target fields "name" and "Zhang San".
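As a hedged illustration of how the output of the entity association model might be consumed, the sketch below assumes that the model yields a pairwise score for every (key field, value field) combination and that a score at or above a threshold indicates a key-value pair relationship. The score matrix, the threshold of 0.5, and the function name `select_pairs` are assumptions; this embodiment does not prescribe the output format of the model.

```python
def select_pairs(key_fields, value_fields, scores, threshold=0.5):
    """Return (key, value) tuples whose assumed association score passes the threshold.

    scores[i][j] is a hypothetical score relating key_fields[i] to value_fields[j].
    """
    pairs = []
    for i, key in enumerate(key_fields):
        for j, value in enumerate(value_fields):
            if scores[i][j] >= threshold:
                pairs.append((key, value))
    return pairs


# Made-up scores for the example fields in the paragraph above.
keys = ["phone number", "name"]
values = ["010 - 12345678", "Zhang San"]
scores = [[0.9, 0.1],
          [0.2, 0.8]]
pairs = select_pairs(keys, values, scores)
```

With these made-up scores, `pairs` contains ("phone number", "010 - 12345678") and ("name", "Zhang San"), matching the relationships named in the example.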

[0075] S260. Extract at least two of the target fields having the key-value pair relationship as structured information from the target document.

[0076] Based on the above embodiments, after determining the target field where each target character is located, the method further includes: for each target field, determining the character position of each target character in the target document, and concatenating the character positions to obtain the field position of the target field in the target document; extracting at least two target fields with the key-value pair relationship from the target document as structured information includes: using at least two target fields with the key-value pair relationship as structured information, determining the information position of the structured information in the target document based on the field positions of the at least two target fields; and extracting the structured information from the information position in the target document.

[0077] Specifically, for each target field, based on the start and end coordinates of each target character within the target field in the target document, the entity encoding position coordinates of each target character within the target field are determined. Further, the entity encoding position coordinates of each target character within the target field are decoded to obtain the text box position coordinates (i.e., character positions) of each target character within the target field. The text box position coordinates of each target character are then concatenated. As shown in Figure 4, the concatenated coordinates are calculated as follows:

[0078] xmin=min(a_x1,b_x1,c_x1,…,f_x1)

[0079] ymin=min(a_y1,b_y1,c_y1,…,f_y1)

[0080] xmax=max(a_x3,b_x3,c_x3,…,f_x3)

[0081] ymax=max(a_y3,b_y3,c_y3,…,f_y3)

[0082] Among them, xmin = min(a_x1, b_x1, c_x1, …, f_x1) represents taking the minimum value of the horizontal coordinate values of the left vertices of the target characters a - f as the horizontal coordinate value of the left vertex of the field position; ymin = min(a_y1, b_y1, c_y1, …, f_y1) represents taking the minimum value of the vertical coordinate values of the left vertices of the target characters a - f as the vertical coordinate value of the left vertex of the field position; xmax = max(a_x3, b_x3, c_x3, …, f_x3) represents taking the maximum value of the horizontal coordinate values of the right vertices of the target characters a - f as the horizontal coordinate value of the right vertex of the field position; ymax = max(a_y3, b_y3, c_y3, …, f_y3) represents taking the maximum value of the vertical coordinate values of the right vertices of the target characters a - f as the vertical coordinate value of the right vertex of the field position. Finally, the obtained (xmin, ymin) and (xmax, ymax) are used as the position coordinates of the text box.
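The coordinate concatenation above amounts to taking the smallest enclosing rectangle of the character text boxes. A minimal sketch follows, assuming each character box is given as a tuple (x1, y1, x3, y3) holding the two opposite vertices used in the formulas; the function name `merge_char_boxes` and the tuple layout are illustrative assumptions.

```python
def merge_char_boxes(char_boxes):
    """Merge per-character boxes (x1, y1, x3, y3) into one field-level box."""
    if not char_boxes:
        raise ValueError("a field needs at least one character box")
    xmin = min(box[0] for box in char_boxes)  # min over a_x1 ... f_x1
    ymin = min(box[1] for box in char_boxes)  # min over a_y1 ... f_y1
    xmax = max(box[2] for box in char_boxes)  # max over a_x3 ... f_x3
    ymax = max(box[3] for box in char_boxes)  # max over a_y3 ... f_y3
    return (xmin, ymin), (xmax, ymax)


# Three made-up character boxes lying on one text line.
field_box = merge_char_boxes([(10, 5, 20, 15), (20, 4, 30, 16), (30, 6, 40, 14)])
```

For these made-up boxes, `field_box` is ((10, 4), (40, 16)).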

[0083] Based on the above embodiments, the document information includes content information, and the content information is related to the character content of each of the document characters identified from the target document. The method further includes: splicing each of the content information to obtain an information sequence, and recording the sequence index of each content information in the information sequence; respectively determining the character positions of each of the target characters in the target field in the target document, including: for each of the target characters, obtaining the sequence index of the content information corresponding to the target character, and determining the character position of the target character in the target document according to the sequence index.

[0084] In the embodiments of the present invention, the content information refers to information associated with specific document content, including but not limited to character content, encoding results obtained by encoding the character content, content features obtained by projecting the encoding results, feature vectors obtained by vectorizing the content features, etc. Optionally, the information sequence may be a position encoding sequence. It can be understood that the sequence index is the position information of the character content in the information sequence. Through the sequence index, the corresponding character content can be located in the information sequence, and thus the character position of the target character in the target document can be obtained.

[0085] Exemplarily, when the content information is character content, the multiple identified character contents are spliced to obtain an information sequence including the multiple character contents, and the sequence index of each character content in the information sequence is recorded. If the sequence index of the target character "Zhang" is 2, "Zhang" is located at the 3rd position in the information sequence (indices start from 0), which corresponds to the third column of the first row of the target document.
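The role of the sequence index can be sketched as follows, assuming the recognition step yields (character, text box) tuples in reading order. The helper names `build_information_sequence` and `locate_character`, and the sample characters and boxes, are hypothetical.

```python
def build_information_sequence(char_items):
    """Splice recognized characters into a sequence and record each sequence index.

    char_items: list of (character, box) tuples in reading order (assumed input).
    Returns the information sequence and a mapping from sequence index to box.
    """
    sequence = []
    index_to_box = {}
    for index, (char, box) in enumerate(char_items):
        sequence.append(char)
        index_to_box[index] = box
    return sequence, index_to_box


def locate_character(index_to_box, sequence_index):
    """Look up the character position in the document from its sequence index."""
    return index_to_box[sequence_index]


# Made-up recognition result: four characters on the first row of a document.
items = [("xing", (0, 0, 10, 10)), ("ming", (10, 0, 20, 10)),
         ("Zhang", (20, 0, 30, 10)), ("San", (30, 0, 40, 10))]
sequence, index_to_box = build_information_sequence(items)
zhang_box = locate_character(index_to_box, 2)
```

In this made-up input, sequence index 2 selects "Zhang", the third character, and `zhang_box` is its text box in the first row.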

[0086] The technical solution of this invention involves using document characters under key entities and document characters under value entities as target characters, and determining the target field where each target character resides. For each target field, the entity category of the target field is determined based on the entity category of each target character within the target field. The entity category of the document object includes either a key entity or a value entity. The key features corresponding to each target field under the key entity and the value features corresponding to each target field under the value entity are input into an entity association model to obtain the key-value pair relationships between each target field under the key entity and each target field under the value entity. At least two target fields with key-value pair relationships are extracted from the target document as structured information. This invention extracts structured information from the target document using target fields with key-value pair relationships as structured information. Compared to character-level extraction methods, this approach offers stronger overall integration and effectively improves the extraction speed of structured information.

[0087] Refer to Figure 5, which is a flowchart illustrating a method for extracting structured information from documents provided by an embodiment of the present invention. An optional example is also provided based on the above embodiment.

[0088] In this embodiment of the invention, the positional relationships of document characters in the target document contain rich semantic information. For example, forms are usually displayed in the form of key-value pairs. Typically, key-value pairs are arranged horizontally or vertically and have specific type relationships. Through pre-training, this positional information, which is naturally aligned with the text, can provide richer semantic information for downstream information extraction tasks. In addition to the positional relationships of the text itself, text formatting (such as text size, whether italicized, whether bold, and font), and the visual information presented by the overall document image (i.e., image information) can also provide rich visual information for downstream information extraction tasks through pre-training.

[0089] This invention utilizes three upstream pre-training tasks to extract rich image, text, and text location information from a document into different feature representations. These feature representations are then used by two downstream tasks to explore the differences and relationships between the different features. Furthermore, this invention combines these two downstream tasks to design a general document information extraction scheme, which has been applied and implemented in actual business operations.

[0090] The following will describe in detail the general document information extraction scheme based on two downstream tasks: entity classification and association, proposed in the embodiments of the present invention.

[0091] As shown in Figure 5, a document image is taken as input. A text detection and optical character recognition (OCR) model is employed to extract the corresponding text location information and text information from the image. The image information, text location information, and text information are then input into the model to obtain multimodal fusion features. These multimodal fusion features are then used for token-level entity classification prediction. The entity extraction module extracts the token-level key and value entities. Typically, key-value pairs in forms or invoices appear as fields; therefore, token-level entity classification results alone make it difficult to perform the subsequent structured extraction of field-level key-value pairs.

[0092] Therefore, this embodiment of the invention proposes to pass the character-level key entities and character-level value entities output from entity classification through a character-level to field-level entity alignment module to obtain field-level entity classification results. Finally, by inputting the feature representations corresponding to the field-level key entities and field-level value entities into the entity association model, structured key-value pair extraction results are obtained.

[0093] Specifically, as shown in Figure 6, the character-level to field-level entity alignment module proposed in this embodiment of the invention is designed as follows:

[0094] Aligning character-level entity features to their corresponding field-level entity features involves three aspects: alignment of position coordinates, alignment of text content, and alignment of entity category.

[0095] Before entering the model, the text information output by OCR text recognition is encoded into a sequence (i.e., an information sequence) based on individual characters within the text content. This sequence retains the start and end positions of each character within the sequence. After passing through the token-level entity classification prediction model, the entity extraction module extracts the token-level key and value entities. Based on the sequence index of the token-level entities in the key and value, the start and end positions (i.e., start and end coordinates) of each token-level entity within the entire sequence are obtained.

[0096] Field-level position coordinate alignment: Based on the start and end positions, extract the entity code position coordinates of each character (i.e., the target character) in the field (i.e., the target field). After obtaining the entity code position coordinates of each character in the field, decode these character-level entity code position coordinates to obtain the text box position coordinates of each character in the field within the entire image. Then, combine the start and end positions of the field to concatenate and merge the text box coordinates of individual characters. After concatenating the position coordinates of each decoded character, the final position coordinates of the entire field are obtained, i.e., the field-level position coordinates.

[0097] Field-level text content alignment: Based on the start and end positions, extract the text content of each character in the field. After obtaining the text content of each character in the field, concatenate the text content to obtain the field-level text content.

[0098] Field-level entity category alignment: A voting mechanism is used to derive the field-level entity category from the character-level entity categories. Specifically, for the field "contact phone number", after character-level entity classification, the characters and their corresponding entity categories are "lian: key", "xi: key", "dian: key", and "hua: value". According to the voting mechanism, the category that occurs most frequently among the characters in the field is selected as the final category of the field. Therefore, the field "contact phone number" is finally classified as a key entity.

[0099] Furthermore, after completing field-level position coordinate alignment, text content alignment, and entity category alignment, the classification results of character-level key entities and character-level value entities extracted by the entity extraction module are decoded into field-level entity classification results. Then, based on this field-level entity category, the multimodal entity association task is used to predict and extract entities with key-value pairs in the key and value entities, ultimately achieving intelligent structured information extraction from documents.

[0100] Refer to Figure 7, which is a flowchart illustrating another information extraction method provided in this embodiment of the invention. The method in this embodiment can be combined with various optional schemes in the information extraction methods provided in the above embodiments. The information extraction method provided in this embodiment has been further refined. Optionally, the entity classification model is pre-trained through the following steps: taking the sample information of the entity classification document and the entity category labeling results of each sample object in the entity classification document as a set of entity classification samples; training the original classification model based on multiple sets of the entity classification samples to obtain the entity classification model, wherein the original classification model includes an intermediate representation network corresponding to the feature representation network and an original classification network corresponding to the entity classification network. The entity association model is pre-trained through the following steps: for key objects belonging to the key entities and value objects belonging to the value entities in the entity association document, the sample information of the entity association document and the entity association annotation results of each key object and each value object are taken as a set of entity association samples; the original association model is trained based on multiple sets of entity association samples to obtain the entity association model, wherein the entity association model includes an entity association network, and the original association model includes an intermediate representation network corresponding to the feature representation network and an original association network corresponding to the entity association network.

[0101] As shown in Figure 7, the method in this embodiment includes:

[0102] S310. The sample information of the entity classification document and the entity category labeling results of each sample object in the entity classification document are taken as a set of entity classification samples.

[0103] In this embodiment of the invention, the sample information of an entity classification document refers to document information associated with the entity classification document and capable of serving as training samples. For example, the sample information of an entity classification document can be image information, text information, and text location information; alternatively, the sample information can also be image visual features, semantic features, and spatial location features corresponding to image information, text information, and text location information, respectively, without limitation. Furthermore, by parsing or recognizing the sample information, one or more sample objects can be obtained. For example, a sample object can be a character or field in the sample information. The entity category labeling result can be understood as the label of the sample objects in the entity classification document; that is, the entity classification samples are labeled samples and can be used for supervised training of the model.

[0104] S320. The original classification model is trained based on multiple sets of entity classification samples to obtain the entity classification model, wherein the original classification model includes an intermediate representation network corresponding to the feature representation network and an original classification network corresponding to the entity classification network.

[0105] In this embodiment of the invention, the original classification model is trained using multiple sets of entity classification samples. By continuously adjusting the parameters of the intermediate representation network and the original classification network, the distance deviation between the output of the original classification model and the entity category labeling results gradually decreases and stabilizes, resulting in a well-trained entity classification model. This embodiment does not impose any limitations on the specific network structures of the intermediate representation network and the original classification network.

[0106] S330. For the key objects belonging to the key entities and the value objects belonging to the value entities in the entity association document, the sample information of the entity association document and the entity association annotation results of each key object and each value object are taken as a set of entity association samples.

[0107] In this context, the sample information of the entity association document refers to document information associated with the entity association document and capable of serving as training samples. For example, the sample information of the entity association document can be the image information, text information, and text location information within the entity association document; alternatively, it can be the image visual features, semantic features, and spatial location features corresponding to the image information, text information, and text location information, respectively, without limitation. The sample information may include one or more key objects and value objects. The entity association annotation results can be understood as labels for the key objects or value objects; that is, the entity association samples are labeled samples that can be used for supervised training of the model.

[0108] S340. The original association model is trained based on multiple sets of entity association samples to obtain the entity association model, wherein the entity association model includes an entity association network, and the original association model includes an intermediate representation network corresponding to the feature representation network and an original association network corresponding to the entity association network.

[0109] In this embodiment of the invention, the original association model is trained using multiple sets of entity association samples. By continuously adjusting the parameters of the intermediate representation network and the original association network, the distance deviation between the output of the original association model and the entity association annotation results gradually decreases and stabilizes, resulting in a well-trained entity association model. This embodiment does not impose any limitations on the specific network structures of the intermediate representation network and the original association network.

[0110] S350. Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network.

[0111] S360. Input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities.

[0112] S370. Input the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0113] S380. Extract at least two document objects with the key-value pair relationship from the target document as structured information.

[0114] The technical solution of this invention uses sample information from entity classification documents and entity category annotation results for each sample object in the entity classification documents as a set of entity classification samples. Based on multiple sets of entity classification samples, the original classification model is trained to obtain an entity classification model. For key objects belonging to key entities and value objects belonging to value entities in entity association documents, sample information from the entity association documents and entity association annotation results for each key object and each value object are used as a set of entity association samples. Based on multiple sets of entity association samples, the original association model is trained to obtain an entity association model. The above technical solution, through training, obtains both an entity classification model and an entity association model, which is an important prerequisite for the effective extraction of subsequent structured information.

[0115] Refer to Figure 8, which is a flowchart illustrating another information extraction method provided in an embodiment of the present invention. The method in this embodiment can be combined with various optional schemes in the information extraction methods provided in the above embodiments. The information extraction method provided in this embodiment has been further refined. Optionally, the intermediate representation network is obtained by training the original representation network in the pre-trained model, and the pre-trained model includes at least one of a content masking model, an image content matching model, and an image masking model.

[0116] As shown in Figure 8, the method in this embodiment includes:

[0117] S410. Train the original representation network in the pre-trained model to obtain an intermediate representation network. The pre-trained model includes at least one of a content masking model, an image content matching model, and an image masking model.

[0118] In this embodiment of the invention, the pre-trained model refers to a model that has undergone a pre-training task. The original representation network in the pre-trained model refers to the initial network that has not been trained. For example, the original representation network can be a Transformer Encoder network, which has a self-attention mechanism module that can learn the correlations between multiple modalities.

[0119] It should be noted that pre-trained models can perform different pre-training tasks, and these tasks can be processed in parallel, meaning that different pre-training tasks can simultaneously adjust the network parameters of the same pre-trained model. Furthermore, pre-training tasks are upstream application tasks, while multimodal entity classification and multimodal entity association tasks are downstream application tasks. That is, content masking models, image content matching models, and image masking models trained on upstream application tasks can be used to train models in downstream application tasks.

[0120] Figure 9 is a flowchart of the upstream pre-training tasks. Pre-training tasks 1, 2, and 3 in Figure 9 represent multimodal text masking, multimodal image content matching, and multimodal image masking, respectively. Pre-training data construction refers to extracting image, text, and text location information from the target document. Multimodal fusion refers to feature extraction and feature fusion of the extracted image, text, and text location information to obtain multimodal fused features.

[0121] Figure 10 is a flowchart of the downstream application tasks. Downstream Task 1 and Downstream Task 2 in Figure 10 can represent the multimodal entity classification task and the multimodal entity association task, respectively. Downstream labeled data refers to the annotation of the training samples of the entity classification model and the entity association model.

[0122] In this embodiment of the invention, the pre-training task may include, but is not limited to, a multimodal text information masking task, a multimodal image content matching task, and a multimodal image information masking task; the corresponding models are, in turn, a content masking model, an image content matching model, and an image masking model.

[0123] S420. The sample information of the entity classification document and the entity category labeling results of each sample object in the entity classification document are taken as a set of entity classification samples.

[0124] S430. The original classification model is trained based on multiple sets of entity classification samples to obtain the entity classification model, wherein the original classification model includes an intermediate representation network corresponding to the feature representation network and an original classification network corresponding to the entity classification network.

[0125] S440. For the key objects belonging to the key entities and the value objects belonging to the value entities in the entity association document, the sample information of the entity association document and the entity association annotation results of each key object and each value object are taken as a set of entity association samples.

[0126] S450. The original association model is trained based on multiple sets of entity association samples to obtain the entity association model, wherein the entity association model includes an entity association network, and the original association model includes an intermediate representation network corresponding to the feature representation network and an original association network corresponding to the entity association network.

[0127] S460. Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network.

[0128] S470. Input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities.

[0129] S480. Input the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0130] S490. Extract at least two document objects with the key-value pair relationship from the target document as structured information.

[0131] Optionally, the method further includes: identifying each sample object in the first sample document and encoding each identified first sample content to obtain an encoding sequence; replacing some of the lexical codes in the encoding sequence and projecting features onto the replaced encoding sequence to obtain a first semantic feature; inputting the first semantic feature, as well as the first visual feature and the first layout feature of the first sample document, into the original representation network in the content masking model to obtain a first fusion feature; predicting the replaced lexical codes based on the first fusion feature, and adjusting the network parameters in the original representation network of the content masking model based on the obtained replacement prediction result to obtain the intermediate representation network.

[0132] Here, the first sample document refers to the sample data used for training the content masking model. If the first sample content is text, the encoding sequence can be a text encoding sequence. Similarly, if the first sample content is text location content, the encoding sequence can be a text location encoding sequence. The specific form of the first sample content is not limited here.

[0133] For example, the first sample document is parsed or processed by OCR to obtain multiple text contents, and each text content is encoded to obtain a text encoding sequence. A preset proportion of word codes are randomly replaced in the text encoding sequence, and feature projection is performed on the replaced text encoding sequence to obtain the first semantic feature. The first semantic feature, as well as the first visual feature and the first layout feature of the first sample document, are input into the Transformer Encoder network in the content masking model to obtain the first multimodal fusion feature. The replaced word codes are predicted based on the first multimodal fusion feature, and the network parameters in the Transformer Encoder network in the content masking model are adjusted based on the obtained replacement prediction results to train the content masking model, so that the content masking model learns to reconstruct text with the help of the first multimodal fusion feature.
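The random replacement step can be sketched as below. The replacement code `mask_code`, the default proportion of 0.15, and the function name `mask_codes` are assumptions chosen for illustration; this embodiment only states that a preset proportion of word codes is randomly replaced and later predicted.

```python
import random


def mask_codes(codes, mask_code, proportion=0.15, seed=None):
    """Randomly replace a preset proportion of codes and remember the originals.

    Returns the replaced sequence and a dict {position: original code} that
    serves as the prediction target for the replaced positions.
    """
    rng = random.Random(seed)
    n = len(codes)
    k = max(1, round(n * proportion)) if n else 0  # number of codes to replace
    positions = sorted(rng.sample(range(n), k))
    replaced = list(codes)
    targets = {}
    for position in positions:
        targets[position] = replaced[position]
        replaced[position] = mask_code
    return replaced, targets


# Made-up encoding sequence of 20 word codes; -1 is a hypothetical replacement code.
codes = list(range(100, 120))
replaced, targets = mask_codes(codes, mask_code=-1, proportion=0.15, seed=0)
```

The `targets` dictionary holds the original codes that a content masking model would be trained to predict from the fused features.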

[0134] Optionally, the method further includes: acquiring second visual features of sample images of the second sample document and second semantic features of second sample content obtained after recognizing sample objects in the second sample document; forming multiple image content pairings based on the second visual features and second semantic features of multiple second sample documents, wherein the multiple image content pairings include correct pairings and incorrect pairings, the second visual features and second semantic features in the correct pairings correspond to the same second sample document, and the second visual features and second semantic features in the incorrect pairings correspond to different second sample documents; concatenating each of the correct pairings to obtain a correct pairing sequence, and inputting the correct pairing sequence and the second layout features of the second sample document associated with the correct pairing sequence into the original representation network in the image content matching model to obtain the second correct fusion feature corresponding to each correct pairing; classifying the correct pairing sequence as either a correct or incorrect pairing based on each second correct fusion feature, and adjusting the network parameters in the original representation network of the image content matching model according to the obtained first classification result to obtain the intermediate representation network; concatenating each incorrect pairing to obtain an incorrect pairing sequence, and inputting the incorrect pairing sequence and the second layout feature of the second sample document associated with the incorrect pairing sequence into the original representation network of the image content matching model to obtain the second incorrect fusion feature corresponding to each incorrect pairing; classifying the incorrect pairing sequence as either a correct or incorrect pairing based on each second incorrect fusion feature, and adjusting the network parameters in the original representation network of the image content matching model according to the obtained second classification result to obtain the intermediate representation network.

[0135] In this embodiment of the invention, the second sample document refers to sample data used for training the image content pairing model. Optionally, an image content pairing can be a tuple composed of semantic features and visual features.

[0136] For example, the second visual features of the sample images of the second sample documents are obtained through a convolutional neural network, and the sample objects in the second sample documents are recognized through an OCR method to obtain the second sample content. The second sample content is then encoded using a word-piece algorithm, and feature projection is performed after encoding to obtain the second semantic features. Further, based on the second visual features and second semantic features of multiple second sample documents, multiple image content pairings are formed. These multiple image content pairings include correct pairings and incorrect pairings. The second visual features and second semantic features of correct pairings correspond to the same second sample document, while the second visual features and second semantic features of incorrect pairings correspond to different second sample documents. The correct pairings are concatenated to obtain a correct pairing sequence. The correct pairing sequence and the second layout features of the second sample documents associated with the correct pairing sequence are input into the Transformer Encoder network in the image content pairing model to obtain the second correct fusion features corresponding to each correct pairing. The correct pairing sequence is classified as either a correct pairing or an incorrect pairing based on the second correct fusion features, and the network parameters in the Transformer Encoder network in the image content pairing model are adjusted based on the obtained first classification result to obtain the intermediate representation network. The incorrect pairings are concatenated to obtain an incorrect pairing sequence, and the incorrect pairing sequence and the second layout features of the second sample documents associated with the incorrect pairing sequence are input into the Transformer Encoder network in the image content pairing model to obtain the second incorrect fusion features corresponding to each incorrect pairing. The incorrect pairing sequence is classified as either a correct pairing or an incorrect pairing based on each second incorrect fusion feature, and the network parameters in the Transformer Encoder network in the image content pairing model are adjusted based on the obtained second classification result to obtain the intermediate representation network.
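
As an illustration of how correct and incorrect image content pairings might be formed, the following is a minimal sketch only. The function name `build_pairings`, the string placeholders standing in for visual and semantic features, and the choice of one incorrect pairing per document are all assumptions made for illustration and are not specified in this disclosure.

```python
import random
from typing import List, Tuple

# (visual feature placeholder, semantic feature placeholder, label)
# label 1 = correct pairing, label 0 = incorrect pairing
Pairing = Tuple[str, str, int]


def build_pairings(doc_ids: List[str], seed: int = 0) -> List[Pairing]:
    """Form one correct and one incorrect pairing per sample document."""
    if len(doc_ids) < 2:
        raise ValueError("need at least two documents to form an incorrect pairing")
    rng = random.Random(seed)
    pairings: List[Pairing] = []
    for doc_id in doc_ids:
        # Correct pairing: both features come from the same sample document.
        pairings.append((f"visual:{doc_id}", f"semantic:{doc_id}", 1))
        # Incorrect pairing: semantic features come from a different document.
        other = rng.choice([d for d in doc_ids if d != doc_id])
        pairings.append((f"visual:{doc_id}", f"semantic:{other}", 0))
    return pairings


pairings = build_pairings(["doc_a", "doc_b", "doc_c"])
correct = [p for p in pairings if p[2] == 1]
incorrect = [p for p in pairings if p[2] == 0]
```

In this sketch the 0/1 label plays the role of the classification target; the actual features, the concatenation into pairing sequences, and the encoder itself are outside its scope.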

[0137] Optionally, the method further includes: occluding a portion of the image region in the sample image of the third sample document and obtaining a third visual feature of the occluded sample image, wherein the occluded image region corresponds to a corresponding object in the third sample document; inputting the third visual feature, as well as the third semantic feature and the third layout feature of the third sample document, into the original representation network in the image occlusion model to obtain a third fusion feature; predicting whether the alignment object in the third sample document aligned with the occluded image region is occluded, or whether the alignment region in the sample image aligned with the corresponding object is occluded, based on the third fusion feature; and adjusting the network parameters in the original representation network of the image occlusion model based on the obtained occlusion prediction result to obtain the intermediate representation network.

[0138] For example, a predetermined proportion of image regions are randomly selected from the sample image for occlusion, and features are extracted from the occluded sample image using a convolutional neural network to obtain third visual features. The occluded image region corresponds to a corresponding object in the third sample document. The corresponding object can be a pre-set occlusion object, such as School A in the image sample. The third visual features, as well as the third semantic features and third layout features of the third sample document, are input into the Transformer Encoder in the image occlusion model to obtain third fusion features. Based on the third fusion features, it is predicted whether the alignment object in the third sample document aligned with the occluded image region is occluded, or whether the alignment region in the sample image aligned with the corresponding object is occluded. Based on the obtained occlusion prediction results, the network parameters in the Transformer Encoder network of the image occlusion model are adjusted to train the image occlusion model, enabling the image occlusion model to learn the alignment relationship between the image and the text using multimodal features.
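
The random selection of a predetermined proportion of image regions can be sketched as follows. This is only an illustrative sketch: the function name `occlude_regions`, the treatment of the image as a flat list of indexed regions, and the 0/1 label format are assumptions and do not come from this disclosure.

```python
import random
from typing import List, Tuple


def occlude_regions(num_regions: int, ratio: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly pick a fixed proportion of image regions to occlude.

    Returns the sorted indices of the occluded regions and a per-region
    label list (1 = occluded, 0 = visible) that an occlusion prediction
    step could be trained against.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    count = int(num_regions * ratio)
    occluded = sorted(rng.sample(range(num_regions), count))
    occluded_set = set(occluded)
    labels = [1 if i in occluded_set else 0 for i in range(num_regions)]
    return occluded, labels


# Example: a 4 x 4 grid of regions with one quarter of them occluded.
occluded, labels = occlude_regions(num_regions=16, ratio=0.25)
```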

[0139] The technical solution of this invention obtains an intermediate representation network by training the original representation network in the pre-trained model. This intermediate representation network can be used in the downstream training of entity classification and entity association models. This intermediate representation network already possesses the ability to obtain feature representations of certain commonalities between images and text by combining text positional relationships, text content, and image information. This allows downstream training to uncover the differences and associations between different features without requiring too many labeled samples, thereby reducing labeling costs.

[0140] Figure 11 is a schematic diagram of the structure of an information extraction device provided in an embodiment of the present invention. As shown in Figure 11, the device includes:

[0141] The information model acquisition module 510 is used to acquire document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network.

[0142] The entity category determination module 520 is used to input the document information into the feature representation network to obtain document features, and input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities;

[0143] The key-value pair relationship determination module 530 is used to input the key features corresponding to the key entity and the value features corresponding to the value entity in the document features into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0144] The structured information extraction module 540 is used to extract at least two document objects with the key-value pair relationship from the target document as structured information.
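
The cooperation of modules 510 to 540 can be sketched as a simple pipeline. The sketch below is illustrative only: the function `extract_structured_info`, the `classify` and `associate` callables standing in for the trained entity classification model and entity association model, the string labels "key", "value" and "other", and the toy data are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


def extract_structured_info(
    document_objects: List[str],
    classify: Callable[[str], str],
    associate: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Classify each document object, then keep the associated key-value pairs."""
    # Entity category determination (stand-in for module 520).
    categories: Dict[str, str] = {obj: classify(obj) for obj in document_objects}
    keys = [o for o in document_objects if categories[o] == "key"]
    values = [o for o in document_objects if categories[o] == "value"]
    # Key-value relationship determination and extraction (modules 530 and 540).
    return [(k, v) for k in keys for v in values if associate(k, v)]


# Toy stand-ins for the trained models (hypothetical data).
toy_labels = {
    "Name": "key",
    "School": "key",
    "Alice": "value",
    "School A": "value",
    "Page 1": "other",
}
toy_links = {("Name", "Alice"), ("School", "School A")}

result = extract_structured_info(
    list(toy_labels),
    classify=lambda obj: toy_labels[obj],
    associate=lambda k, v: (k, v) in toy_links,
)
```

In this sketch every key object is compared with every value object, which mirrors the statement that the relationship is determined between each document object under the key entity and each document object under the value entity.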

[0145] In some optional implementations of this invention, the document object includes document characters, and the apparatus further includes:

[0146] The target field determination module is used to take the document characters under the key entity and the document characters under the value entity as target characters, and determine the target field where each target character is located;

[0147] The field category determination module is used to determine the entity category of each target field based on the entity category of each target character within the target field, wherein the entity category of the document object includes the key entity or the value entity;

[0148] Correspondingly, the key-value pair relationship determination module 530 is specifically used for:

[0149] The key features corresponding to each target field under the key entity and the value features corresponding to each target field under the value entity in the document features are input into the entity association model to obtain the key-value pair relationship between each target field under the key entity and each target field under the value entity.

[0150] Correspondingly, the structured information extraction module 540 includes:

[0151] The structured information extraction unit is used to extract at least two of the target fields having the key-value pair relationship from the target document as structured information.

[0152] In some optional implementations of this invention, the field category determination module is specifically used for:

[0153] Determine a first number of target characters belonging to the key entity and a second number of target characters belonging to the value entity within the target field; and determine the entity category of the target field based on the numerical relationship between the first number and the second number.

[0154] Or,

[0155] The target characters belonging to the key entity in the target field are assigned to the key field whose entity category is the key entity, and the target characters belonging to the value entity are assigned to the value field whose entity category is the value entity. The target field is then updated based on the key field and the value field to obtain the entity category of the target field.
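
The two alternatives for determining the entity category of a target field can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the string labels "key" and "value", and the tie-breaking rule in the counting variant are hypothetical, since this disclosure only refers to a numerical relationship between the first number and the second number.

```python
from typing import Dict, List


def field_category_by_count(char_categories: List[str]) -> str:
    """Variant 1: compare the counts of key and value characters in the field."""
    first_number = char_categories.count("key")
    second_number = char_categories.count("value")
    # Resolving a tie in favour of "key" is an assumption, not from the source.
    return "key" if first_number >= second_number else "value"


def split_field(chars: List[str], char_categories: List[str]) -> Dict[str, str]:
    """Variant 2: split the field into a key field and a value field."""
    key_field = "".join(c for c, cat in zip(chars, char_categories) if cat == "key")
    value_field = "".join(c for c, cat in zip(chars, char_categories) if cat == "value")
    return {"key": key_field, "value": value_field}


# A field whose first five characters were classified as key, the rest as value.
chars = list("Name:Al")
cats = ["key"] * 5 + ["value"] * 2
majority = field_category_by_count(cats)
parts = split_field(chars, cats)
```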

[0156] In some optional implementations of the embodiments of the present invention, the apparatus further includes:

[0157] The field position determination module is used to determine the character position of each target character in the target field in the target document for each target field, and to concatenate the character positions to obtain the field position of the target field in the target document;

[0158] The structured information extraction unit is specifically used for:

[0159] The key-value pair relationship of at least two target fields is used as structured information, and the information position of the structured information in the target document is determined according to the field position of the at least two target fields.

[0160] The structured information is extracted from the information location in the target document.
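
One possible reading of combining character positions into a field position, and field positions into an information position, is to take the enclosing bounding box. The sketch below assumes that reading; the box format `(x_min, y_min, x_max, y_max)`, the function `merge_boxes`, and the coordinates are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed format


def merge_boxes(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses all of the given boxes."""
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Character positions of a key field and a value field (made-up coordinates).
key_chars = [(10, 20, 18, 32), (18, 20, 26, 32)]
value_chars = [(40, 21, 48, 33), (48, 21, 56, 33)]

key_field_box = merge_boxes(key_chars)
value_field_box = merge_boxes(value_chars)
# Information position covering both fields of the key-value pair.
info_box = merge_boxes([key_field_box, value_field_box])
```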

[0161] In some optional implementations of this invention, the document information includes content information, which is related to the character content of each document character identified from the target document. The apparatus is further configured to:

[0162] The content information is concatenated to obtain an information sequence, and the sequence index of each content information in the information sequence is recorded;

[0163] Determining the character position of each target character within the target field in the target document includes:

[0164] For each target character, obtain the sequence index of the content information corresponding to the target character, and determine the character position of the target character in the target document based on the sequence index.
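
The use of a sequence index to recover a character position can be sketched as follows. This is an illustrative sketch only: the record layout of `(content, box)`, the function `build_information_sequence`, and the sample coordinates are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed format


def build_information_sequence(
    records: List[Tuple[str, Box]],
) -> Tuple[List[str], Dict[int, Box]]:
    """Concatenate content information and record the index of each item."""
    sequence: List[str] = []
    index_to_box: Dict[int, Box] = {}
    for index, (content, box) in enumerate(records):
        sequence.append(content)
        # The sequence index is the link back to the position in the document.
        index_to_box[index] = box
    return sequence, index_to_box


records = [
    ("N", (10, 20, 18, 32)),
    ("o", (18, 20, 26, 32)),
    (":", (26, 20, 30, 32)),
    ("7", (40, 20, 48, 32)),
]
sequence, index_to_box = build_information_sequence(records)

# Position lookup for the target character at sequence index 3.
target_index = 3
target_position = index_to_box[target_index]
```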

[0165] In some optional implementations of this invention, the information model acquisition module 510 is specifically used for:

[0166] Obtain a document image of the target document, perform object detection on the document image to obtain the object position of the document object in the document image, and perform object recognition on the document object located at the object position to obtain the object content of the document object;

[0167] The document information of the target document is defined as features from at least two modalities among the visual features of the document image, the positional layout features of the object location, and the semantic features of the object content.

[0168] Correspondingly, the entity category determination module 520 is specifically used for:

[0169] The document information is input into the feature representation network to fuse the features of at least two modalities represented by the document information to obtain multimodal fusion features, and the multimodal fusion features are used as document features.

[0170] In some optional implementations of the embodiments of the present invention, the apparatus further includes:

[0171] The classification sample determination module is used to take the sample information of the entity classification document and the entity category labeling result of each sample object in the entity classification document as a set of entity classification samples;

[0172] The classification model training module is used to train the original classification model based on multiple sets of entity classification samples to obtain the entity classification model, wherein the original classification model includes an intermediate representation network corresponding to the feature representation network and an original classification network corresponding to the entity classification network.

[0173] In some optional implementations of the embodiments of the present invention, the apparatus further includes:

[0174] The association sample determination module is used to, for key objects belonging to the key entity and value objects belonging to the value entity in the entity association document, take the sample information of the entity association document and the entity association annotation results of each key object and each value object as a set of entity association samples.

[0175] The association model training module is used to train the original association model based on multiple sets of entity association samples to obtain the entity association model. The entity association model includes an entity association network, and the original association model includes an intermediate representation network corresponding to the feature representation network and an original association network corresponding to the entity association network.

[0176] In some optional implementations of the present invention, the intermediate representation network is obtained by training the original representation network in the pre-trained model, and the pre-trained model includes at least one of the content masking model, the image content pairing model, and the image occlusion model.

[0177] In some optional implementations of the embodiments of the present invention, the apparatus is further configured to:

[0178] Each sample object in the first sample document is recognized, and the first sample content obtained by recognition is encoded to obtain an encoding sequence. Some of the word codes in the encoding sequence are replaced, and the replaced encoding sequence is projected to obtain the first semantic feature;

[0179] The first semantic feature, as well as the first visual feature and the first layout feature of the first sample document, are input into the original representation network of the content masking model to obtain the first fused feature;

[0180] The replaced word codes are predicted based on the first fusion feature, and the network parameters in the original representation network of the content masking model are adjusted based on the obtained replacement prediction result to obtain the intermediate representation network.
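
The replacement of some word codes in the encoding sequence can be sketched as follows. This is a minimal sketch under stated assumptions: the placeholder value `MASK_ID`, the function `replace_word_codes`, the replacement ratio, and the sample codes are hypothetical and not taken from this disclosure.

```python
import random
from typing import Dict, List, Tuple

MASK_ID = 0  # hypothetical placeholder code used for replaced positions


def replace_word_codes(
    codes: List[int], ratio: float, seed: int = 0
) -> Tuple[List[int], Dict[int, int]]:
    """Replace a proportion of word codes and keep the originals as targets."""
    if not codes:
        raise ValueError("the encoding sequence must not be empty")
    rng = random.Random(seed)
    # Always replace at least one code so there is something to predict.
    count = min(len(codes), max(1, int(len(codes) * ratio)))
    positions = sorted(rng.sample(range(len(codes)), count))
    replaced = list(codes)
    targets: Dict[int, int] = {}
    for pos in positions:
        targets[pos] = codes[pos]  # original code the model should predict
        replaced[pos] = MASK_ID
    return replaced, targets


codes = [101, 2054, 2003, 1996, 2171, 102]
replaced, targets = replace_word_codes(codes, ratio=0.3)
```

Here `targets` maps each replaced position to its original word code, which is the quantity a replacement prediction would be compared against when adjusting the network parameters.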

[0181] In some optional implementations of the embodiments of the present invention, the apparatus is further configured to:

[0182] The second visual features of the sample image of the second sample document and the second semantic features of the second sample content obtained after recognizing the sample objects in the second sample document are obtained;

[0183] Based on the second visual features and second semantic features of multiple second sample documents, multiple image content pairings are formed, wherein the multiple image content pairings include correct pairings and incorrect pairings, the second visual features and second semantic features in the correct pairings correspond to the same second sample document, and the second visual features and second semantic features in the incorrect pairings correspond to different second sample documents;

[0184] The correct pairings are concatenated to obtain a correct pairing sequence. The correct pairing sequence and the second layout features of the second sample document associated with the correct pairing sequence are then input into the original representation network of the image content pairing model to obtain the second correct fusion features corresponding to each correct pairing.

[0185] The correct pairing sequence is classified as either a correct pairing or an incorrect pairing based on each of the second correct fusion features, and the network parameters in the original representation network of the image content pairing model are adjusted based on the obtained first classification result to obtain the intermediate representation network.

[0186] The various incorrect pairings are concatenated to obtain an incorrect pairing sequence. The incorrect pairing sequence and the second layout features of the second sample document associated with the incorrect pairing sequence are then input into the original representation network of the image content pairing model to obtain the second error fusion features corresponding to each incorrect pairing.

[0187] The incorrect pairing sequence is classified as either a correct pairing or an incorrect pairing based on each of the second error fusion features, and the network parameters in the original representation network of the image content pairing model are adjusted according to the obtained second classification results to obtain the intermediate representation network.

[0188] In some optional implementations of the embodiments of the present invention, the apparatus is further configured to:

[0189] A portion of the image region in the sample image of the third sample document is occluded, and the third visual feature of the occluded sample image is obtained, wherein the occluded image region corresponds to the corresponding object in the third sample document;

[0190] The third visual feature, as well as the third semantic feature and the third layout feature of the third sample document, are input into the original representation network of the image occlusion model to obtain the third fusion feature;

[0191] Based on the third fusion feature, it is predicted whether the alignment object in the third sample document that is aligned with the occluded image region is occluded, or whether the alignment region in the sample image that is aligned with the corresponding object is occluded. Based on the obtained occlusion prediction results, the network parameters in the original representation network of the image occlusion model are adjusted to obtain the intermediate representation network.

[0192] The information extraction device provided in this embodiment of the invention can execute the information extraction method provided in any embodiment of this disclosure, and has the corresponding functional modules and beneficial effects for executing the information extraction method.

[0193] It is worth noting that the various units and modules included in the above-mentioned device are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the specific names of each functional unit are only for easy differentiation and are not used to limit the protection scope of the embodiments of the present invention.

[0194] Referring now to Figure 12, a schematic structural diagram is shown of an electronic device 400 (e.g., the terminal device or server in Figure 12) suitable for implementing embodiments of the present invention. The terminal device in this embodiment may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 12 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.

[0195] As shown in Figure 12, electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 401, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 402 or a program loaded from storage device 408 into random access memory (RAM) 403. RAM 403 also stores various programs and data required for the operation of electronic device 400. The processing device 401, ROM 402, and RAM 403 are interconnected via bus 404. Input/output (I/O) interface 405 is also connected to bus 404.

[0196] Typically, the following devices can be connected to I/O interface 405: input devices 406 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 407 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 408 including, for example, magnetic tapes, hard disks, etc.; and communication devices 409. Communication device 409 allows electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 12 shows an electronic device 400 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.

[0197] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 409, or installed from a storage device 408, or installed from a ROM 402. When the computer program is executed by the processing device 401, it performs the functions defined in the methods of the embodiments of the present invention.

[0198] The electronic device provided in this embodiment of the invention and the information extraction method provided in the above embodiments belong to the same inventive concept. Technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.

[0199] This invention provides a computer storage medium storing a computer program that, when executed by a processor, implements the information extraction method provided in the above embodiments.

[0200] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.

[0201] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0202] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.

[0203] The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to:

[0204] Obtain the document information of the target document, as well as the trained entity classification model and entity association model, wherein the entity classification model includes a feature representation network and an entity classification network;

[0205] The document information is input into the feature representation network to obtain document features, and the document features are input into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of the document object includes at least key entities or value entities;

[0206] The key features corresponding to the key entity and the value features corresponding to the value entity in the document features are input into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity.

[0207] At least two document objects having the key-value pair relationship are extracted from the target document as structured information.

[0208] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0209] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0210] The units described in the embodiments of the present invention can be implemented in software or hardware. The names of the units / modules do not necessarily limit the specific unit itself.

[0211] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0212] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0213] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, the scope also covers technical solutions formed by interchanging the above features with (but not limited to) technical features having similar functions disclosed in this disclosure.

[0214] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0215] Although the subject matter has been described using language specific to structural features and/or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. An information extraction method, characterized in that it comprises: obtaining document information of a target document, as well as a trained entity classification model and a trained entity association model, wherein the entity classification model includes a feature representation network and an entity classification network; inputting the document information into the feature representation network to obtain document features, and inputting the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of a document object includes at least a key entity or a value entity, and the document objects include document characters; taking the document characters under the key entity and the document characters under the value entity as target characters, and obtaining the position information of each target character; determining, based on the position information of each target character, the target field in which each target character is located; for each target field, determining the entity category of the target field according to the entity category of each target character in the target field, wherein the entity category of the target field includes the key entity or the value entity; inputting, from the document features, the key features corresponding to each target field under the key entity and the value features corresponding to each target field under the value entity into the entity association model to obtain the key-value pair relationship between each target field under the key entity and each target field under the value entity; and extracting, from the target document, at least two target fields that have the key-value pair relationship as structured information.
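Claim 1 does not fix how position information maps target characters to target fields. As an illustration only, the following minimal sketch assumes each character carries an axis-aligned bounding box and that characters on the same text line with a small horizontal gap belong to one field. The `Char` class, `group_into_fields`, and the `max_gap` and `min_overlap` thresholds are hypothetical names and values, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge
    category: str  # "key" or "value", as predicted by the classifier


def same_line(a, b, min_overlap):
    """True if the vertical overlap covers enough of the shorter box."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shorter = min(a.y1 - a.y0, b.y1 - b.y0)
    return shorter > 0 and overlap / shorter >= min_overlap


def group_into_fields(chars, max_gap=5.0, min_overlap=0.5):
    """Greedily attach each character to a field whose last character
    is on the same line and horizontally adjacent (assumed heuristic)."""
    fields = []
    for ch in sorted(chars, key=lambda c: (c.y0, c.x0)):
        for field in fields:
            last = field[-1]
            if same_line(last, ch, min_overlap) and abs(ch.x0 - last.x1) <= max_gap:
                field.append(ch)
                break
        else:
            # No existing field is close enough, so start a new one.
            fields.append([ch])
    return fields
```

Any other layout rule, such as reusing the text-line boxes returned by the detector, would fit the claim wording equally well.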

2. The method according to claim 1, characterized in that determining the entity category of the target field according to the entity category of each target character in the target field comprises: determining a first number of target characters belonging to the key entity and a second number of target characters belonging to the value entity within the target field, and determining the entity category of the target field based on the numerical relationship between the first number and the second number; or assigning the target characters in the target field that belong to the key entity to a key field whose entity category is the key entity, assigning the target characters that belong to the value entity to a value field whose entity category is the value entity, and updating the target field based on the key field and the value field to obtain the entity category of the target field.
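The two alternatives of claim 2 can be sketched as follows. This is a minimal illustration, assuming that the "numerical relationship" is a simple majority vote and that ties fall back to a configurable default. The function names and the tie-breaking rule are assumptions, since the claim specifies neither.

```python
from collections import Counter


def field_category_by_count(char_categories, tie="key"):
    """First alternative: compare the number of key characters with the
    number of value characters. The tie rule is an assumption."""
    counts = Counter(char_categories)
    first_number, second_number = counts["key"], counts["value"]
    if first_number > second_number:
        return "key"
    if second_number > first_number:
        return "value"
    return tie


def split_field(chars):
    """Second alternative: split a mixed field into a key field and a
    value field. `chars` is a list of (text, category) tuples."""
    key_field = [text for text, cat in chars if cat == "key"]
    value_field = [text for text, cat in chars if cat == "value"]
    return {"key": key_field, "value": value_field}
```

The first alternative keeps the field intact and tolerates a few misclassified characters, while the second suits fields where a key and its value were merged into one region.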

3. The method according to claim 1, characterized in that, after determining the target field in which each target character is located, the method further comprises: for each target field, determining the character position in the target document of each target character within the target field, and concatenating the character positions to obtain the field position of the target field in the target document; and extracting, from the target document, at least two target fields that have the key-value pair relationship as structured information comprises: taking the key-value pair relationship of the at least two target fields as the structured information, determining the information position of the structured information in the target document according to the field positions of the at least two target fields, and extracting the structured information from the information position in the target document.

4. The method according to claim 3, characterized in that the document information includes content information, the content information relating to the character content of each document character recognized from the target document; the method further comprises: concatenating the content information to obtain an information sequence, and recording the sequence index of each piece of content information in the information sequence; and determining the character position in the target document of each target character within the target field comprises: for each target character, obtaining the sequence index of the content information corresponding to the target character, and determining the character position of the target character in the target document based on the sequence index.
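The index bookkeeping of claims 3 and 4 can be illustrated with a short sketch. It assumes that the sequence index of a piece of content information is its start offset in the concatenated sequence and that a field position is represented as a `(start, end)` span. Both choices, along with the function names, are assumptions made for illustration.

```python
def build_sequence(contents):
    """Concatenate content information and record each piece's
    sequence index (assumed here to be its start offset)."""
    sequence = ""
    indices = []
    for content in contents:
        indices.append(len(sequence))
        sequence += content
    return sequence, indices


def field_position(char_ids, contents, indices):
    """Combine the character positions of a field into one span.
    `char_ids` are positions in `contents` of the field's characters."""
    spans = [(indices[i], indices[i] + len(contents[i])) for i in char_ids]
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    return start, end
```

With the span in hand, the structured information can be read back from the sequence by slicing, for example `sequence[start:end]`.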

5. The method according to claim 1, characterized in that obtaining the document information of the target document comprises: obtaining a document image of the target document, performing object detection on the document image to obtain the object position of each document object in the document image, and performing object recognition on the document object located at the object position to obtain the object content of the document object; and taking, as the document information of the target document, features of at least two modalities among the visual features of the document image, the positional layout features of the object position, and the semantic features of the object content; and inputting the document information into the feature representation network to obtain document features comprises: inputting the document information into the feature representation network to fuse the features of the at least two modalities represented by the document information to obtain multimodal fusion features, and taking the multimodal fusion features as the document features.

6. The method according to claim 1, characterized in that the entity classification model is pre-trained through the following steps: taking the sample information of an entity classification document and the entity category labeling results of each sample object in the entity classification document as a set of entity classification samples; and training an original classification model based on multiple sets of entity classification samples to obtain the entity classification model, wherein the original classification model includes an intermediate representation network corresponding to the feature representation network and an original classification network corresponding to the entity classification network.

7. The method according to claim 1, characterized in that the entity association model is pre-trained through the following steps: for the key objects belonging to the key entity and the value objects belonging to the value entity in an entity association document, taking the sample information of the entity association document and the entity association annotation results of each key object and each value object as a set of entity association samples; and training an original association model based on multiple sets of entity association samples to obtain the entity association model, wherein the entity association model includes an entity association network, and the original association model includes an intermediate representation network corresponding to the feature representation network and an original association network corresponding to the entity association network.
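One way to picture the entity association annotation results of claim 7 is as a binary matrix with one row per key object and one column per value object. The sketch below is an assumption about the annotation format, not the patent's stated encoding, and `association_labels` is a hypothetical helper.

```python
def association_labels(key_ids, value_ids, linked_pairs):
    """Build a key-by-value label matrix: 1 where the annotated pair
    forms a key-value relationship, 0 otherwise."""
    linked = set(linked_pairs)
    return [[1 if (k, v) in linked else 0 for v in value_ids] for k in key_ids]
```

A model trained on such a matrix would emit one score per key-value combination, which matches the claim's wording of a relationship "between each key object and each value object".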

8. The method according to claim 6 or 7, characterized in that the intermediate representation network is obtained by training an original representation network in a pre-trained model, and the pre-trained model includes at least one of a content masking model, an image content pairing model, and an image occlusion model.

9. The method according to claim 8, characterized in that it further comprises: recognizing each sample object in a first sample document, and encoding each piece of first sample content obtained by the recognition to obtain an encoding sequence; replacing some of the word encodings in the encoding sequence, and projecting the replaced encoding sequence to obtain a first semantic feature; inputting the first semantic feature, together with a first visual feature and a first layout feature of the first sample document, into the original representation network of the content masking model to obtain a first fusion feature; and predicting the replaced word encodings based on the first fusion feature, and adjusting the network parameters in the original representation network of the content masking model based on the obtained replacement prediction result to obtain the intermediate representation network.
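The replacement step of claim 9 resembles masked-token pre-training. A minimal sketch follows, assuming that replaced encodings are swapped for a single mask id and that 15% of positions are selected. The ratio, the mask id, and the function name are assumptions, as the claim only says that some encodings are replaced.

```python
import random


def mask_tokens(token_ids, mask_id, ratio=0.15, seed=0):
    """Replace a random subset of token ids with `mask_id` and return
    the masked sequence plus the original ids to be predicted."""
    if not token_ids:
        return [], {}
    rng = random.Random(seed)
    n = max(1, round(len(token_ids) * ratio))
    positions = sorted(rng.sample(range(len(token_ids)), n))
    masked = list(token_ids)
    labels = {}
    for p in positions:
        labels[p] = masked[p]  # prediction target for this position
        masked[p] = mask_id
    return masked, labels
```

The `labels` mapping provides the targets against which the replacement prediction result would be scored when the network parameters are adjusted.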

10. The method according to claim 8, characterized in that it further comprises: obtaining a second visual feature of the sample image of a second sample document and a second semantic feature of the second sample content obtained by recognizing the sample objects in the second sample document; forming multiple image content pairings based on the second visual features and second semantic features of multiple second sample documents, wherein the multiple image content pairings include correct pairings and incorrect pairings, the second visual feature and second semantic feature in a correct pairing correspond to the same second sample document, and the second visual feature and second semantic feature in an incorrect pairing correspond to different second sample documents; concatenating the correct pairings to obtain a correct pairing sequence, and inputting the correct pairing sequence and the second layout features of the second sample documents associated with the correct pairing sequence into the original representation network of the image content pairing model to obtain a second correct fusion feature corresponding to each correct pairing; classifying the correct pairing sequence as correct pairing or incorrect pairing based on each second correct fusion feature, and adjusting the network parameters in the original representation network of the image content pairing model based on the obtained first classification result to obtain the intermediate representation network; concatenating the incorrect pairings to obtain an incorrect pairing sequence, and inputting the incorrect pairing sequence and the second layout features of the second sample documents associated with the incorrect pairing sequence into the original representation network of the image content pairing model to obtain a second error fusion feature corresponding to each incorrect pairing; and classifying the incorrect pairing sequence as correct pairing or incorrect pairing based on each second error fusion feature, and adjusting the network parameters in the original representation network of the image content pairing model based on the obtained second classification result to obtain the intermediate representation network.
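The construction of correct and incorrect pairings in claim 10 can be sketched as below. The sketch assumes one visual feature and one semantic feature per sample document and builds each incorrect pairing by taking the semantic feature of the next document in the list. That rotation rule and the function name are assumptions, since the claim only requires that the two features come from different documents.

```python
def build_pairings(visual_feats, semantic_feats):
    """Return (correct, incorrect) lists of (visual, semantic, label)
    tuples, with label 1 for matching documents and 0 otherwise."""
    n = len(visual_feats)
    if n != len(semantic_feats):
        raise ValueError("need one semantic feature per visual feature")
    if n < 2:
        raise ValueError("need at least two documents for incorrect pairings")
    correct = [(visual_feats[i], semantic_feats[i], 1) for i in range(n)]
    # Rotate by one so every visual feature meets another document's text.
    incorrect = [(visual_feats[i], semantic_feats[(i + 1) % n], 0) for i in range(n)]
    return correct, incorrect
```

In practice the incorrect partner is often sampled at random. The rotation is used here only to keep the example deterministic.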

11. The method according to claim 8, characterized in that it further comprises: occluding a portion of the image regions in the sample image of a third sample document, and obtaining a third visual feature of the occluded sample image, wherein each occluded image region corresponds to a corresponding object in the third sample document; inputting the third visual feature, together with a third semantic feature and a third layout feature of the third sample document, into the original representation network of the image occlusion model to obtain a third fusion feature; predicting, based on the third fusion feature, whether the alignment object in the third sample document that is aligned with an occluded image region is occluded, or whether the alignment region in the sample image that is aligned with the corresponding object is occluded; and adjusting the network parameters in the original representation network of the image occlusion model based on the obtained occlusion prediction result to obtain the intermediate representation network.
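The occlusion targets of claim 11 can be pictured as one binary label per sample object. The sketch below assumes boxes in `(x0, y0, x1, y1)` form and treats an object as occluded when its box intersects the occluded region. Both the box format and the intersection rule are assumptions made for illustration.

```python
def occlusion_labels(object_boxes, occluded_box):
    """Label each object 1 if its box overlaps the occluded image
    region and 0 otherwise."""
    def intersects(a, b):
        # Strict inequalities: boxes that only touch do not overlap.
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    return [1 if intersects(box, occluded_box) else 0 for box in object_boxes]
```

These labels would serve as the ground truth against which the occlusion prediction result is compared.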

12. An information extraction device, characterized in that it comprises: an information model acquisition module, used to acquire document information of a target document, as well as a trained entity classification model and a trained entity association model, wherein the entity classification model includes a feature representation network and an entity classification network; an entity category determination module, used to input the document information into the feature representation network to obtain document features, and to input the document features into the entity classification network to obtain the entity category of each document object in the target document, wherein the entity category of a document object includes at least a key entity or a value entity; a key-value pair relationship determination module, used to input, from the document features, the key features corresponding to the key entity and the value features corresponding to the value entity into the entity association model to obtain the key-value pair relationship between each document object under the key entity and each document object under the value entity; and a structured information extraction module, used to extract, from the target document, at least two document objects that have the key-value pair relationship as structured information; wherein the document objects include document characters, and the device further comprises: a target field determination module, used to take the document characters under the key entity and the document characters under the value entity as target characters, obtain the position information of each target character, and determine, based on the position information of each target character, the target field in which each target character is located; and a field category determination module, used to determine, for each target field, the entity category of the target field according to the entity category of each target character in the target field, wherein the entity category of the target field includes the key entity or the value entity; the key-value pair relationship determination module is specifically used to input, from the document features, the key features corresponding to each target field under the key entity and the value features corresponding to each target field under the value entity into the entity association model to obtain the key-value pair relationship between each target field under the key entity and each target field under the value entity; and the structured information extraction module includes a structured information extraction unit, used to extract, from the target document, at least two target fields that have the key-value pair relationship as structured information.

13. An electronic device, characterized in that the electronic device comprises: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the information extraction method according to any one of claims 1-11.

14. A storage medium containing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, are used to perform the information extraction method according to any one of claims 1-11.