A knowledge distillation-based document image key information extraction method and device

By using a knowledge distillation-based approach, a teacher model is trained by combining textual, visual, and layout features, and then subjected to knowledge distillation to generate a student model. This solves the problems of poor performance and difficult deployment in extracting key information from document images, achieving efficient information extraction and low-complexity deployment.

CN119810856B · Active · Publication Date: 2026-06-26 · CHINA TELECOM CLOUD TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CLOUD TECH CO LTD
Filing Date: 2024-12-05
Publication Date: 2026-06-26


IPC: G06V30/41; G06V30/19; G06V30/18; G06V10/82; G06N3/096; G06N3/045

AI Tagging

Technology Topics

Feature extraction; Sample sequence

Technical Efficacy Phrases

Reduce size; Reduce computational complexity


AI Technical Summary

Technical Problem

Existing models that rely only on text features perform poorly in extracting key information from document images, and their large parameter counts place high demands on the deployment environment, so they cannot meet the requirements of practical application.

Method used

A knowledge distillation-based approach is adopted: an initial teacher model detects and recognizes the text and performs sequence labeling to obtain text position, text content, and sequence labeling information; text, visual, and layout features are extracted and combined into comprehensive features to train the teacher model; and a student model is then trained through knowledge distillation to reduce model complexity.

Benefits of technology

It improves the accuracy and robustness of key information extraction, while significantly reducing model complexity and resource consumption, meeting the needs of practical application.


Abstract

The application relates to the field of computer technology and discloses a document image key information extraction method and device based on knowledge distillation. The method comprises the following steps: an initial teacher model is used to detect and recognize a sample document image and perform sequence labeling, so that sample text position information, sample text content and sample sequence labeling information are obtained; feature extraction is performed based on these three kinds of information, so that sample text features, sample visual features and sample layout features are obtained; sample text sequence labels are obtained based on these three features; a sample training loss is generated based on the sample text position information, the sample text content and the sample text sequence labels, and the initial teacher model is trained on this loss to obtain a teacher model; knowledge distillation is performed on the teacher model to obtain a student model, which is used to extract key information from a target document image. The application maintains key information extraction accuracy while significantly reducing model complexity and resource consumption, meeting model deployment requirements.


Description

Technical Field

[0001] This invention relates to the field of computer technology, and specifically to a method and apparatus for extracting key information from document images based on knowledge distillation.

Background Technology

[0002] Document image key information extraction refers to the automatic extraction of specific information of interest to users from a given document image, and is one of the key tasks in the field of intelligent document analysis. It achieves the transformation from unstructured data to structured data by automatically extracting key information from document images, and is widely used in document image understanding, visual question answering, and many other fields.

[0003] With the development of deep learning technology, neural networks are commonly used to extract key information from document images. However, using only text features to extract key information results in poor extraction performance. Furthermore, the Transformer, Convolutional Neural Networks, and Graph Convolution techniques used in existing models significantly increase the number of model parameters, thereby raising the requirements for the deployment environment when the model is applied, making it difficult to deploy.

Summary of the Invention

[0004] In view of this, the present invention provides a method and apparatus for extracting key information from document images based on knowledge distillation, in order to solve the problem of poor model performance and inability to meet the needs of practical application.

[0005] In a first aspect, the present invention provides a method for extracting key information from document images based on knowledge distillation, the method comprising:

[0006] For any sample document image, the initial teacher model is used to detect, recognize, and sequence-label the sample document image to obtain the sample text location information, sample text content, and sample sequence labeling information of the sample document image;

[0007] Using the initial teacher model, feature extraction is performed based on the sample text location information, sample text content, and sample sequence annotation information of the sample document image to obtain the sample text features, sample visual features, and sample layout features of the sample document image.

[0008] Using the initial teacher model, sample text sequence labels are obtained based on sample text features, sample visual features, and sample layout features of sample document images;

[0009] Based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, a sample training loss is generated, and the initial teacher model is trained based on the sample training loss to obtain the teacher model.

[0010] Based on the initial student model, knowledge distillation is performed on the teacher model to obtain the student model, and key information is extracted from the target document image based on the student model.

[0011] The document image key information extraction method based on knowledge distillation provided in this invention can accurately detect and recognize the text position and content in document images by using an initial teacher model. Through sequence labeling, more detailed text structure information can be obtained. Then, based on the above information, feature extraction is performed to obtain sample text features, sample visual features, and sample layout features of the sample document image, which more comprehensively represent the content and structure of the document image. By integrating multiple features, the generated text sequence labels are more accurate, covering not only the text content but also the structure and positional relationships of the text in the document. The initial teacher model is trained based on the difference between the information extracted by the model and the actual information to obtain a teacher model. Knowledge distillation is then performed on the teacher model so that the student model inherits the key knowledge of the teacher model and maintains high performance. While the accuracy of key information extraction is ensured, the complexity and resource consumption of the model are significantly reduced, which meets the needs of model deployment.

[0012] In one optional implementation, using the initial teacher model to detect, recognize, and sequence-label any sample document image to obtain the sample text location information, sample text content, and sample sequence labeling information includes:

[0013] Using the preprocessing layer of the initial teacher model, and based on optical character recognition technology, the sample document image is detected and recognized to obtain the sample text location information and sample text content;

[0014] The sample text content is labeled using sequence labeling methods to obtain sample sequence labeling information.

[0015] The document image key information extraction method based on knowledge distillation provided in this invention can accurately detect and identify the text content and position in document images by using optical character recognition technology. Through sequence annotation, the structural information of the text in the document can be obtained, providing not only the text content, but also the text context and structure, providing rich information support for subsequent feature extraction and key information extraction.

[0016] In one optional implementation, using the initial teacher model to extract features based on the sample text content and sample sequence labeling information of the sample document image to obtain the sample text features of the sample document image includes:

[0017] Using the base training layer of the initial teacher model, the sample text content and sample sequence annotation information of the sample document image are segmented into words to obtain multiple sample sub-words;

[0018] For any sample sub-word, generate the word embedding vector and position embedding vector of the sample sub-word;

[0019] A multi-head self-attention mechanism is adopted to obtain the feature vector corresponding to the sample sub-word based on the word embedding vector and position embedding vector of the sample sub-word;

[0020] The feature vectors corresponding to all sample sub-words are used as the sample text features of the sample document image.

[0021] The document image key information extraction method based on knowledge distillation provided in this invention segments the sample text content and sample sequence annotation information of the sample document image to obtain multiple sample sub-words. The segmentation process divides the text into smaller units, which facilitates subsequent feature extraction. Each sample sub-word generates a word embedding vector and a position embedding vector, enabling the model to better understand the structure and order of the text. Through the multi-head self-attention mechanism, long-range dependencies and local contextual information in the text can be captured, thereby generating richer feature representations. The generated sample text features are richer and more detailed, improving the robustness and generalization ability of the model.
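The text-feature steps above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the sub-word split (WordPiece-style `##` pieces), the random word embeddings, the sinusoidal position embedding, and the omission of learned query/key/value projections are all simplifying assumptions, and the function names `embed` and `multi_head_self_attention` are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def embed(sub_words, dim, seed=0):
    """Return word embedding + position embedding for each sub-word."""
    rng = random.Random(seed)
    vocab = {}
    vectors = []
    for pos, word in enumerate(sub_words):
        if word not in vocab:
            # Random vectors stand in for a trained embedding table (assumption).
            vocab[word] = [rng.uniform(-1, 1) for _ in range(dim)]
        # Sinusoidal position embedding (assumption; the source does not name one).
        pos_emb = [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
                   else math.cos(pos / 10000 ** ((i - 1) / dim))
                   for i in range(dim)]
        vectors.append([a + b for a, b in zip(vocab[word], pos_emb)])
    return vectors

def multi_head_self_attention(x, num_heads):
    """Scaled dot-product self-attention per head, without learned projections."""
    dim = len(x[0])
    head_dim = dim // num_heads
    out = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        heads = [v[lo:hi] for v in x]          # this head's slice of every token
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                      for k in heads]
            weights = softmax(scores)
            ctx = [sum(w * v[d] for w, v in zip(weights, heads))
                   for d in range(head_dim)]
            out[i].extend(ctx)                  # concatenate head outputs
    return out

sub_words = ["tech", "##nical", "manager"]      # hypothetical sub-word split
features = multi_head_self_attention(embed(sub_words, dim=8), num_heads=2)
```

Each row of `features` is the feature vector of one sub-word; the list of all rows plays the role of the sample text features.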

[0022] In one optional implementation, using the initial teacher model to extract features based on the sample text location information of the sample document image to obtain the sample visual features and sample layout features of the sample document image includes:

[0023] The base training layer of the initial teacher model is used to extract features from the sample document images based on the convolutional neural network to obtain the sample image features;

[0024] The positional information of the sample text is normalized to obtain a normalized feature vector;

[0025] The normalized feature vector is divided into multiple windows, and a multi-head self-attention mechanism is used to process the multiple windows to obtain the first feature vector.

[0026] The multiple windows of the first feature vector are shifted, and a multi-head self-attention mechanism is used to process the shifted windows to obtain the second feature vector;

[0027] The second feature vector is normalized to obtain the third feature vector;

[0028] Alignment is performed based on the third feature vector and the sample text position information to obtain the sample visual features and sample layout features.

[0029] The document image key information extraction method based on knowledge distillation provided in this invention extracts features from sample document images using a convolutional neural network. The generated sample image features contain visual information of the document image. The sample text position information is normalized to ensure it is within a standardized range, facilitating subsequent feature processing and alignment. By dividing the normalized feature vector into multiple windows and employing a multi-head self-attention mechanism, local contextual information within each window can be effectively captured, generating a first feature vector. Then, the multiple windows of the first feature vector are shifted, and the multi-head self-attention mechanism is applied again to generate a second feature vector, further enriching the feature representation. After another normalization process, alignment is performed based on the third feature vector and the sample text position information to obtain sample visual features and sample layout features. The alignment process ensures the consistency and relevance of visual and layout features, improving the comprehensive representation capability of the features.
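As a rough illustration of the position normalization and window handling described above, the following sketch normalizes bounding boxes, splits them into windows, and then shifts the window boundaries. The 0 to 1000 target range, the window size, the cyclic shift, and all function names are assumptions; the source only states that the positions are normalized and the windows are shifted. The self-attention that would run inside each window is omitted here.

```python
def normalize_box(box, width, height, scale=1000):
    """Map pixel coordinates (x0, y0, x1, y1) into the range [0, scale]."""
    x0, y0, x1, y1 = box
    return (round(scale * x0 / width), round(scale * y0 / height),
            round(scale * x1 / width), round(scale * y1 / height))

def partition_windows(seq, window_size):
    """Split a sequence into consecutive, non-overlapping windows."""
    return [seq[i:i + window_size] for i in range(0, len(seq), window_size)]

def shift(seq, offset):
    """Cyclically shift a sequence so that window boundaries move by `offset`."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]

# Hypothetical OCR boxes on an 800 x 600 pixel page.
boxes = [(80, 60, 240, 120), (400, 60, 640, 120),
         (80, 300, 400, 360), (480, 300, 720, 360)]
normalized = [normalize_box(b, width=800, height=600) for b in boxes]

# First pass: attention would be applied inside each of these windows.
windows = partition_windows(normalized, window_size=2)
# Second pass: after shifting, neighbours from different windows share a window.
shifted_windows = partition_windows(shift(normalized, offset=1), window_size=2)
```

Shifting lets elements that sat on opposite sides of a window boundary in the first pass attend to each other in the second pass, which is one plausible reading of why the method applies attention twice.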

[0030] In one optional implementation, using the initial teacher model to obtain sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document image includes:

[0031] The sample text features, sample visual features, and sample layout features of the sample document image are concatenated to obtain the sample comprehensive features;

[0032] The base training layer of the initial teacher model is used to extract contextual information from the comprehensive features of the samples based on a bidirectional long short-term memory network to obtain the fourth feature vector;

[0033] Conditional random fields are used to generate sample text sequence labels based on the fourth feature vector.

[0034] The document image key information extraction method based on knowledge distillation provided in this invention obtains comprehensive sample features by concatenating sample text features, sample visual features, and sample layout features of sample document images. The multimodal feature fusion method can make full use of text, visual, and layout information to generate more comprehensive and richer feature representations. Contextual information is extracted from the comprehensive sample features through a bidirectional long short-term memory network. The generated fourth feature vector not only contains the features of a single word but also the information of its preceding and following context, making the feature representation more complete and accurate. Conditional random fields are used to generate sample text sequence labels to ensure that the generated label sequence has high coherence and accuracy.
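The fusion and decoding steps can be sketched as follows. This is an illustrative assumption rather than the patented model: the bidirectional LSTM is not implemented, the `emissions` matrix simply stands in for per-tag scores derived from the fourth feature vector, and the transition scores are hand-written. The decoding step is standard Viterbi decoding for a linear-chain conditional random field.

```python
def concat_features(text_feat, visual_feat, layout_feat):
    """Concatenate per-token text, visual, and layout vectors."""
    return [t + v + l for t, v, l in zip(text_feat, visual_feat, layout_feat)]

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for a linear-chain CRF."""
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_ptr, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            step_ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(step_ptr)
    best = max(range(n), key=lambda i: score[i])
    path = [best]
    for step_ptr in reversed(backpointers):
        path.append(step_ptr[path[-1]])
    return [tags[i] for i in reversed(path)]

tags = ["O", "B-POS", "I-POS"]
FORBIDDEN = -1e9
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O: an entity cannot start with I-POS
    [0.0, 0.0, 0.0],        # from B-POS
    [0.0, 0.0, 0.0],        # from I-POS
]
# Hypothetical per-token tag scores (stand-in for BiLSTM outputs).
emissions = [
    [2.0, 0.1, 0.3],  # "as"
    [0.2, 0.5, 1.0],  # "technical": a per-token argmax would pick I-POS
    [0.1, 0.2, 2.0],  # "manager"
]
decoded = viterbi_decode(emissions, transitions, tags)
print(decoded)
```

Choosing the best tag independently per token would yield O, I-POS, I-POS, which is not a valid BIO sequence; decoding over the whole sequence with transition scores is what keeps the label sequence coherent.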

[0035] In one optional implementation, the sample document image carries text content labels, text location labels, and sequence annotation labels;

[0036] Generating a sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image includes:

[0037] The first training loss is determined based on the difference between the sample text location information and the text location label;

[0038] The second training loss is determined based on the difference between the sample text content and the text content labels;

[0039] The third training loss is determined based on the difference between the sample text sequence labels and the sequence annotation labels;

[0040] The sum of the first training loss, the second training loss, and the third training loss is used as the sample training loss.

[0041] The document image key information extraction method based on knowledge distillation provided in this invention determines a first training loss based on the sample text location information and the text location labels, which helps the model learn and predict the location of text in the image more accurately. It determines a second training loss based on the sample text content and the text content labels, which helps the model learn and predict the text content more accurately. It determines a third training loss based on the sample text sequence labels and the sequence annotation labels, which helps the model learn and predict the labels of text sequences more accurately. The sum of these three losses is used as the sample training loss, and training is performed on this basis, which improves the overall performance of the model and enhances its accuracy and robustness.
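A small numeric sketch of the summed loss follows. The source specifies only that the three losses are added; the choice of smooth L1 for the location loss and cross-entropy for the content and sequence losses, along with the toy values, are assumptions made for illustration.

```python
import math

def smooth_l1(pred, target):
    """Assumed location loss: mean smooth-L1 distance between coordinates."""
    diffs = [abs(p - t) for p, t in zip(pred, target)]
    return sum(0.5 * d * d if d < 1 else d - 0.5 for d in diffs) / len(diffs)

def cross_entropy(probs, target_index):
    """Assumed classification loss: negative log-likelihood of the true class."""
    return -math.log(probs[target_index])

def sample_training_loss(loc_loss, content_loss, seq_loss):
    """Sum of the first, second, and third training losses."""
    return loc_loss + content_loss + seq_loss

# Toy predictions versus labels (hypothetical values).
first = smooth_l1([0.10, 0.20, 0.50, 0.40], [0.10, 0.25, 0.50, 0.35])
second = cross_entropy([0.7, 0.2, 0.1], target_index=0)
third = cross_entropy([0.1, 0.8, 0.1], target_index=1)
total = sample_training_loss(first, second, third)
```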

[0042] In one optional implementation, performing knowledge distillation on the teacher model based on the initial student model to obtain the student model includes:

[0043] Select the target distillation layer from the teacher model;

[0044] Input any sample document image into the teacher model and the initial student model, obtain the output difference between the teacher model and the initial student model at the target distillation layer, and generate the first distillation loss;

[0045] The initial student model is used to perform extraction on the sample document image to obtain the initial text location information, initial text content, and initial text sequence labels;

[0046] A second distillation loss is generated based on the sample document image, initial text location information, initial text content, and initial text sequence labels;

[0047] The initial student model is trained based on the first and second distillation losses to obtain the student model.

[0048] The document image key information extraction method based on knowledge distillation provided in this invention selects a target distillation layer from the teacher model. For each target distillation layer, the output difference between the teacher model and the initial student model at that layer is obtained to generate a first distillation loss, which helps the student model maintain consistency with the teacher model at the key layer, thereby inheriting the capabilities of the teacher model. Based on the initial student model, the initial text location information, initial text content, and initial text sequence labels are extracted, and a second distillation loss is generated based on this information, which helps the student model perform more closely to the teacher model in actual tasks. The two distillation losses are combined to train the initial student model to obtain the final student model, which enables the student model to be optimized in multiple aspects, improves overall performance, and significantly reduces the size and computational complexity of the model, making it suitable for deployment on resource-constrained devices.
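The two distillation losses might be combined as in the sketch below. Several points are assumptions: mean squared error as the measure of the output difference at the target distillation layer, equal teacher and student layer widths, a single scalar standing in for the second distillation loss, and the weighting factor `alpha`. The source says only that the student is trained based on both losses.

```python
def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_hidden, student_hidden, task_loss, alpha=0.5):
    """Combine the layer-matching loss with the student's task loss.

    `alpha` is a hypothetical weighting factor, not taken from the source.
    """
    first = mse(teacher_hidden, student_hidden)   # first distillation loss
    second = task_loss                            # second distillation loss
    return alpha * first + (1 - alpha) * second, first, second

# Toy outputs of the target distillation layer (hypothetical values).
teacher_hidden = [0.2, 0.4, 0.6]
student_hidden = [0.1, 0.4, 0.8]
total, first, second = distillation_loss(teacher_hidden, student_hidden,
                                         task_loss=0.9)
```

If the student's layer is narrower than the teacher's, a projection would be needed before the comparison; that case is not covered here.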

[0049] In a second aspect, the present invention provides a document image key information extraction device based on knowledge distillation, the device comprising:

[0050] The first determining module is used to detect, identify and sequence label any sample document image using the initial teacher model, and obtain the sample text location information, sample text content and sample sequence labeling information of the sample document image.

[0051] The extraction module is used to extract features from the sample document image based on the sample text location information, sample text content and sample sequence labeling information using the initial teacher model, so as to obtain the sample text features, sample visual features and sample layout features of the sample document image.

[0052] The second determining module is used to obtain sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document images using the initial teacher model;

[0053] The training module is used to generate a sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, and to train the initial teacher model based on the sample training loss to obtain the teacher model.

[0054] The distillation module is used to perform knowledge distillation on the teacher model based on the initial student model to obtain the student model, and to extract key information from the target document image based on the student model.

[0055] In a third aspect, the present invention provides a computer device, comprising: a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the document image key information extraction method based on knowledge distillation described in the first aspect or any corresponding embodiment thereof.

[0056] In a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the knowledge distillation-based document image key information extraction method described in the first aspect or any corresponding embodiment thereof.

Description of the Drawings

[0057] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0058] Figure 1 is a flowchart of a document image key information extraction method based on knowledge distillation according to an embodiment of the present invention;

[0059] Figure 2 is a flowchart of the initial teacher model according to an embodiment of the present invention;

[0060] Figure 3 is a flowchart of knowledge distillation according to an embodiment of the present invention;

[0061] Figure 4 is a structural block diagram of a document image key information extraction device based on knowledge distillation according to an embodiment of the present invention;

[0062] Figure 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present invention.

Detailed Implementation

[0063] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0064] With the development of deep learning technology, neural networks are commonly used to extract key information from document images. However, using only text features to extract key information results in poor extraction performance. Furthermore, existing models employ Transformers, convolutional neural networks, and graph convolution techniques, significantly increasing the number of model parameters and thus raising the requirements for deployment environments, making model deployment difficult. The document image key information extraction method based on knowledge distillation provided in this invention trains an initial teacher model by integrating text features, visual features, and layout features to obtain a teacher model that accurately extracts key information. Knowledge distillation is then performed on this teacher model so that the resulting student model inherits its key knowledge and maintains high performance. This significantly reduces model complexity and resource consumption while ensuring the accuracy of key information extraction, thus meeting the requirements for model deployment.

[0065] According to an embodiment of the present invention, a method for extracting key information from document images based on knowledge distillation is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0066] This embodiment provides a method for extracting key information from document images based on knowledge distillation, which can be used in electronic devices. Figure 1 is a flowchart of a document image key information extraction method based on knowledge distillation according to an embodiment of the present invention. As shown in Figure 1, the process includes the following steps:

[0067] Step S101: For any sample document image, the initial teacher model is used to detect, identify, and label the sample document image, obtaining the sample text location information, sample text content, and sample sequence labeling information. Specifically, a document image refers to a photograph or scan of a paper document or other written material stored in digital form, typically containing text, charts, tables, handwritten notes, etc. Key information in a document image refers to data or text that has significant value or meaning in a specific context. This information is usually an indispensable part of decision-making, analysis, recording, or other business processes. The specific content of the key information depends on the application scenario and requirements. Key information extraction from document images realizes the transformation of key information from unstructured to structured, facilitating direct manipulation, processing, and analysis of the key information. In this embodiment of the invention, the initial teacher model is first used to perform preliminary extraction on any sample document image, obtaining sample text location information, sample text content, and sample sequence labeling information, providing a foundation for model training and also serving as the basis for key information extraction.

[0068] Step S102: Using the initial teacher model, feature extraction is performed based on the sample text location information, sample text content, and sample sequence labeling information of the sample document image to obtain the sample text features, sample visual features, and sample layout features of the sample document image. Specifically, the initial teacher model performs deeper feature extraction on the information initially extracted above to obtain the sample text features, sample visual features, and sample layout features. The sample text features represent the text content; the sample visual features are visual features of the sample document image, such as color and texture; and the sample layout features describe the layout of the text in the sample document image, such as line spacing and column spacing. Related technologies use only text features to extract key information. That approach ignores both the visual information contained in the texture, color, and size of the image and the relative layout information between different text segments, resulting in poor extraction performance. This embodiment of the invention trains the model on a combination of multiple features to extract key information, thereby improving extraction accuracy.

[0069] Step S103: Using the initial teacher model, sample text sequence labels are obtained based on the sample text features, sample visual features, and sample layout features of the sample document image. Specifically, by integrating the sample text features, sample visual features, and sample layout features, the initial teacher model is used for further processing and prediction to obtain sample text sequence labels. These sample text sequence labels represent the specific attribute or category of each word in the document image in context, so that key entities and information in the document image can be accurately identified, thereby achieving efficient information extraction and processing.

[0070] Step S104: Based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, a sample training loss is generated. The initial teacher model is then trained based on this sample training loss to obtain the teacher model. Specifically, the sample document image carries real text location information, text content, and text sequence labels. The sample training loss is generated based on this real information and the information obtained from the initial teacher model, which quantifies the key information extraction performance of the initial teacher model. The initial teacher model is then trained based on this sample training loss, and algorithms such as backpropagation or gradient descent are used to optimize its model parameters. After parameter optimization, the next sample document image is used for training, i.e., steps S101 to S104 are repeated until the number of training iterations reaches a preset threshold or the model parameters converge, resulting in a teacher model capable of accurately extracting key information.
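The stopping rule in step S104 (stop when the iteration count reaches a preset threshold or when the parameters converge) can be illustrated with a toy gradient-descent loop. The one-dimensional quadratic loss, the learning rate, the tolerance, and the function name `train` are all hypothetical and merely stand in for the teacher model's actual parameters and loss.

```python
def train(initial_param, grad_fn, lr=0.1, max_iters=1000, tol=1e-6):
    """Gradient descent that stops at an iteration cap or on convergence."""
    param = initial_param
    for step in range(1, max_iters + 1):
        update = lr * grad_fn(param)
        param -= update
        if abs(update) < tol:       # parameters have converged
            return param, step
    return param, max_iters         # preset iteration threshold reached

# Toy loss (param - 3)^2 with gradient 2 * (param - 3).
param, steps = train(0.0, lambda p: 2 * (p - 3))
```

In this toy run the convergence test fires well before the iteration cap; with a real model either condition may end training first.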

[0071] Step S105: Based on the initial student model, knowledge distillation is performed on the teacher model to obtain the student model, and key information is extracted from the target document image based on the student model. Specifically, although the teacher model can accurately extract key information from the document image, its model structure includes multiple neural networks, resulting in high model complexity and making it unsuitable for practical application. Therefore, this embodiment of the invention uses an initial student model to perform knowledge distillation on the teacher model to learn the teacher model's ability to accurately extract key information, while also possessing a lightweight model structure that facilitates practical application.

[0072] The document image key information extraction method based on knowledge distillation provided in this invention can accurately detect and recognize the text position and content in document images by using an initial teacher model. Through sequence labeling, more detailed text structure information can be obtained. Then, based on the above information, feature extraction is performed to obtain sample text features, sample visual features, and sample layout features of the sample document image, which more comprehensively represent the content and structure of the document image. By integrating multiple features, the generated text sequence labels are more accurate, covering not only the text content but also the structure and positional relationships of the text in the document. The initial teacher model is trained based on the difference between the information extracted by the model and the actual information to obtain a teacher model. Knowledge distillation is then performed on the teacher model so that the student model inherits the key knowledge of the teacher model and maintains high performance. While the accuracy of key information extraction is ensured, the complexity and resource consumption of the model are significantly reduced, which meets the needs of model deployment.

[0073] This embodiment provides a method for extracting key information from document images based on knowledge distillation, which can be used in the aforementioned electronic devices. The method specifically includes the following steps:

[0074] Step S201: For any sample document image, the initial teacher model is used to detect, identify and sequence label the sample document image to obtain the sample text location information, sample text content and sample sequence label information of the sample document image.

[0075] Specifically, step S201 includes:

[0076] Step S2011 involves using the preprocessing layer of the initial teacher model. Based on optical character recognition (OCR) technology, the sample document image is detected and recognized to obtain the sample text location information and sample text content. Specifically, the preprocessing layer uses OCR technology to detect the coordinate position of each text region in the image, i.e., the sample text location information. Simultaneously, it can recognize the text string, i.e., the sample text content.

[0077] Step S2012: Annotate the sample text content using a sequence labeling method to obtain sample sequence labeling information. Specifically, taking the BIO sequence labeling method as an example, B (Beginning) marks the start of an entity, I (Inside) marks the interior of an entity, and O (Outside) marks a word that does not belong to any entity. The BIO method is used to annotate the text content in the sample document image to obtain the sample sequence labeling information. Assume the sample text content is "Xiaoming works at a technology company in region A as a technical manager," and the entity types to be labeled are defined as follows: PER (Person): person's name; LOC (Location): location; ORG (Organization): organization name; POS (Position): position. The tokens, listed in the word order of the original Chinese sentence, are labeled as follows: "Xiaoming" is a person's name, therefore labeled B-PER; "zai" ("at") does not belong to any entity, therefore labeled O; "region A" is a location, therefore labeled B-LOC; "de" (a possessive particle) does not belong to any entity, therefore labeled O; "yijia" ("a") does not belong to any entity, therefore labeled O; "technology" is the first part of an organization name, therefore labeled B-ORG; "company" is the continuation of an organization name, therefore labeled I-ORG; "works" does not belong to any entity, therefore labeled O; "serves as" does not belong to any entity, therefore labeled O; "technical" is the first part of a position, therefore labeled B-POS; "manager" is the continuation of a position, therefore labeled I-POS. Punctuation marks are all labeled O. Through sequence labeling, key entities in the text content can be accurately identified, providing a foundation for subsequent information extraction and processing.
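
The BIO scheme described in step S2012 can be illustrated with a minimal sketch that groups tagged tokens back into entities. The function name `bio_to_entities` and the English token glosses are illustrative assumptions and do not come from the embodiment.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an "I-" tag that does not match) closes the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Tokens are English glosses of the example sentence (assumption).
tokens = ["Xiaoming", "at", "region A", "of", "a", "technology",
          "company", "works", "serves as", "technical", "manager"]
tags = ["B-PER", "O", "B-LOC", "O", "O", "B-ORG",
        "I-ORG", "O", "O", "B-POS", "I-POS"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

For the example sentence, this yields one PER, one LOC, one ORG, and one POS entity, matching the labeling walked through above.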

[0078] Step S202: Using the initial teacher model, feature extraction is performed based on the sample text location information, sample text content, and sample sequence annotation information of the sample document image to obtain the sample text features, sample visual features, and sample layout features of the sample document image.

[0079] In some optional implementations, step S202 above uses an initial teacher model to extract features based on the sample text content and sample sequence annotation information of the sample document image, obtaining sample text features of the sample document image, including:

[0080] Step S2021 involves using the base training layer of the initial teacher model to segment the sample text content and sequence labeling information of the sample document image into multiple sample sub-words. Specifically, the base training layer consists of an encoder and a decoder. The encoder performs preliminary processing on the input sample text content and sequence labeling information, such as segmentation, to obtain multiple sample sub-words. Common segmentation methods include Byte Pair Encoding (BPE) and WordPiece.

[0081] Step S2022: For any sample sub-word, generate its word embedding vector and position embedding vector. Specifically, each sample sub-word is mapped to a fixed-dimensional vector space to obtain a word embedding vector, which captures the semantic information of the sample sub-word. In addition, to preserve the order of the sample sub-word within its sentence, each sample sub-word also has a position embedding vector, which helps the model understand the positional relationships among the sample sub-words.

[0082] Step S2023 employs a multi-head self-attention mechanism to obtain the feature vector corresponding to each sample sub-word based on its word embedding vector and position embedding vector. Specifically, the multi-head self-attention mechanism captures information from different aspects through multiple different attention heads. Each attention head focuses on different word relationships, thereby enhancing the model's expressive power. Through the multi-head self-attention mechanism, combining the word embedding vector and position embedding vector, a feature vector for each sample sub-word is generated. This feature vector not only contains the semantic information of the sample sub-word but also considers the positional relationships between sample sub-words.
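
As a rough illustration of steps S2022 and S2023, the sketch below adds a word embedding and a position embedding for each sub-word and passes the result through a simplified multi-head self-attention. Several points are assumptions made only for this sketch: the embeddings are combined by addition, the toy embedding tables and sub-words are invented, and the query, key, and value projections are identity mappings rather than learned weights.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(x[0])
    outputs = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(dim)])
    return outputs


def multi_head_self_attention(x: List[Vector], num_heads: int) -> List[Vector]:
    """Split each vector into equal slices, attend per head, then concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = slice(h * d_head, (h + 1) * d_head)
        head_outputs.append(attention_head([row[part] for row in x]))
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]


# Hypothetical toy tables; a real model would learn these values.
word_table = {
    "tech": [1.0, 0.0, 0.5, 0.0],
    "##nical": [0.8, 0.2, 0.4, 0.1],
    "manager": [0.0, 1.0, 0.0, 0.5],
}
position_table = [[0.0, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.0], [0.2, 0.1, 0.0, 0.0]]

sub_words = ["tech", "##nical", "manager"]
inputs = [[w + p for w, p in zip(word_table[t], position_table[i])]
          for i, t in enumerate(sub_words)]
features = multi_head_self_attention(inputs, num_heads=2)
```

Each row of `features` is the feature vector of one sub-word. Because every output is a weighted average over all sub-words, it mixes semantic and positional information from the whole sequence.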

[0083] Step S2024: The feature vectors corresponding to all sample sub-words are used as sample text features of the sample document image. Specifically, the feature vectors of all sample sub-words are combined to form the text feature representation of the entire sample document image, which serves as one of the bases for accurately extracting key information.

[0084] In some optional implementations, step S202 above uses an initial teacher model to extract features based on the sample text location information of the sample document image, obtaining the sample visual features and sample layout features of the sample document image, including:

[0085] Step S2025 involves using the base training layer of the initial teacher model to extract features from the sample document image based on a convolutional neural network, thereby obtaining sample image features. Specifically, the convolutional neural network in the encoder can extract local features from the sample document image through convolution operations to capture the visual information of the sample document image, serving as one of the bases for accurately extracting key information.

[0086] Step S2026 involves normalizing the positional information of the sample text to obtain a normalized feature vector. Specifically, the LN layer (Linear layer) in the encoder transforms the positional information of each text region in the sample document image, such as the coordinates of the top-left corner and bottom-right corner, to a fixed range to obtain a normalized feature vector. Normalization helps the model better handle data at different scales.
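
One simple way to bring box coordinates into a fixed range is to divide them by the image size, as sketched below. This is only an assumed illustration of the idea in step S2026: the embodiment performs the transformation with a layer inside the encoder, and the function name `normalize_box` and the example image size are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def normalize_box(box: Box, image_width: float, image_height: float) -> Box:
    """Scale (x1, y1, x2, y2) pixel coordinates into the range [0, 1]."""
    x1, y1, x2, y2 = box
    return (x1 / image_width, y1 / image_height,
            x2 / image_width, y2 / image_height)


# Top-left corner (120, 40), bottom-right corner (360, 80) in a 1200 x 800 image.
normalized = normalize_box((120, 40, 360, 80), image_width=1200, image_height=800)
print(normalized)
```

After scaling, text regions from images of different resolutions share the same coordinate range, which is the property the normalization step relies on.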

[0087] Step S2027 involves dividing the normalized feature vector into multiple windows and processing these windows using a multi-head self-attention mechanism to obtain the first feature vector. Specifically, the normalized feature vector is divided into multiple windows with a certain size and stride. The encoder uses the multiple attention heads of W-MSA (Window Multi-head Self-Attention) to capture the relationships between different windows. Each attention head focuses on different window relationships, thereby enhancing the model's expressive power. The feature vector processed by the multi-head self-attention mechanism not only contains local information about the windows but also considers the relationships between them.

[0088] Step S2028 involves shifting the multiple windows of the first feature vector and processing the shifted windows using a multi-head self-attention mechanism to obtain the second feature vector. Specifically, the windows of the first feature vector are shifted by a certain step size to achieve information fusion and capture more contextual information. The encoder uses SW-MSA (Shifted Window Multi-head Self-Attention) to enable information interaction between pixels of different text elements during window shifting. Since each text element is located at a different position, layout information is indirectly constructed during the shifting process, thus obtaining the second feature vector and capturing a wider range of contextual information.
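
The window partitioning of step S2027 and the window shifting of step S2028 can be sketched on a one-dimensional sequence as follows. This is a simplified illustration under stated assumptions: the windows are non-overlapping, the shift is cyclic (wrapping around the end of the sequence), and the attention computation inside each window is omitted. The function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_windows(seq: Sequence[T], window_size: int) -> List[List[T]]:
    """Split a sequence into consecutive, non-overlapping windows."""
    return [list(seq[i:i + window_size]) for i in range(0, len(seq), window_size)]


def cyclic_shift(seq: Sequence[T], shift: int) -> List[T]:
    """Rotate the sequence to the left by `shift` positions, wrapping around."""
    shift %= len(seq)
    return list(seq[shift:]) + list(seq[:shift])


positions = list(range(8))  # stand-ins for eight feature positions

# Regular windows: attention would be computed inside each window.
windows = partition_windows(positions, window_size=4)

# Shifted windows: the same partition applied after a shift of half a window,
# so elements that sat in different windows now share one.
shifted_windows = partition_windows(cyclic_shift(positions, shift=2), window_size=4)

print(windows)
print(shifted_windows)
```

In the shifted partition, positions 3 and 4 fall into the same window although they were separated before, which is how shifting lets information flow across the original window boundaries.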

[0089] Step S2029: Normalize the second feature vector to obtain the third feature vector. Specifically, the encoder normalizes the second feature vector again through an LN layer to make its feature representation more stable.

[0090] Step S20210: Perform alignment based on the third feature vector and the sample text location information to obtain the sample visual features and sample layout features. Specifically, the encoder uses ROI Align (Region of Interest Align). For each sample text region in the sample text location information, a Region of Interest (RoI) is defined by its bounding box (x1, y1, x2, y2). Each RoI is divided into several smaller sub-regions. For each sub-region, the coordinates of its center point are calculated, and the feature value at that center point is obtained from the third feature vector by bilinear interpolation. The center-point feature values of the sub-regions are pooled to generate a fixed-size feature vector for the RoI. The feature vectors of all sub-regions included in the sample text location information are used as the sample layout features.
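
The sub-region sampling described in step S20210 can be sketched as follows: one center point per sub-region, read from the feature map by bilinear interpolation. The sketch assumes a single-channel feature map stored as a nested list, box coordinates already expressed in feature-map units, and feature values located at integer coordinates. The function names and the 2 x 2 output size are hypothetical choices.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]


def bilinear(feature_map: FeatureMap, x: float, y: float) -> float:
    """Interpolate the feature value at a fractional (x, y) location."""
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature_map: FeatureMap, box: Tuple[float, float, float, float],
              out_h: int = 2, out_w: int = 2) -> List[float]:
    """Sample the center of each sub-region of the RoI and return a fixed-size vector."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_w
    bin_h = (y2 - y1) / out_h
    pooled = []
    for i in range(out_h):
        for j in range(out_w):
            center_x = x1 + (j + 0.5) * bin_w
            center_y = y1 + (i + 0.5) * bin_h
            pooled.append(bilinear(feature_map, center_x, center_y))
    return pooled


# Toy 4 x 4 feature map whose value at (x, y) is x + 10 * y.
feature_map = [[x + 10.0 * y for x in range(4)] for y in range(4)]
roi_features = roi_align(feature_map, box=(0.0, 0.0, 2.0, 2.0))
print(roi_features)
```

Whatever the size of the text region, the output has `out_h * out_w` values, which is what allows features of differently sized regions to be combined with the other features later.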

[0091] Step S203: Using the initial teacher model, sample text sequence labels are obtained based on the sample text features, sample visual features, and sample layout features of the sample document images.

[0092] Specifically, step S203 includes:

[0093] Step S2031 involves concatenating the sample text features, sample visual features, and sample layout features of the sample document image to obtain the comprehensive sample features. Specifically, concatenating these three features of the sample document image forms a multimodal comprehensive sample feature, which helps to extract key information more accurately.

[0094] Step S2032: Using the base training layer of the initial teacher model, contextual information is extracted from the comprehensive features of the samples based on a bidirectional long short-term memory network to obtain the fourth feature vector. Specifically, the decoder of the base training layer can capture contextual information from both directions through a bidirectional long short-term memory network (BiLSTM), thereby generating a richer feature representation, i.e., the fourth feature vector.

[0095] Step S2033: Using a Conditional Random Field (CRF), sample text sequence labels are generated based on the fourth feature vector. Specifically, the encoder generates sample text sequence labels from the fourth feature vector using a CRF. The fourth feature vector is input into the CRF layer, which generates the most probable label sequence, i.e., the sample text sequence labels, based on the dependency relationship between the feature vector and the labels.
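
Selecting the most probable label sequence in a CRF layer is commonly done with Viterbi decoding. The sketch below is an illustration under assumptions, not the embodiment's implementation: the emission scores stand in for values derived from the fourth feature vector, and the transition scores, label set, and function name are invented for the example.

```python
from typing import Dict, List, Tuple


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total score."""
    # best[t][label] is the best score of any path ending in `label` at step t.
    best = [{label: emissions[0][label] for label in labels}]
    back_pointers: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back_pointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-ORG", "I-ORG"]
# Hypothetical transition scores: "I-ORG" directly after "O" is heavily penalized.
transitions = {("O", "I-ORG"): -10.0, ("B-ORG", "I-ORG"): 1.0}
# Hypothetical per-token emission scores for three tokens.
emissions = [
    {"O": 2.0, "B-ORG": 0.5, "I-ORG": 0.0},
    {"O": 0.2, "B-ORG": 1.5, "I-ORG": 1.0},
    {"O": 0.1, "B-ORG": 0.3, "I-ORG": 1.4},
]
decoded = viterbi_decode(emissions, transitions, labels)
print(decoded)
```

The transition scores are what let the CRF account for dependencies between neighboring labels, for example that an I- tag should follow a B- or I- tag of the same type.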

[0096] In some optional implementations, sample text sequence labels provide a structured representation of key information in sample document images, making information extraction more efficient and accurate. These labels allow for the accurate identification and extraction of key information, i.e., data or text, from document images.

[0097] In some alternative implementations, Figure 2 is a flowchart of the initial teacher model according to an embodiment of the present invention. As shown in Figure 2, the sample document image is input into the initial teacher model. The preprocessing layer obtains the sample text location information and sample text content through optical character recognition. The sample text content is then labeled using a sequence labeling method to obtain the sample sequence labeling information. These three pieces of information are then input into the base training layer. The base training layer first performs word segmentation, then generates embedding vectors, and then obtains sample text features through a multi-head self-attention mechanism. Simultaneously, the base training layer obtains sample visual features through a convolutional neural network, and then aligns them through a combination of multiple neural networks to obtain sample layout features. This combination includes linear layers, window multi-head self-attention mechanisms, shifted window multi-head self-attention mechanisms, and linear layers. Then, the base training layer concatenates these three features and processes them through a bidirectional long short-term memory network and a conditional random field to obtain sample text sequence labels. Based on these sample text sequence labels, key information in the sample document image can be obtained.

[0098] Step S204: Based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, generate a sample training loss, and train the initial teacher model based on the sample training loss to obtain the teacher model. The sample document image carries text content labels, text location labels, and sequence annotation labels.

[0099] In some optional implementations, step S204 above generates a sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, including:

[0100] Step S2041: Based on the difference between the sample text location information and the text location labels, a first training loss is determined. Specifically, the sample text location information refers to the specific location information of the text in the sample document image as determined by the initial teacher model. For example, "work" is located at position (x2, y2) in the sample document image. The text location label is the true location of the text carried by the sample document image in the sample document image. For example, "work" is located at position (x1, y1) in the sample document image. By calculating the difference between the sample text location information and the text location labels, and quantifying this difference, for example, using mean squared error loss, the first training loss is obtained, which is used to evaluate the accuracy of the model in predicting text locations.

[0101] Step S2042: Determine the second training loss based on the difference between the sample text content and the text content label. Specifically, the sample text content refers to the text content in the sample document image determined by the initial teacher model. The text content label is the true content of the text carried by the sample document image within the sample document image. The difference between the sample text content and the text content label is calculated, for example, by using cross-entropy loss to quantify this difference, resulting in the second training loss, which is used to evaluate the model's accuracy in predicting text content.

[0102] Step S2043: Based on the difference between the sample text sequence labels and the sequence annotation labels, determine the third training loss. Specifically, the sample text sequence labels refer to the labels of the text sequences determined by the initial teacher model, representing the categories of each word in the text content. The sequence annotation labels are the true classifications of the text sequences carried by the sample document images. The difference between the sample text sequence labels and the sequence annotation labels is calculated, for example, by using CRF loss to quantify this difference, resulting in the third training loss, which is used to evaluate the accuracy of the model in predicting sequence labels.

[0103] Step S2044: The sum of the first training loss, the second training loss, and the third training loss is used as the sample training loss. Specifically, the three loss values are added together to obtain the total loss value, which is used to evaluate the overall performance of the model when processing the current sample. By comprehensively considering the losses from multiple aspects, the model is ensured to achieve good performance across different dimensions, thereby improving the overall training effect.
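
Steps S2041 to S2044 can be summarized in a small numeric sketch. The mean squared error and cross-entropy functions follow the examples named in the text. The CRF loss is replaced by a placeholder number, because computing it requires the full CRF layer; this value, like the boxes and probabilities below, is hypothetical.

```python
import math
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and labeled coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy_loss(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the labeled class."""
    return -math.log(probs[target_index])


# First training loss: predicted text box vs. text location label.
first_loss = mse_loss((10.0, 20.0, 110.0, 40.0), (12.0, 20.0, 108.0, 40.0))

# Second training loss: predicted character distribution vs. text content label.
second_loss = cross_entropy_loss([0.1, 0.7, 0.2], target_index=1)

# Third training loss: placeholder for the CRF loss (hypothetical value).
third_loss = 0.5

# Sample training loss: plain sum of the three losses, as in step S2044.
sample_training_loss = first_loss + second_loss + third_loss
print(first_loss, second_loss, sample_training_loss)
```

Because the three terms are summed without weights, a loss with a larger numeric scale dominates the total; the embodiment does not mention weighting, so none is applied here.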

[0104] Step S205: Based on the initial student model, knowledge distillation is performed on the teacher model to obtain the student model, and key information is extracted from the target document image based on the student model.

[0105] In some optional implementations, step S205 above, based on the initial student model, performs knowledge distillation on the teacher model to obtain the student model, including:

[0106] Step S2051: Select target distillation layers from the teacher model. Specifically, the teacher model is a pre-trained model with superior performance that guides the learning of the student model. Several target distillation layers are selected from the teacher model, and their outputs are used to guide the learning of the student model. Target distillation layers typically provide rich feature information. In this embodiment, the LN layer, W-MSA layer, SW-MSA layer, and LN layer in step S202 are treated as one overall structure, which serves as a target distillation layer; the ROI Align layer in step S20210 serves as a target distillation layer; and the BiLSTM layer and CRF layer in step S203 are treated as one overall structure, which serves as a target distillation layer. The above selection of target distillation layers is merely an example and is not intended to be limiting.

[0107] Step S2052: Input any sample document image into the teacher model and the initial student model, obtain the output difference between the teacher model and the initial student model at the target distillation layer, and generate the first distillation loss. Specifically, the model structure of the initial student model is consistent with that of the teacher model, but the number of layers in each network is less than that of the teacher model. Select any sample document image and input it into the teacher model and the initial student model respectively. At each target distillation layer, obtain the output results of the teacher model and the initial student model respectively, determine the difference between the output results, and calculate the loss using any loss function. The sum of the losses obtained from the three target distillation layers is used as the first distillation loss to evaluate the difference in model performance between the initial student model and the teacher model.
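
The first distillation loss of step S2052 can be sketched as a sum of per-layer losses. The text allows any loss function; mean squared error is chosen here as an assumption. The sketch also assumes that teacher and student outputs at each target distillation layer have the same shape, and the layer names and output values are invented for illustration.

```python
from typing import Dict, List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally sized output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def first_distillation_loss(teacher_outputs: Dict[str, List[float]],
                            student_outputs: Dict[str, List[float]]) -> float:
    """Sum the output differences over all target distillation layers."""
    return sum(mse(teacher_outputs[name], student_outputs[name])
               for name in teacher_outputs)


# Hypothetical outputs captured at the three target distillation layers.
teacher = {
    "window_attention_block": [1.0, 2.0],
    "roi_align": [0.5, 0.5],
    "bilstm_crf": [3.0, 1.0],
}
student = {
    "window_attention_block": [1.0, 1.0],
    "roi_align": [0.5, 0.0],
    "bilstm_crf": [2.0, 1.0],
}
loss = first_distillation_loss(teacher, student)
print(loss)
```

In practice the intermediate outputs would be captured during a forward pass of both models on the same sample document image.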

[0108] Step S2053: Extract initial text location information, initial text content, and initial text sequence labels from the sample document image based on the initial student model. Specifically, the sample document image is processed by the initial student model to obtain the initial text location information, initial text content, and initial text sequence labels.

[0109] Step S2054: Based on the sample document image, initial text location information, initial text content, and initial text sequence labels, a second distillation loss is generated. Specifically, since the sample document image carries true text content labels, text location labels, and sequence annotation labels, the losses between the outputs of the initial student model and the true values are calculated as in steps S2041 to S2043 above to obtain the second distillation loss, which is used to evaluate the performance of the initial student model in key information extraction.

[0110] Step S2055: Based on the first and second distillation losses, the initial student model is trained to obtain the student model. Specifically, the first and second distillation losses are added to obtain a comprehensive loss value. This comprehensive loss value is used to train the initial student model, adjusting the model parameters. After each adjustment, steps S2052 to S2055 are repeated to achieve better results in feature learning and task performance, resulting in the final student model. This student model not only learns the teacher model's ability to accurately extract key information, but also has a lighter model structure compared to the teacher model, saving computational resources and time, which is conducive to the practical application of the model.
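
The iterative training in step S2055 can be illustrated with a deliberately tiny stand-in: a student with a single scalar parameter `w` whose output is `w * x`. Everything in this sketch is an assumption made for illustration, including the squared-error form of both losses, the learning rate, and the numeric values; a real student model has many parameters and would be updated by an optimizer.

```python
from typing import List, Tuple


def distillation_losses(w: float, x: float, teacher_out: float, label: float) -> Tuple[float, float]:
    """Return (first, second) distillation losses for a scalar student w * x."""
    student_out = w * x
    first = (student_out - teacher_out) ** 2   # difference from the teacher output
    second = (student_out - label) ** 2        # difference from the true label
    return first, second


def train_student(w: float, x: float, teacher_out: float, label: float,
                  lr: float = 0.1, steps: int = 100) -> Tuple[float, List[float]]:
    """Repeat: compute both losses, add them, and adjust the parameter."""
    history: List[float] = []
    for _ in range(steps):
        first, second = distillation_losses(w, x, teacher_out, label)
        history.append(first + second)  # comprehensive loss value
        # Gradient of the summed loss with respect to w.
        grad = 2 * x * (w * x - teacher_out) + 2 * x * (w * x - label)
        w -= lr * grad
    return w, history


w_final, history = train_student(w=0.0, x=1.0, teacher_out=2.0, label=3.0)
print(w_final, history[0], history[-1])
```

With these toy values the parameter settles between the teacher output and the label, which shows how the summed loss balances imitation of the teacher against agreement with the annotations.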

[0111] In some alternative implementations, Figure 3 is a flowchart of knowledge distillation according to an embodiment of the present invention. As shown in Figure 3, sample document images are input into the teacher model and the initial student model. After each target distillation layer, the difference between the outputs of the teacher model and the initial student model is determined, and the loss is calculated. The losses corresponding to the three target distillation layers are summed as the first distillation loss. Based on the difference between the initial text sequence labels obtained by the initial student model and the sequence annotation labels carried by the sample document images, the second distillation loss is calculated. The initial student model is trained based on the sum of the two distillation losses to obtain the student model.

[0112] The document image key information extraction method based on knowledge distillation provided in this invention can accurately detect and identify the text position and content in document images by using an initial teacher model. Through sequence labeling, more detailed text structure information can be obtained. Then, based on the above information, feature extraction is performed to obtain sample text features, sample visual features, and sample layout features of the sample document image, which can more comprehensively represent the content and structure of the document image. By integrating multiple features, the generated text sequence labels are more accurate, not only including text content, but also reflecting the structure and positional relationship of the text in the document. The initial teacher model is trained based on the difference between the information extracted by the model and the actual information to obtain a teacher model. Knowledge distillation is then performed on the teacher model, which can inherit the key knowledge of the teacher model and maintain high performance. While ensuring the accuracy of key information extraction, the complexity and resource consumption of the model are significantly reduced, which can meet the needs of model implementation.

[0113] This embodiment also provides a document image key information extraction device based on knowledge distillation. This device is used to implement the above embodiments and preferred embodiments, and details already described will not be repeated. As used below, the term "module" can be a combination of software and / or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0114] This embodiment provides a document image key information extraction device based on knowledge distillation. As shown in Figure 4, the device includes:

[0115] The first determining module 401 is used to perform detection, recognition and sequence labeling on any sample document image using an initial teacher model, so as to obtain the sample text location information, sample text content and sample sequence labeling information of the sample document image.

[0116] The extraction module 402 is used to extract features based on the sample text location information, sample text content and sample sequence labeling information of the sample document image using the initial teacher model, so as to obtain the sample text features, sample visual features and sample layout features of the sample document image.

[0117] The second determining module 403 is used to obtain sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document image using the initial teacher model.

[0118] Training module 404 is used to generate sample training loss based on sample text location information, sample text content and sample text sequence labels of sample document images, and to train the initial teacher model based on the sample training loss to obtain the teacher model.

[0119] The distillation module 405 is used to perform knowledge distillation on the teacher model based on the initial student model to obtain the student model, and to extract key information from the target document image based on the student model.

[0120] In some alternative implementations, the first determining module 401 includes:

[0121] The recognition unit is used to detect and recognize the sample document image using the preprocessing layer of the initial teacher model, based on optical character recognition technology, to obtain the sample text location information and sample text content.

[0122] The annotation unit is used to annotate the sample text content based on the sequence annotation method to obtain sample sequence annotation information.

[0123] In some alternative implementations, the extraction module 402 includes:

[0124] The word segmentation unit is used to segment the sample text content and sample sequence annotation information of the sample document image using the base training layer of the initial teacher model, and obtain multiple sample sub-words.

[0125] The first generation unit is used to generate the word embedding vector and position embedding vector of any sample sub-word.

[0126] The first determining unit is used to obtain the feature vector corresponding to the sample sub-word by adopting a multi-head self-attention mechanism based on the word embedding vector and position embedding vector of the sample sub-word.

[0127] The second determining unit is used to take the feature vectors corresponding to all sample sub-words as sample text features of the sample document image.

[0128] In some alternative implementations, the extraction module 402 includes:

[0129] The first extraction unit is used to extract features from the sample document image based on the base training layer of the initial teacher model and a convolutional neural network to obtain the sample image features.

[0130] The first normalization unit is used to normalize the positional information of the sample text to obtain a normalized feature vector.

[0131] The partitioning unit is used to divide the normalized feature vector into multiple windows. A multi-head self-attention mechanism is used to process the multiple windows to obtain the first feature vector.

[0132] The shifting unit is used to shift multiple windows of the first feature vector, and a multi-head self-attention mechanism is used to process the shifted windows to obtain the second feature vector.

[0133] The second normalization unit is used to normalize the second feature vector to obtain the third feature vector.

[0134] The alignment unit is used to perform alignment based on the third feature vector and the sample text location information to obtain the sample visual features and sample layout features.

[0135] In some alternative implementations, the second determining module 403 includes:

[0136] The splicing unit is used to splice the sample text features, sample visual features, and sample layout features of the sample document image to obtain the sample comprehensive features.

[0137] The third determining unit is used to extract contextual information from the sample comprehensive features using the base training layer of the initial teacher model, based on a bidirectional long short-term memory network, to obtain the fourth feature vector.

[0138] The second generation unit is used to generate sample text sequence labels based on the fourth feature vector using a conditional random field.

[0139] In some optional implementations, the sample document images carry text content labels, text location labels, and sequence labeling labels; the training module 404 includes:

[0140] The fourth determining unit is used to determine the first training loss based on the difference between the sample text location information and the text location label.

[0141] The fifth determination unit is used to determine the second training loss based on the difference between the sample text content and the text content label.

[0142] The sixth determining unit is used to determine the third training loss based on the difference between the sample text sequence labels and the sequence annotation labels.

[0143] The seventh determining unit is used to take the sum of the first training loss, the second training loss, and the third training loss as the sample training loss.

[0144] In some alternative implementations, the distillation module 405 includes:

[0145] The selection unit is used to select the target distillation layer from the teacher model.

[0146] The third generation unit is used to input any sample document image into the teacher model and the initial student model, obtain the output difference between the teacher model and the initial student model in the target distillation layer, and generate the first distillation loss.

[0147] The second extraction unit is used to perform extraction on the sample document image based on the initial student model to obtain the initial text location information, initial text content, and initial text sequence labels.

[0148] The fourth generation unit is used to generate the second distillation loss based on the sample document image, initial text location information, initial text content, and initial text sequence labels.

[0149] The training unit is used to train the initial student model based on the first distillation loss and the second distillation loss to obtain the student model.

[0150] Further functional descriptions of the above modules and units are the same as those in the corresponding embodiments described above, and will not be repeated here.

[0151] In this embodiment, the document image key information extraction device based on knowledge distillation is presented in the form of functional units. Here, a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory that execute one or more software or firmware programs, and / or other devices that can provide the above functions.

[0152] An embodiment of the present invention also provides a computer device that includes the knowledge distillation-based document image key information extraction device shown in Figure 4.

[0153] Please refer to Figure 5, which is a schematic diagram of the structure of a computer device provided in an optional embodiment of the present invention. As shown in Figure 5, the computer device includes one or more processors 10, memory 20, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components communicate with each other via different buses and can be mounted on a common motherboard or otherwise installed as needed. The processors can process instructions executed within the computer device, including instructions stored in or on memory to display graphical information of a GUI on external input / output devices (such as display devices coupled to the interfaces). In some alternative implementations, multiple processors and / or multiple buses can be used with multiple memories and multiple memory modules, if desired. Similarly, multiple computer devices can be connected, each providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). Figure 5 takes one processor 10 as an example.

[0154] Processor 10 may be a central processing unit, a network processor, or a combination thereof. Processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The programmable logic device may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

[0155] The memory 20 stores instructions executable by at least one processor 10 to cause at least one processor 10 to perform the method shown in the above embodiments.

[0156] The memory 20 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the computer device. Furthermore, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory remotely located relative to the processor 10, and these remote memories may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0157] The memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk or solid-state drive; the memory 20 may also include a combination of the above types of memory.

[0158] The computer device also includes an input device 30 and an output device 40. The processor 10, memory 20, input device 30, and output device 40 can be connected via a bus or other means. Figure 5 takes connection via a bus as an example.

[0159] Input device 30 can receive input numerical or character information and generate key signal inputs related to user settings and function control of the computer device. Examples include a touchscreen, keypad, mouse, trackpad, touchpad, joystick, one or more mouse buttons, and trackball. Output device 40 may include display devices, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors). The aforementioned display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative embodiments, the display device may be a touchscreen.

[0160] This invention also provides a computer-readable storage medium. The methods described above according to embodiments of the invention can be implemented in hardware or firmware, or implemented as computer code that can be recorded on a storage medium, or implemented as computer code downloaded via a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and then stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code, which, when accessed and executed by the computer, processor, or hardware, implements the methods shown in the above embodiments.

[0161] A portion of this invention can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to the invention through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.

[0162] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.

Claims

1. A method for extracting key information from document images based on knowledge distillation, characterized in that the method includes:
for any sample document image, using an initial teacher model to perform detection, recognition, and sequence labeling on the sample document image to obtain sample text location information, sample text content, and sample sequence labeling information of the sample document image;
using the initial teacher model to perform feature extraction based on the sample text location information, sample text content, and sample sequence labeling information of the sample document image to obtain sample text features, sample visual features, and sample layout features of the sample document image;
using the initial teacher model to obtain sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document image;
generating a sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, and training the initial teacher model based on the sample training loss to obtain a teacher model;
performing knowledge distillation on the teacher model based on an initial student model to obtain a student model, and extracting key information from a target document image based on the student model;
wherein using the initial teacher model to perform feature extraction based on the sample text location information, sample text content, and sample sequence labeling information of the sample document image to obtain the sample text features, sample visual features, and sample layout features of the sample document image includes:
using the base training layer of the initial teacher model to segment the sample text content and sample sequence labeling information of the sample document image into multiple sample sub-words;
for any sample sub-word, generating a word embedding vector and a position embedding vector of the sample sub-word;
using a multi-head self-attention mechanism to obtain the feature vector corresponding to the sample sub-word based on the word embedding vector and position embedding vector of the sample sub-word;
using the feature vectors corresponding to all sample sub-words as the sample text features of the sample document image;
using the base training layer of the initial teacher model to perform feature extraction on the sample document image based on a convolutional neural network to obtain the sample visual features;
normalizing the sample text location information to obtain a normalized feature vector;
dividing the normalized feature vector into multiple windows, and processing the multiple windows with a multi-head self-attention mechanism to obtain a first feature vector;
shifting the multiple windows of the first feature vector, and processing the shifted windows with a multi-head self-attention mechanism to obtain a second feature vector;
normalizing the second feature vector to obtain a third feature vector;
aligning the third feature vector with the sample text location information to obtain the sample layout features.
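
For illustration only, and not as part of the claims, the window division and window shifting steps recited above could be sketched as follows. The sketch operates on a flat token sequence, uses a cyclic shift to move the window boundaries, and omits the multi-head self-attention applied inside each window; the function names, the window size, and the cyclic form of the shift are all assumptions rather than details stated in the source.

```python
def partition(seq, window):
    """Split a sequence into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window size must be positive")
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def cyclic_shift(seq, offset):
    """Rotate the sequence left by `offset` so window boundaries move."""
    if not seq:
        return []
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


if __name__ == "__main__":
    tokens = list(range(8))                 # stand-in for normalized layout tokens
    first_pass = partition(tokens, 4)       # windows for the first attention pass
    second_pass = partition(cyclic_shift(tokens, 2), 4)  # shifted windows
    print(first_pass)
    print(second_pass)
```

Shifting by half a window lets tokens that sat on opposite sides of a window boundary in the first pass share a window in the second pass, which is the usual motivation for a second, shifted attention step.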

2. The method according to claim 1, characterized in that, for any sample document image, using the initial teacher model to perform detection, recognition, and sequence labeling on the sample document image to obtain the sample text location information, sample text content, and sample sequence labeling information includes:
using the preprocessing layer of the initial teacher model to detect and recognize the sample document image based on optical character recognition technology to obtain the sample text location information and the sample text content;
annotating the sample text content using a sequence labeling method to obtain the sample sequence labeling information.

3. The method according to claim 1, characterized in that using the initial teacher model to obtain the sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document image includes:
concatenating the sample text features, sample visual features, and sample layout features of the sample document image to obtain sample comprehensive features;
using the base training layer of the initial teacher model to extract contextual information from the sample comprehensive features based on a bidirectional long short-term memory network to obtain a fourth feature vector;
generating the sample text sequence labels using a conditional random field based on the fourth feature vector.
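
For illustration only, and not as part of the claims, the feature concatenation and the conditional random field decoding in claim 3 could be sketched as follows. The bidirectional long short-term memory network is omitted; the sketch assumes its output has already been projected to per-token tag scores (`emissions`), and it assumes Viterbi decoding as the way the conditional random field selects labels. All function and parameter names are hypothetical.

```python
def concat_features(text_feats, visual_feats, layout_feats):
    """Concatenate the three per-token feature vectors into one vector each."""
    return [t + v + l for t, v, l in zip(text_feats, visual_feats, layout_feats)]


def viterbi(emissions, transitions):
    """Return the highest-scoring tag index path.

    emissions[t][j]: score of tag j at token position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        step_ptr, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            step_ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backptr.append(step_ptr)
        score = new_score
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]   # toy scores, 2 tags
    transitions = [[0.0, -1.0], [-1.0, 0.0]]           # penalize tag changes
    print(viterbi(emissions, transitions))
```

The transition scores are what distinguish this from picking the best tag per token independently: they let the decoder penalize implausible tag sequences.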

4. The method according to claim 1, characterized in that the sample document image carries a text content label, a text location label, and a sequence annotation label, and generating the sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image includes:
determining a first training loss based on the difference between the sample text location information and the text location label;
determining a second training loss based on the difference between the sample text content and the text content label;
determining a third training loss based on the difference between the sample text sequence labels and the sequence annotation label;
using the sum of the first training loss, the second training loss, and the third training loss as the sample training loss.
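
For illustration only, and not as part of the claims, the three-term sample training loss in claim 4 could be sketched as follows. The claim only requires that each term reflect a difference between a prediction and its label; the specific measures used here (mean absolute coordinate error and mismatch rate) and all function names are assumptions, and an actual training setup would likely use differentiable losses instead.

```python
def box_loss(pred_boxes, true_boxes):
    """Mean absolute coordinate error over all boxes (assumed measure)."""
    diffs = [abs(p - t)
             for pb, tb in zip(pred_boxes, true_boxes)
             for p, t in zip(pb, tb)]
    return sum(diffs) / len(diffs)


def mismatch_rate(pred, true):
    """Fraction of positions where prediction and label differ (assumed measure)."""
    return sum(p != t for p, t in zip(pred, true)) / len(true)


def sample_training_loss(pred_boxes, true_boxes,
                         pred_texts, true_texts,
                         pred_tags, true_tags):
    """Sum of the first, second, and third training losses."""
    first = box_loss(pred_boxes, true_boxes)        # text location difference
    second = mismatch_rate(pred_texts, true_texts)  # text content difference
    third = mismatch_rate(pred_tags, true_tags)     # sequence label difference
    return first + second + third


if __name__ == "__main__":
    # Toy example: one box slightly off, texts correct, one tag wrong.
    print(sample_training_loss(
        [(0, 0, 10, 10)], [(0, 0, 10, 12)],
        ["Total", "42"], ["Total", "42"],
        ["B-KEY", "O"], ["B-KEY", "B-VAL"],
    ))
```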

5. The method according to claim 1, characterized in that performing knowledge distillation on the teacher model based on the initial student model to obtain the student model includes:
selecting a target distillation layer from the teacher model;
inputting any sample document image into the teacher model and the initial student model, obtaining the output difference between the teacher model and the initial student model at the target distillation layer, and generating a first distillation loss;
performing extraction on the sample document image based on the initial student model to obtain initial text location information, initial text content, and initial text sequence labels;
generating a second distillation loss based on the sample document image, the initial text location information, the initial text content, and the initial text sequence labels;
training the initial student model based on the first distillation loss and the second distillation loss to obtain the student model.

6. A document image key information extraction device based on knowledge distillation, characterized in that the device includes:
a first determining module, used to perform detection, recognition, and sequence labeling on any sample document image using an initial teacher model to obtain the sample text location information, sample text content, and sample sequence labeling information of the sample document image;
an extraction module, used to perform feature extraction based on the sample text location information, sample text content, and sample sequence labeling information of the sample document image using the initial teacher model to obtain the sample text features, sample visual features, and sample layout features of the sample document image;
a second determining module, used to obtain sample text sequence labels based on the sample text features, sample visual features, and sample layout features of the sample document image using the initial teacher model;
a training module, used to generate a sample training loss based on the sample text location information, sample text content, and sample text sequence labels of the sample document image, and to train the initial teacher model based on the sample training loss to obtain a teacher model;
a distillation module, used to perform knowledge distillation on the teacher model based on an initial student model to obtain a student model, and to extract key information from a target document image based on the student model;
wherein the extraction module is specifically used for:
using the base training layer of the initial teacher model to segment the sample text content and sample sequence labeling information of the sample document image into multiple sample sub-words;
for any sample sub-word, generating a word embedding vector and a position embedding vector of the sample sub-word;
using a multi-head self-attention mechanism to obtain the feature vector corresponding to the sample sub-word based on the word embedding vector and position embedding vector of the sample sub-word;
using the feature vectors corresponding to all sample sub-words as the sample text features of the sample document image;
using the base training layer of the initial teacher model to perform feature extraction on the sample document image based on a convolutional neural network to obtain the sample visual features;
normalizing the sample text location information to obtain a normalized feature vector;
dividing the normalized feature vector into multiple windows, and processing the multiple windows with a multi-head self-attention mechanism to obtain a first feature vector;
shifting the multiple windows of the first feature vector, and processing the shifted windows with a multi-head self-attention mechanism to obtain a second feature vector;
normalizing the second feature vector to obtain a third feature vector;
aligning the third feature vector with the sample text location information to obtain the sample layout features.

7. A computer device, characterized in that it includes a memory and a processor that are connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the document image key information extraction method based on knowledge distillation as described in any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to execute the document image key information extraction method based on knowledge distillation as described in any one of claims 1 to 5.