A method, system, medium, device and terminal for extracting identification documents.

By combining an improved Transformer structure with a ResNet backbone, a Deformable DETR encoder, and perspective transformation, the method extracts the edges and vertices of ID card images accurately. It is suitable for embedded devices and extracts ID card images efficiently.

CN116543409B · Active · Publication Date: 2026-06-30 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2023-03-29
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately extract edges and vertices from document images, resulting in residual background content and making them unsuitable for efficient computing on embedded devices.

Method used

By employing an improved Transformer architecture and combining ResNet and Deformable DETR encoders, the edge segments and vertices of the document are predicted through multi-scale feature extraction and attention mechanisms, and the complete document image is obtained by using perspective transformation.

Benefits of technology

It achieves efficient and accurate extraction of document images, is suitable for embedded devices, simplifies the detection process, and obtains regular rectangular document images.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of image segmentation technology and discloses a method, system, medium, device, and terminal for extracting identification documents. The system uses an embedded device to acquire multispectral images of the identification document. It models the edge segments of the document image based on their straight-line geometric property and infers them from the contextual information of the global image. It extracts global image features using a ResNet network and then encodes them using a Deformable DETR encoder. In the first decoding stage, it predicts the document edge segments using an attention mechanism together with learnable line segment queries and position queries. In the second decoding stage, it compares the correlation between line segment features and image features to predict the relative order between line segments, and obtains the complete edges, vertices, and bounding boxes of the document image through perspective transformation. This invention uses bipartite matching to predict line segments, avoiding pre- and post-processing, simplifying the detection pipeline, and achieving true end-to-end processing.

Description

Technical Field

[0001] This invention belongs to the field of image segmentation technology, and in particular relates to a method, system, medium, device and terminal for extracting identity document targets.

Background Technology

[0002] Currently, with the acceleration of globalization, the flow of people both domestically and internationally, as well as cross-regional exchanges, is becoming increasingly frequent. This massive population movement is accompanied by the high-frequency use of identification documents. Identification documents serve as a means of identifying each citizen, both domestically and internationally, proving their identity, experience, or authority. In different countries, at different times, and in different environments, identification documents vary, but most record the holder's photograph, name, date of birth, address, and other personal information.

[0003] After acquiring multispectral images of the document, there are usually background images in the images. The complex and diverse backgrounds can interfere with the user's observation of the text and details in the document. The foreground document part, which contains the real identity information, is the valuable part and should be saved in the document information database.

[0004] The purpose of document image target extraction is to process multispectral document images, identify the document portion from the multispectral image, detect the document's edges and bounding boxes, remove complex and useless background parts, and then stretch and align the image to linearly transform the document portion into a rectangular image with regular borders. Furthermore, as a midstream task in the field of computer vision, document target extraction can also lay the foundation for downstream tasks such as optical character recognition and authenticity verification of documents.

[0005] Accurately and efficiently extracting the document area from the entire image is a pressing problem that needs to be solved. However, existing technologies, such as the object detection method TOOD [C. Feng, Y. Zhong, Y. Gao, et al. TOOD: Task-aligned one-stage object detection[C]. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2021: 3490-3499.], only detect the position of the document object in the image. Therefore, they can only obtain the bounding box and cannot accurately obtain the edge of the document. The target extracted based on the bounding box still contains some background content.

[0006] Instance segmentation models, such as SOLOv2 [X. Wang, R. Zhang, T. Kong, et al. SOLOv2: Dynamic and fast instance segmentation[J]. Advances in Neural Information Processing Systems, 2020, 33: 17721-17732.], can obtain a mask, but the mask edges appear jagged and the corners are rounded, making it difficult to accurately infer the true vertices of the document from the mask. The edges of the target extracted from the mask are not complete straight lines, retain jagged background residue, and do not allow the document to be placed in a horizontally aligned state.

[0007] Technically, most methods heavily rely on manually pre-defined candidate regions or dense center points, leading to significant redundancy in the prediction results. The steps of pre-dividing candidate regions and proposing dense center points are extremely time-consuming. This also necessitates the use of non-maximum suppression in post-processing, which embedded devices' neural network processors (NPUs) do not support, meaning they cannot be practically implemented in devices.

[0008] The existing methods have the following performance problems and shortcomings:

[0009] (1) Existing technology can only detect the position of the document target in the image and can only obtain the bounding box. It cannot accurately obtain the edge of the document. The target extracted based on the bounding box still has some background content.

[0010] (2) The edges of the mask obtained by the existing instance segmentation model are jagged with unevenness and the corners are rounded, so it is difficult to accurately infer the true vertex of the document based on the mask.

[0011] (3) The edge of the target extracted by the existing instance segmentation model based on the mask is not a complete straight line, but has a jagged background and cannot place the document in a horizontally aligned state.

[0012] (4) Although current methods achieve fairly high performance metrics on large benchmark datasets such as COCO [Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]. In: European Conference on Computer Vision (ECCV), 2014: 740-755.] and ImageNet [Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009: 248-255.], they are not well suited to extracting ID card targets. The detection and segmentation results of the current best-performing algorithms trained on ID card data show that object detection can obtain object bounding boxes but cannot obtain accurate contours. The edges produced by instance segmentation are uneven and jagged. In particular, when the ID card is held in the hand, the hand partly occludes the edges of the ID card, the edge distortion of the detection results becomes more serious, and the true vertices cannot be determined.

Summary of the Invention

[0013] To address the problems existing in the prior art, the present invention provides a method, system, medium, device and terminal for extracting targets from documents, and particularly relates to a method, system, medium, device and terminal for extracting targets from multispectral images of general documents based on an improved Transformer structure.

[0014] This invention is implemented as follows: a document target extraction method, comprising: the system acquiring multispectral images of the document using an embedded device; modeling the document image edge segments using the geometric property that the line segments are straight lines, and inferring by combining the context information of the global image; extracting global image features using ResNet, and then encoding them using a Deformable DETR encoder; in the first-stage decoding process, predicting the document edge segments using an attention mechanism and learnable line segment queries and position queries; in the second-stage decoding process, comparing the correlation between line segment features and image features, predicting the relative order between line segments, and then obtaining the complete edges, vertices, and bounding boxes of the document image through perspective transformation.

[0015] Furthermore, the document target extraction method also includes: inputting the document image into a ResNet network; using the ResNet network to extract features from the global image to obtain multi-scale features; using a Deformable DETR encoder for feature enhancement and aggregation to obtain encoded features; the decoding part is divided into two stages. In the first stage, a multi-scale line segment decoder is used to decode and retrieve line segment features; the endpoints and confidence scores of each line segment are estimated through a line segment prediction module, and line segments with confidence scores higher than a threshold are considered as edge lines of the document; in the second stage, an edge decoder is used to compare the line segment features and guiding features to predict the relative arrangement order between edge lines, and the intersection points of adjacent edge lines are calculated sequentially to infer complete vertices and bounding boxes; perspective transformation is performed based on the estimated document edges and bounding boxes to complete the image extraction of the document portion.

[0016] Furthermore, global image feature extraction includes:

[0017] A ResNet network pre-trained in the Caffe framework is used as the backbone network to extract features from images. The standard convolution kernels in layers 3 to 5 of the ResNet network are replaced with DCNv2 convolution kernels. The input is a three-channel RGB ID card image whose longer side (height or width) is scaled to 1024, with the other side scaled to preserve the aspect ratio. After passing through the ResNet network, feature maps at multiple scales are extracted. The height and width of each feature layer are half those of the layer before it.

[0018] ;

[0019] For the features extracted from the last 3 layers of ResNet, 1×1 convolutions are used to reduce the number of channels; a 3×3 convolution with a stride of 2 is then used to downsample the last layer of features, resulting in 4 layers of features that serve as input to the encoder.
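The sizing rules in [0017] and [0019] can be illustrated with a small arithmetic sketch. The function names and the strides (8, 16, 32, 64) are assumptions for illustration; the text states only that each level halves the previous one and that a fourth level comes from a stride-2 convolution.

```python
import math


def resize_dims(height, width, long_side=1024):
    """Scale so the longer side equals long_side, preserving the aspect ratio."""
    scale = long_side / max(height, width)
    return round(height * scale), round(width * scale)


def pyramid_shapes(height, width, strides=(8, 16, 32, 64)):
    """Spatial size of each feature level. The strides are an assumption:
    three backbone levels plus one extra stride-2 downsampling."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


h, w = resize_dims(600, 800)
print((h, w))
print(pyramid_shapes(h, w))
```

Each level in the printed list has half the height and width of the one before it, which is the property the text describes.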

[0020] Furthermore, feature encoding based on the Deformable DETR structure includes:

[0021] The encoder portion of the Deformable Transformer structure is used as the neck network. The neck network enhances and aggregates the multi-scale features extracted by the backbone network, resulting in encoded multi-scale features. The Deformable Transformer encoder learns and encodes global features, enabling the document features to be distinguished from background features. Deformable DETR collects the feature maps of different scales extracted by the ResNet backbone, then fuses and encodes them, expanding the single-layer features of the Transformer into multi-scale features.

[0022] The Deformable DETR encoder consists of 6 Deformable DETR encoding layers. Each layer includes a multi-scale deformable attention mechanism and a feedforward neural network. During encoding, layer normalization is used to normalize the elements of the features. The encoding calculation is shown in the following formula:

[0023] ;

[0024] The input and output feature maps have the same resolution, and the features are encoded element by element. The positional encoding has the same size as the feature map, and the features are flattened into a one-dimensional vector when they are input to the first layer.

[0025] Furthermore, the multi-scale line segment decoder includes:

[0026] The encoder-encoded features are processed using a multi-scale decoder at three different scales. Decoding is performed by comparing the features obtained from line segment queries and encoder encoding. By analyzing the similarities between them, the line segment features in the document image are obtained through layer-by-layer decoding.

[0027] The line segment decoder uses a stack of 3 adjusted DETR decoder layers. Given query elements and feature elements as input, the position query embedding and the line segment query embedding are stacked to serve as the input query Q. Both embeddings are obtained by embedding the T queries into the encoding dimension of 256, and both are initially randomly generated trainable parameters. The input key K is obtained by combining the positional encoding with the i-th layer of encoded features, and the i-th layer of encoded features is used directly as the input value V. The encoded features have three different scales, which are fed respectively to the cross-attention mechanisms of the 3 line segment decoding layers.

[0028] The output consists of position feature query results and line segment feature query results. These results serve as the input to the next layer of the line segment decoder, so the outputs are likewise referred to as the position query and the line segment query. The update process is represented as follows:

[0029] ;

[0030] ;

[0031] In these formulas, one operator denotes the cross-attention mechanism and the other denotes the self-attention mechanism.

[0032] After each attention module, layer normalization is applied. During line segment decoding, the number of line segment feature queries is fixed, and the attention mechanism decodes the corresponding number of line segment features. The position query is implicitly updated during line segment decoding. Specifically:

[0033] (1) Query the number of line segments

[0034] There are T queries, and each query is equivalent to checking whether a corresponding line segment exists at a different location in the image. If a suitable answer exists, the result is returned, so a fixed-size set of T query results is output. Each complete document image includes 4 obvious edge line segments, and T is set to be significantly larger than the number of edges of the document; preferably, T is set to 30.

[0035] (2) Line segment query and position query

[0036] Queries are used to extract from the encoded features the line segment features that the model is interested in. Position queries and line segment queries together give a joint representation of the queried line segment features. These queries are updated layer by layer in the three-layer model, and the final queried line segment features are stored in embedding vectors. Both line segment queries and position queries are embedding vectors with dimension [missing information].

[0037] During forward propagation, line segment queries are updated at each layer via an attention mechanism, while position queries are updated differently. A three-layer feedforward network is applied to the predicted cluster query results, resulting in a vector that serves as the bias for the position queries. This vector is then superimposed on the original position query results and used as the input for the position queries in the next layer. Position queries are updated during prediction as the network computes forward at each layer, with attention positions being dynamically and implicitly updated during prediction.

[0038] (3) Attention mechanism

[0039] Multi-head attention mechanism is used to adaptively aggregate the content of the key, allowing the network to pay attention to information from multiple different representation subspaces and different locations at the same time; by linearly splitting and stacking the query vector Q, key vector K, and value vector V, it achieves the effect of using multiple convolutional kernels in CNN, enabling the model to compute in parallel.

[0040] Q, K, and V are each linearly projected into M sets of vectors of the same dimension, where M is the number of attention heads, preferably 8. The outputs of the 8 attention heads are concatenated along the first dimension to form a new matrix, which is finally multiplied by the projection matrix $W^O$ to yield the multi-head attention features. The projection matrices $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are randomly initialized, and their parameters are updated as the network trains. Each attention head is calculated as follows:

[0041] $$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^{\top}}{\sqrt{d_k}}\right) V W_i^V;$$

[0042] where $i = 1, \dots, M$ and $d_k$ is the dimension of each head's key vectors. The final multi-head attention calculation is as follows:

[0043] $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W^O;$$

[0044] In the self-attention mechanism, Q, K, and V all originate from the same feature sequence X, and the calculation process is as follows:

[0045] $$Q = X W^Q,\qquad K = X W^K,\qquad V = X W^V;$$

[0046] where $W^Q$, $W^K$, and $W^V$ are three weight matrices with randomly generated initial values, whose parameters are updated as the network is trained.
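The multi-head attention described in [0039] to [0046] can be sketched in plain Python on small lists. This is a minimal illustration, not the patented implementation: the learned projection matrices are omitted (treated as identity), heads are formed by slicing the feature dimension, head outputs are concatenated along the feature dimension, and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def split_heads(x, num_heads):
    """Slice every row vector into num_heads equal chunks, one matrix per head."""
    d = len(x[0]) // num_heads
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(num_heads)]


def multi_head_attention(q, k, v, num_heads):
    """Attend per head, then concatenate head outputs row by row.
    Assumption: the learned projections are identity and therefore omitted."""
    heads = [attention(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, num_heads),
                                   split_heads(k, num_heads),
                                   split_heads(v, num_heads))]
    return [sum((head[i] for head in heads), []) for i in range(len(q))]


# Self-attention: Q, K, and V all come from the same sequence X.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(x, x, x, num_heads=2))
```

In a trained model each head would apply its own learned projections before the dot product; the sketch keeps only the split, attend, and concatenate structure.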

[0047] (4) Line segment prediction module

[0048] The line segment decoder outputs T line segment features. The line segment prediction module predicts on each query result in parallel, obtaining T line segment prediction results. Each result contains two parts: a line segment and its confidence score. The position query and the line segment query are combined along the last dimension, and the positions of the T line segments are obtained through a three-layer feedforward prediction network.

[0049] The calculation process of the feedforward network is as follows:

[0050] ;

[0051] The position of a line segment is represented by the coordinates of its two endpoints, and the prediction result is also the coordinates of its two endpoints. Unified representation:

[0052] ;

[0053] The line segment query is passed directly through a two-layer feedforward prediction network and activated using the softmax function to obtain the confidence score:

[0054] ;

[0055] The confidence score represents the probability that a line segment belongs to the edge of the document. During the prediction phase, line segments with a confidence score below the threshold of 0.7 are filtered out and treated as if no target was found. During the training phase, the confidence score of a line segment is included in the matching cost: the degree to which each predicted line segment matches each real edge is computed, and an optimal matching algorithm, the Hungarian algorithm, then finds the S predicted line segments that best match the real edge segments. The loss is calculated by comparing these segments with the real edges and is then backpropagated.
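The prediction-phase filtering in [0055] can be sketched as follows. The two-class layout of the logits (edge versus no edge) and the function names are assumptions for illustration; only the softmax activation and the 0.7 threshold come from the text.

```python
import math


def edge_confidence(logits):
    """Softmax over [edge, no-edge] logits; returns the probability of 'edge'.
    The two-class layout is an assumption, not stated in the source."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[0] / sum(exps)


def filter_segments(segments, scores, threshold=0.7):
    """Keep segments whose confidence is not below the threshold."""
    return [seg for seg, s in zip(segments, scores) if s >= threshold]


segments = [(0, 0, 100, 0), (3, 7, 9, 12), (100, 0, 100, 60)]
scores = [edge_confidence(l) for l in ([3.0, 0.0], [0.0, 2.0], [2.0, 0.0])]
print(filter_segments(segments, scores))
```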

[0056] The number of real line segments is expanded from S to T, and non-existent line segments are represented by empty line segments. The matching cost between predicted line segments and real line segments is calculated as follows:

[0057] ;

[0058] In this formula, an indicator function restricts the cost to non-empty real line segments, a distance term measures how far a predicted segment lies from a real segment, and two balancing factors weight the segment distance and the confidence in the calculation. The cost is calculated pairwise between each predicted segment and each real segment, with the cost between an empty segment and a predicted segment set to 0, resulting in the cost matrix. The permutation with the minimum total matching cost is then searched for in the cost matrix:

[0059] ;

[0060] In this formula, each predicted line segment is paired with the best-matching real line segment found by the matching, which is computed using the Hungarian algorithm. The T predicted line segments are matched one-to-one with the T target real line segments.
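The one-to-one matching in [0056] to [0060] can be illustrated on a tiny example. This sketch swaps the Hungarian algorithm for a brute-force search over permutations, which returns the same optimum but is only practical for very small T. The L1 endpoint distance, the weights `w_dist` and `w_conf`, and all function names are assumptions for illustration.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two segments given as (x1, y1, x2, y2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred_seg, pred_conf, gt_seg, w_dist=5.0, w_conf=1.0):
    """Cost of pairing one prediction with one real segment.
    An empty real segment (None) costs 0, as the text specifies."""
    if gt_seg is None:
        return 0.0
    return w_dist * l1(pred_seg, gt_seg) - w_conf * pred_conf


def best_assignment(pred_segs, pred_confs, gt_segs):
    """Pad the S real segments to T with empty ones, then find the permutation
    with the minimum total cost. perm[i] is the padded real-segment index
    assigned to prediction i."""
    t = len(pred_segs)
    padded = list(gt_segs) + [None] * (t - len(gt_segs))
    best, best_cost = None, float("inf")
    for perm in permutations(range(t)):
        cost = sum(match_cost(pred_segs[i], pred_confs[i], padded[perm[i]])
                   for i in range(t))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


preds = [(0, 0, 1, 0), (5, 5, 6, 6), (0, 0, 0, 1)]
confs = [0.9, 0.1, 0.8]
gts = [(0, 0, 0, 1), (0, 0, 1, 0)]
print(best_assignment(preds, confs, gts))
```

For realistic sizes such as T = 30, a polynomial-time solver is needed; `scipy.optimize.linear_sum_assignment` is a commonly used one.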

[0061] Furthermore, the document edge decoder includes:

[0062] The edge decoder uses a cross-attention mechanism to determine the similarity between line segment queries and guiding features; it uses a self-attention mechanism to perform autoregression to decode aggregated line sequence queries, thereby predicting the relative arrangement order of document edge line segments.

[0063] The main structure of the edge decoder uses a 3-layer DETR decoding module. The input query value Q is a sequential query, derived from the output of the line segment decoder. The input key K and input value V are homologous, both derived from guiding features. The output is the result of line sequence lookup, equivalent to sequential features. This is then processed through a feedforward network structure to obtain the relative order of line segments. The edge decoding process is shown in the following equation:

[0064] ;

[0065] ;

[0066] The line sequence query retrieves the sequential features of the line segments. The features that guide the model toward these sequential features are called guiding features. To obtain them, the encoder-encoded features are enhanced using instance activation maps and then added element by element to a sequential embedding:

[0067] ;

[0068] In this formula, the sequential embedding has the same dimension as the line sequence query; it is initialized randomly and updated as the network is trained. The instance activation map acts as an object-aware weighting.

[0069] After edge decoding, the relative arrangement order features are obtained, and a feedforward network is used to predict the relative order between line segments. The four vertices of the document, $P_1$, $P_2$, $P_3$, $P_4$ with $P_i = (x_i, y_i)$, are obtained by calculating the intersection points between adjacent line segments. The bounding box of the document target is derived by taking the maximum and minimum values of the vertex coordinates, which also gives the bounding box size $(W, H)$:

[0070] $$W = \max_i x_i - \min_i x_i,\qquad H = \max_i y_i - \min_i y_i,\qquad i = 1, \dots, 4.$$
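The vertex and bounding-box step in [0069] can be sketched directly: intersect each edge line with the next one in the predicted order, then take the coordinate extremes. The function names and the `(x1, y1, x2, y2)` segment layout are assumptions for illustration.

```python
def line_intersection(seg_a, seg_b):
    """Intersection of the infinite lines through two segments (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = seg_a
    x3, y3, x4, y4 = seg_b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-12:
        raise ValueError("lines are parallel")
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / denom
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / denom
    return px, py


def vertices_and_bbox(ordered_edges):
    """Vertices from adjacent edges (cyclic order), then bounding box and size."""
    n = len(ordered_edges)
    verts = [line_intersection(ordered_edges[i], ordered_edges[(i + 1) % n])
             for i in range(n)]
    xs = [p[0] for p in verts]
    ys = [p[1] for p in verts]
    bbox = (min(xs), min(ys), max(xs), max(ys))
    size = (bbox[2] - bbox[0], bbox[3] - bbox[1])
    return verts, bbox, size


# Four partially visible edges of a 4 x 3 rectangle: top, right, bottom, left.
edges = [(1, 0, 3, 0), (4, 1, 4, 2), (3, 3, 1, 3), (0, 2, 0, 1)]
print(vertices_and_bbox(edges))
```

Because the lines are extended before intersecting, the corners are recovered even when the detected segments stop short of them, for example when a hand occludes part of an edge.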

[0071] Furthermore, the stretching and deformation of document images based on perspective transformation includes:

[0072] After extracting the target area of the document from the image, perspective transformation is applied to stretch and deform it to the size of the target bounding box, resulting in an aligned rectangular document image. Perspective transformation achieves the stretching and deformation of the document target through linear image distortion. It projects the original image onto the target image while preserving straight lines in the original image. Perspective transformation is a three-dimensional spatial transformation; the calculation process for the perspective transformation of the coordinates of any pixel in three-dimensional space is shown in the following formula:

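[0073] $$[x', y', w'] = [u, v, 1] \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix};$$

[0074] where $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$ produce linear transformations such as rotation, scaling, and shearing; $a_{31}$, $a_{32}$ produce a translation; $a_{13}$, $a_{23}$ produce a projection transformation; $(u, v)$ are the coordinates of a pixel in the original image; and $(x, y)$ are the corresponding coordinates in the target image. Since the target image lies in a two-dimensional plane, the coordinates in three-dimensional space are divided by $w'$ to get $(x, y)$:

[0075] $$x = \frac{x'}{w'} = \frac{a_{11}u + a_{21}v + a_{31}}{a_{13}u + a_{23}v + a_{33}},\qquad y = \frac{y'}{w'} = \frac{a_{12}u + a_{22}v + a_{32}}{a_{13}u + a_{23}v + a_{33}};$$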

[0076] The interpolation method selected is bilinear interpolation, with boundary mode [missing information]. Perspective transformation is implemented using the `getPerspectiveTransform` function in OpenCV. The inputs are the image and the coordinates of the four predicted vertices of the document; the output corresponds to the four vertices of the target rectangle.
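To show what the perspective matrix computation involves, the following sketch solves for the 3×3 matrix from four point pairs using only the standard library. It is an illustration under assumptions: the matrix is written in the column-vector convention (the one OpenCV also uses), its last entry is fixed to 1, and the helper names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_matrix(src, dst):
    """3x3 matrix H mapping each src point (u, v) to its dst point (x, y)."""
    a, b = [], []
    for (u, v), (x, y) in zip(src, dst):
        a.append([u, v, 1, 0, 0, 0, -x * u, -x * v])
        b.append(x)
        a.append([0, 0, 0, u, v, 1, -y * u, -y * v])
        b.append(y)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def warp_point(h, u, v):
    """Apply H to one point and divide by the homogeneous coordinate."""
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return ((h[0][0] * u + h[0][1] * v + h[0][2]) / w,
            (h[1][0] * u + h[1][1] * v + h[1][2]) / w)


# Predicted document vertices mapped to a 100 x 60 target rectangle.
src = [(10, 20), (110, 30), (120, 90), (5, 80)]
dst = [(0, 0), (100, 0), (100, 60), (0, 60)]
H = perspective_matrix(src, dst)
print([warp_point(H, u, v) for u, v in src])
```

With OpenCV available, `cv2.getPerspectiveTransform(src, dst)` returns the same kind of matrix from four point pairs, and `cv2.warpPerspective(image, M, dsize, flags=cv2.INTER_LINEAR)` resamples the whole image with bilinear interpolation.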

[0077] (1) Loss function

[0078] The network's detection target consists of two parts: line segments and bounding boxes. Therefore, the loss function consists of two parts, the line segment loss and the line sequence prediction loss, and the loss of each image is:

[0079] ;

[0080] In line segment detection, a line segment is represented by two endpoints. Two weighting coefficients are used to balance the terms so that the line segment loss and the line sequence prediction loss are on the same order of magnitude.

[0081] 1) Line segment loss

[0082] The line segment loss consists of two parts: the line loss of T predicted line segments that match S actual labeled line segments, and the confidence loss of whether it belongs to the edge of the document.

[0083] ;

[0084] In this formula, an indicator function selects the matched line segments, and the distance term measures the distance between the two endpoints of the predicted line segment and the two endpoints of the actual labeled line segment.

[0085] 2) Edge loss

[0086] The edge loss comes from the loss due to the relative arrangement order between predicted line segments;

[0087] ;

[0088] In this formula, the cross-entropy loss is computed between the predicted order of arrangement and the actual order of arrangement.
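A cross-entropy loss over predicted orderings, as named in [0088], can be sketched as follows. The layout is an assumption for illustration: each line segment gets one row of logits over the possible order positions, and the loss is averaged over segments.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one softmax prediction against the true class index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


def edge_order_loss(order_logits, true_order):
    """Mean cross-entropy over segments; averaging is an assumption."""
    losses = [cross_entropy(row, t) for row, t in zip(order_logits, true_order)]
    return sum(losses) / len(losses)


# Two segments, two order positions; predictions agree with the true order.
print(edge_order_loss([[10.0, 0.0], [0.0, 10.0]], [0, 1]))
```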

[0089] (2) Network training

[0090] Before training on the document dataset, the ResNet network is initialized from pre-trained weights. During training, the Adam optimizer is used with a learning rate of 0.0001 and a weight decay of 0.0001. The dropout rate of the FFN neurons is 0.1. Training uses four RTX 3090 GPUs, with 8 training samples per batch and a maximum of 20 iterations. OpenMMLab is used as the deep learning framework.

[0091] Another object of the present invention is to provide a document target extraction system applying the aforementioned document target extraction method, the document target extraction system comprising:

[0092] The image acquisition module is used to acquire multispectral images of documents using embedded devices.

[0093] The feature extraction module is used to extract multi-scale features from the global image using the ResNet network.

[0094] The feature encoding module is used to perform feature encoding using the Deformable DETR encoder;

[0095] The perspective transformation module is used to obtain the edges, vertices, and bounding boxes of an ID card image through perspective transformation.

[0096] Another object of the present invention is to provide a computer device including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the steps of the document target extraction method described above.

[0097] Another object of the present invention is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the document target extraction method.

[0098] Another objective of this invention is to provide an information data processing terminal, which includes the aforementioned document target extraction system.

[0099] Based on the above technical solutions and the technical problems solved, the advantages and positive effects of the technical solution to be protected by this invention are as follows:

[0100] First, this invention proposes a novel and highly efficient end-to-end detection and segmentation method that transforms the task of detecting documents into a direct set prediction task. It proposes a line segment detection-based approach, using line segments as the detection targets for segmenting documents from the background. Line segments are predicted first and the line sequence second, resulting in neat and regular edge contours. Furthermore, it proposes for the first time using vectors formed from vertex coordinates to represent the edge contours of documents, which is highly effective for objects with polygonal contours.

[0101] This invention provides a document target extraction algorithm based on an improved Transformer structure. First, a ResNet network is used to extract multi-scale features from the global image. Then, a Deformable DETR encoder is used for feature enhancement and aggregation to obtain encoded features. The decoding part is divided into two stages. In the first stage, a multi-scale line segment decoder is used to decode the image, querying line segment features. Subsequently, a line segment prediction module estimates the endpoints and confidence scores of each line segment. Only line segments with confidence scores higher than a threshold are considered edge segments of the document. In the second stage, an edge decoder compares the line segment features with guiding features to predict the relative order of edge segments. For adjacent edge segments, the intersection points are calculated sequentially to infer the complete vertices and bounding boxes. Finally, based on the estimated document edges and bounding boxes, perspective transformation is performed to complete the extraction of the document portion of the image.

[0102] This invention utilizes embedded devices to acquire multispectral images of identification documents; it models the image based on the geometric property that the edge segments of the document image are straight lines, and infers from the contextual information of the global image; it uses ResNet to extract global features, and then uses a Deformable DETR encoder for encoding; in the first-stage decoding process, it uses an attention mechanism and learnable line segment and position queries to predict the edge segments of the identification document; in the second-stage decoding process, it compares the correlation between line segment features and image features to predict the relative order between line segments, and further derives the complete edges, vertices, and bounding boxes of the identification document.

[0103] Second, the document target extraction model based on the improved Transformer structure provided by this invention uses bipartite matching to predict line segments in a one-to-one manner, thereby avoiding pre- and post-processing, simplifying the detection pipeline, and achieving true end-to-end processing. Finally, perspective transformation is used to quickly extract the regular rectangular document portion.

[0104] Third, as supplementary evidence of the inventive step of the claims of this invention, it is also reflected in the following important aspects:

[0105] (1) The technical solution of this invention fills a technical gap in the industry both domestically and internationally:

[0106] There has been a lack of methods for extracting identification documents and certificates both domestically and internationally. Traditional algorithms have very poor detection performance, and deep learning algorithms have drawbacks. Technically, they mostly rely heavily on bounding boxes or dense center points, which leads to a lot of redundancy and inevitably requires post-processing with non-maximum suppression. In terms of performance, they cannot achieve accurate extraction results and are not truly end-to-end models. This model fills the gap in the field of identification document and certificate extraction.

[0107] (2) The technical solution of the present invention overcomes technical bias:

[0108] Past detection algorithms require pre-proposing a large number of redundant boxes, and removing these boxes requires a lot of computation, making them unsuitable for embedded devices. This algorithm overcomes the bias that detection and segmentation methods are not truly applicable to small, portable devices.

Attached Figure Description

[0109] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0110] Figure 1 is a flowchart of the document target extraction method provided in an embodiment of the present invention;

[0111] Figure 2 is a schematic diagram of the document target extraction method provided in an embodiment of the present invention;

[0112] Figure 3 is a network structure diagram for target detection and extraction of document images provided in an embodiment of the present invention;

[0113] Figure 4 is a structural diagram of the Deformable DETR encoder provided in an embodiment of the present invention;

[0114] Figure 5 is a structural diagram of the line segment decoder provided in an embodiment of the present invention;

[0115] Figure 6 is a schematic diagram of the multi-head attention mechanism provided in an embodiment of the present invention;

[0116] Figure 7 is a schematic diagram of the line segment prediction module provided in an embodiment of the present invention;

[0117] Figure 8 is a structural diagram of the document edge decoder provided in an embodiment of the present invention;

[0118] Figure 9 is a schematic diagram of the image deformation process based on perspective transformation provided in an embodiment of the present invention.

Detailed Description

[0119] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0120] To address the problems existing in the prior art, the present invention provides a method, system, medium, device, and terminal for extracting identity documents. The present invention will be described in detail below with reference to the accompanying drawings.

[0121] To enable those skilled in the art to fully understand how the present invention is specifically implemented, this section provides an explanatory description of the embodiments that expand upon the technical solutions of the claims.

[0122] As shown in Figures 1-2, the document target extraction method provided in this embodiment of the invention includes the following steps:

[0123] S101: use an embedded device to acquire multispectral images of the document and input them into the ResNet network;

[0124] S102: use the ResNet network to extract features from the global image, obtaining multi-scale features;

[0125] S103: use a Deformable DETR encoder to encode the features, obtaining the encoded features;

[0126] S104: use the DETR decoder to perform the first stage of decoding, obtaining the document edge line segments;

[0127] S105: use the DETR decoder to perform the second stage of decoding, obtaining the relative order between line segments and deducing the complete vertices and contours;

[0128] S106: use perspective transformation to perform a linear transformation that corrects the document target into a rectangular image with neat edges.

[0129] Figure 3 illustrates the overall structure of the document image target detection and extraction network provided in this embodiment of the invention. The document image is input into the network and first passes through a ResNet network, which extracts multi-scale features from the global image. A Deformable DETR encoder then enhances and aggregates these features to obtain the encoded features. Decoding is divided into two stages. In the first stage, a multi-scale line segment decoder decodes the encoded features and retrieves line segment features; a line segment prediction module then estimates the endpoints and confidence score of each line segment, and only line segments whose confidence score exceeds a threshold are treated as edge segments of the document. In the second stage, an edge decoder compares the line segment features with guiding features to predict the relative order of the edge segments. Intersections of adjacent edge segments are computed in sequence to infer the complete vertices and bounding box. Finally, based on the estimated document edges and bounding box, perspective transformation is performed to extract the document portion of the image.

[0130] As a preferred embodiment, the document target extraction method provided by this invention specifically includes:

[0131] 1. Global Image Feature Extraction

[0132] This invention uses a ResNet network pre-trained within the Caffe framework as the backbone network for feature extraction from images. Numerous studies have demonstrated that the ResNet network possesses strong feature extraction capabilities and can avoid gradient vanishing and exploding.

[0133] This invention replaces the CNN convolutional kernels in layers 3 to 5 of the ResNet network with DCNv2 convolutional kernels, increasing the ResNet network's ability to adapt to geometric changes in documents in multispectral images. This makes the model's attention during the convolution feature extraction process closer to the target structure, and avoids the problem of the convolutional kernel's sampling range exceeding the region of interest. This is a technique to improve performance in detection and segmentation tasks.

[0134] Given a three-channel RGB ID card image, which can be captured under any spectrum and has no restrictions on height and width, first scale the longer dimension (height or width) to 1024, while maintaining the aspect ratio of the image, and then scale the other side to obtain the network input. Then, after passing through a ResNet network, feature maps of multiple different scales are extracted. The length and width of each feature layer are half of the previous layer:

[0135]

[0136] This invention extracts features from the last 3 layers. Because each layer has a large number of channels, these features cannot be used directly as input to the encoder, so a 1×1 convolution is used to reduce the number of channels. The last layer of features is additionally downsampled once using a 3×3 convolution with a stride of 2, giving a total of 4 layers of features that serve as input to the encoder.
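The input scaling and the resulting feature map sizes can be illustrated with a minimal Python sketch. The function names and the strides (8, 16, 32, 64) are assumptions chosen for illustration, following the common ResNet and Deformable DETR convention; the specification only states that each feature layer is half the size of the previous one and that a fourth layer is produced by a stride-2 convolution.

```python
import math


def resize_longer_side(height, width, target=1024):
    """Scale the image so its longer side equals `target`, keeping the aspect ratio."""
    scale = target / max(height, width)
    return round(height * scale), round(width * scale)


def feature_map_sizes(height, width, strides=(8, 16, 32, 64)):
    """Spatial size of each feature level (assumed strides); each level halves the previous one."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


if __name__ == "__main__":
    h, w = resize_longer_side(1080, 1920)
    print("network input size:", (h, w))
    for level, size in enumerate(feature_map_sizes(h, w)):
        print("feature level", level, "size", size)
```

Under these assumptions, the first three sizes correspond to the last three ResNet layers and the fourth to the extra stride-2 downsampling of the last layer.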

[0137] 2. Feature encoding based on Deformable DETR structure

[0138] This invention provides a neck network designed to enhance and aggregate multi-scale features extracted from the backbone network, resulting in encoded multi-scale features. This invention fully utilizes the encoder portion of a Deformable Transformer (Deformable DETR) structure as the neck network. The main function of the Deformable Transformer encoder is to learn and encode global features, enabling the document features to be distinguished from background features as much as possible. In application, this invention adjusts the original structure to better suit the application scenario.

[0139] Intuitively, Deformable DETR encoding is an attention mechanism based on sparse space sampling. Since large-scale features contain more detailed information and small-scale features contain more semantic information, most advanced object detection models today benefit from multi-scale structures. Deformable DETR encoding is similar; it collects feature maps of different scales extracted by the backbone network ResNet, automatically fuses and encodes these features, naturally expanding the single-layer features of the Transformer into multi-scale features.

[0140] Figure 4 shows the structure of the Deformable DETR encoder, which consists of six layers of Deformable DETR encoding modules. Each module includes a Multi-Scale Deformable Attention (MSDeformAttn) mechanism and a Feedforward Neural Network (FFN). During the encoding process, LayerNorm is used to normalize the elements in the features. The entire encoding process is calculated as follows:

[0141]

[0142] The input and output feature maps have the same resolution. Positional encoding is added element-wise, so the positional encoding has the same size as the feature map. The features are flattened into a one-dimensional vector before being input to the first layer.

[0143] As shown in Figure 4, the input includes multi-scale features and reference points. The multi-scale features are position-encoded by element-wise addition, with the positional encoding matching the feature map in size. The output is the encoded features, whose dimensions and size are consistent with those of the multi-scale features before encoding.

[0144] 2.1 Multi-scale line segment decoder

[0145] The multi-scale decoder operates on encoded features at three different scales. Decoding compares the line segment queries with the encoded features and uses the similarity between them to decode the line segments in the document image layer by layer. The structure of the line segment decoder is shown in Figure 5; it consists of three layers of adjusted DETR decoder structures.

[0146] The inputs are query elements and feature elements. The location query embedding and the line segment query embedding are concatenated and used as the input query Q. Both embeddings are obtained by embedding the T queries into the encoding dimension of 256, and both are initially randomly generated trainable parameters. The input key K is obtained by adding the positional encoding to the i-th layer of encoded features, and the i-th layer of encoded features is used directly as the input value V. There are 3 layers of encoded features with different dimensions in total, corresponding to the cross-attention mechanisms of the 3 layers of line segment decoding modules.

[0147] The output consists of the location query results and the line segment query results, which then serve as inputs to the next layer of the line segment decoder. For clarity, this invention no longer distinguishes between input and output and refers to the two collectively as the location query and the line segment query. The entire update process is represented as follows:

[0148]

[0149]

[0150] where the first operator denotes the cross-attention mechanism and the second denotes the self-attention mechanism. After each attention module, LayerNorm is applied for layer normalization.

[0151] The core of line segment decoding lies in first determining the number of line segment features to be queried, and then using an attention mechanism to decode and obtain the corresponding number of line segment features, implicitly updating the position query during the process. These parts will be explained in detail below:

[0152] (1) Query the number of line segments

[0153] There are a total of T queries. Each query checks if a corresponding line segment exists in the image at a different location. If a suitable answer is found, the result is returned. This means that regardless of the type of certificate image, a fixed set of T query results will always be output.

[0154] Each complete certificate image has four distinct edge segments, but the number of query segments T cannot be set to 4. The value of T needs to be significantly greater than the number of edges on the certificate. This is because the line segment detection module of this model indiscriminately detects line segments in the image during prediction. Complex background areas in the certificate image may contain other line segments that interfere with the detection. Setting T too small may lead to missed detection of certificate edge segments, while setting it too large will introduce more meaningless parameters to the model and interfere with subsequent bounding box prediction. Experiments show that setting T to 30 is most suitable.

[0155] (2) Line segment query and position query

[0156] The purpose of the queries is to retrieve, from the encoded features, the line segment features that the model is interested in. Line segment queries alone can achieve this, but the results are inferior, whereas position queries are faster. Therefore, this model uses a combination of position queries and line segment queries to represent the queried line segment features. The two queries are updated layer by layer in the three-layer model, and the final retrieved line segment features are stored in these two embedding vectors. Both the line segment query and the position query are embedding vectors obtained by embedding the T queries into the 256-dimensional encoding space, as described above.

[0157] During forward propagation, line segment queries are updated at each layer via an attention mechanism, while position queries are updated differently. A three-layer feedforward network is applied to the predicted cluster query results, resulting in a vector that serves as the bias for the position queries. This vector is then superimposed on the original position query results and used as the input for the position queries in the next layer. Therefore, position queries are updated during prediction as each layer is computed forward. This is done to dynamically and implicitly update the attention positions during prediction.
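The position query update described above can be sketched as a residual update: a three-layer feedforward network produces an offset that is added to the current position query. The sketch below is a minimal pure-Python illustration under stated assumptions: the layer sizes, the ReLU activation, the random initialization, and the function names are all hypothetical, and a 4-dimensional query is used instead of the 256-dimensional embedding for readability.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_ffn(dim, rng, scale=0.1):
    """Three randomly initialized dim x dim layers (hypothetical sizes)."""
    def layer():
        weights = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
        return weights, [0.0] * dim
    return [layer(), layer(), layer()]


def ffn(x, layers):
    """Three-layer feedforward network with ReLU between layers (assumed activation)."""
    h = relu(linear(x, *layers[0]))
    h = relu(linear(h, *layers[1]))
    return linear(h, *layers[2])


def update_position_query(pos_query, seg_query, layers):
    """Add the FFN output as a bias to the position query for the next layer."""
    offset = ffn(seg_query, layers)
    return [p + o for p, o in zip(pos_query, offset)]


if __name__ == "__main__":
    rng = random.Random(0)
    layers = make_ffn(4, rng)
    print(update_position_query([0.5, 0.5, 0.5, 0.5], [1.0, 2.0, 3.0, 4.0], layers))
```

In a trained model the layer weights would be learned parameters; here they are random only so that the sketch is self-contained.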

[0158] (3) Attention mechanism

[0159] Multi-head attention adaptively aggregates the content of the keys. This allows the network to attend simultaneously to information from multiple different representation subspaces and locations. Through simple linear splitting and stacking of the query vector Q, key vector K, and value vector V, it achieves an effect similar to using multiple convolutional kernels in CNNs, enabling the model to compute in parallel and capture richer features. The structure of multi-head attention is shown in Figure 6.

[0160] In Figure 6, Q, K, and V are first linearly decomposed into M vectors of the same dimension, where M is the number of attention heads, typically chosen to be 8. The outputs of these 8 attention heads are concatenated in the first dimension to yield a new matrix. To maintain dimensional consistency, this matrix is then multiplied by a projection matrix, which yields the multi-head attention features. The projection matrices are initialized with random values, and their parameters are updated as the network trains. Each attention head is calculated as follows:

[0161]

[0162] where the scaling term is used to avoid [the issue]. The final multi-head attention calculation is as follows:

[0163]

[0164] Self-attention and cross-attention are two different applications of multi-head attention mechanisms. The difference lies in the source of their query vector Q, key vector K, and value vector V. In self-attention, Q, K, and V all originate from the same source X, and the calculation process is as follows:

[0165]

[0166] where the three weight matrices, one each for Q, K, and V, are initialized with random values, and their parameters are updated as the network is trained.

[0167] Cross-attention and self-attention mechanisms differ structurally. The sources of Q, K, and V in cross-attention are different. Cross-attention can aggregate vectors with different dimensions and meanings and compare the similarity between content from different sources.
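The attention mechanism described in this subsection can be illustrated with a minimal pure-Python sketch. It assumes the standard scaled dot-product form of attention and omits the learned projection matrices for brevity; the function names are illustrative, and two heads are used in the example instead of the typical eight.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def split_heads(X, num_heads):
    """Split every row vector into num_heads equal chunks, one list of rows per head."""
    d = len(X[0])
    assert d % num_heads == 0, "feature dimension must be divisible by the number of heads"
    step = d // num_heads
    return [[row[h * step:(h + 1) * step] for row in X] for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    """Run attention per head and concatenate the head outputs per query row."""
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads))
    ]
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]


if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
    # Self-attention: Q, K, and V all come from the same source X.
    print(multi_head_attention(X, X, X, num_heads=2))
    # Cross-attention: the query comes from a different source than K and V.
    queries = [[0.5, 0.5, 0.5, 0.5]]
    print(multi_head_attention(queries, X, X, num_heads=2))
```

The only difference between the two calls is where Q comes from, which matches the distinction between self-attention and cross-attention drawn above.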

[0168] (4) Line segment prediction module

[0169] The line segment decoder outputs T line segment features. The line segment prediction module performs parallel predictions on each query result, obtaining T line segment prediction results. Each result contains two parts: a line segment and its confidence level. The structure of the line segment prediction module is shown in Figure 7.

[0170] The location query and the line segment query are concatenated along the last dimension, and the positions of the T line segments are then obtained through a three-layer feedforward prediction network (FFN).

[0171] The calculation process of the feedforward network is as follows:

[0172]

[0173] The position of a line segment is represented by the coordinates of its two endpoints; therefore, the prediction result is also the coordinates of its two endpoints. They are represented uniformly as:

[0174]

[0175] The line segment query is passed directly through a two-layer feedforward prediction network (FFN) and then activated using a softmax function to obtain the confidence score:

[0176]

[0177] The confidence score represents the probability that a line segment belongs to the edge of the document. During the prediction phase, line segments with a confidence score below the threshold of 0.7 are filtered out as if no target was found. During the training phase, the confidence score of a line segment is included in the matching cost.
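The prediction-phase filtering can be sketched as follows. The sketch assumes that the confidence head outputs two logits per query, ordered as (edge, not edge), and that a segment is stored as (x1, y1, x2, y2); both conventions and the function names are assumptions for illustration. Only the 0.7 threshold comes from the description.

```python
import math


def edge_confidence(logits):
    """Softmax over two logits, assumed ordered (edge, not edge); returns P(edge)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[0] / sum(exps)


def filter_segments(segments, logits, threshold=0.7):
    """Keep only segments whose edge confidence reaches the threshold."""
    kept = []
    for segment, segment_logits in zip(segments, logits):
        score = edge_confidence(segment_logits)
        if score >= threshold:
            kept.append((segment, score))
    return kept


if __name__ == "__main__":
    segments = [(0, 0, 100, 0), (30, 40, 35, 48)]
    logits = [(2.0, 0.0), (0.0, 0.0)]
    for segment, score in filter_segments(segments, logits):
        print(segment, round(score, 3))
```

Segments that fall below the threshold are simply dropped, which corresponds to treating the query as having found no target.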

[0178] The challenge in the training phase of this invention lies in the fact that the model predicts a fixed set of line segments T, which is significantly larger than the set of line segments S representing the actual document edges. Therefore, during training, it is necessary to first determine the degree to which each predicted line segment matches each actual edge, then use an optimal target matching algorithm to find the S line segments that best match the actual edge segments, and finally compare them with the actual edges to calculate the loss and backpropagate. This invention uses the Hungarian algorithm (a bipartite graph matching algorithm) as the optimal matching algorithm.

[0179] First, the number of real line segments is expanded from S to T, and non-existent line segments are represented by empty line segments. The matching cost between predicted line segments and real line segments is calculated as follows:

[0180]

[0181] where the indicator function restricts the cost to non-empty real line segments, the distance term measures the distance between the predicted and real line segments, and the balancing factors weight the line segment distance and the confidence in the calculation.

[0182] The cost value can therefore be calculated pairwise between each predicted line segment and each real line segment, with the cost between an empty line segment and a predicted line segment set to 0, which yields the cost matrix. The permutation with the minimum total matching cost is then searched for in the cost matrix:

[0183]

[0184] Here, each predicted line segment is paired with the best-matching real line segment found by the matching. According to [Stewart, RJ, Andriluka, M., Ng, AY: End-to-end people detection in crowded scenes. In: CVPR (2015)], the Hungarian algorithm can compute this optimal assignment efficiently, so that the T predicted line segments are matched one-to-one with the T target real line segments.

[0185] Compared to the method of pre-selecting candidate regions and dense center points in commonly used CNN models, this search and matching method achieves the same effect. The difference is that this method will always find a set of line segment predictions without repetition. This strengthens permutation invariance and avoids the non-maximum suppression step of autoregressive models, ensuring that each query yields a set of one-to-one line segment matching results.
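The one-to-one matching can be illustrated with a small pure-Python sketch. Two points are assumptions rather than the method of the invention: the cost is taken as a weighted L1 endpoint distance minus a weighted confidence (the exact formula is not reproduced in this text), and the minimum-cost assignment is found by brute-force enumeration instead of the Hungarian algorithm, which is only practical for a handful of segments. Because matching to an empty line segment costs 0, only the S real segments need to be enumerated.

```python
from itertools import permutations


def l1_distance(pred, real):
    """Sum of absolute endpoint differences between two segments (x1, y1, x2, y2)."""
    return sum(abs(p - r) for p, r in zip(pred, real))


def match_cost(pred, conf, real, lam_dist=1.0, lam_conf=1.0):
    """Assumed cost form: weighted distance minus weighted confidence; empty target costs 0."""
    if real is None:
        return 0.0
    return lam_dist * l1_distance(pred, real) - lam_conf * conf


def best_matching(preds, confs, reals):
    """Brute-force search; assignment[j] is the prediction index matched to real segment j."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(reals)):
        cost = sum(match_cost(preds[i], confs[i], reals[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


if __name__ == "__main__":
    preds = [(0, 0, 10, 0), (5, 5, 6, 6), (10, 0, 10, 10)]
    confs = [0.9, 0.2, 0.8]
    reals = [(10, 0, 10, 10), (0, 0, 10, 0)]
    print(best_matching(preds, confs, reals))
```

Predictions that do not appear in the returned assignment are the ones matched to empty line segments. For T = 30 queries, a polynomial-time solver such as the Hungarian algorithm would replace the enumeration.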

[0186] 2.2 Document Edge Decoder

[0187] Given the predicted line segments in the image, these segments cannot be directly used as the document edges because they may originate from background elements other than the document edges and lack a defined order. The complete edge of the document is derived from the sequential connection of four edge segments.

[0188] This invention designs an edge decoder that uses a cross-attention mechanism to determine the similarity between line segment queries and guiding features, and then uses a self-attention mechanism to perform autoregression to decode aggregated line sequence queries, thereby predicting the relative arrangement order of document edge line segments.

[0189] Figure 8 shows the structure of the document edge decoder. The main structure of the edge decoder is the same as that of the line segment decoder, using a 3-layer DETR decoding module. The input query Q is the line sequence query, derived from the output of the line segment decoder. The input key K and input value V are homologous, both derived from the guiding features. The output is the result of the line sequence query, equivalent to sequence features. After passing through structures such as a feedforward network (FFN), the relative order of the line segments can be obtained. The edge decoding process is shown in the following equations:

[0190]

[0191]

[0192] The purpose of the line sequence query is to obtain the sequential features of the line segments. The model uses enhanced features to guide the extraction of these sequential features; this invention refers to them as guiding features. The encoded features are enhanced using Instance Activation Maps (IAM), and the guiding features are then obtained by element-wise addition with the sequence embedding:

[0193]

[0194] where the sequence embedding has the same dimensions as the line sequence query, is initialized with random values, and is updated as the network is trained. IAM is a target-aware weighting method that highlights the informative regions representing the edge features of the document and increases the discriminative power of the feature vector.

[0195] After edge decoding, the relative arrangement order features are obtained, and a feedforward network is used to predict the relative order between line segments. The four vertices of the document portion are then obtained by simply calculating the intersection points between adjacent line segments. Furthermore, by taking the maximum and minimum values of the vertex coordinates, the bounding box of the document target is derived:

[0196]

[0197] The size of the bounding box can then be easily determined.

[0198] The input query Q is the line sequence query, derived from the line segment query in the line segment decoder. The input key K and value V are homologous, both originating from the minimum dimension of the encoded global features. The output is the result of the line sequence query, equivalent to the sequence features. A feedforward network and a Softmax structure are then used for regression to predict the line sequence values of the edge line segments. These values are then arranged from largest to smallest to obtain the relative order between the line segments.
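The step from ordered edge segments to vertices and bounding box can be sketched in a few lines of Python. The sketch sorts the segments by their predicted line sequence values from largest to smallest, intersects each segment with the next one (treating segments as infinite lines), and takes the coordinate extremes as the bounding box. The function names and the (x1, y1, x2, y2) segment layout are assumptions for illustration.

```python
def line_intersection(seg_a, seg_b):
    """Intersection of the infinite lines through two segments (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = seg_a
    x3, y3, x4, y4 = seg_b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-9:
        raise ValueError("segments are parallel; no unique intersection")
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / denom
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / denom
    return px, py


def vertices_and_bbox(segments, order_values):
    """Order segments by descending sequence value, then intersect adjacent pairs."""
    paired = sorted(zip(order_values, segments), key=lambda pair: pair[0], reverse=True)
    ordered = [segment for _, segment in paired]
    n = len(ordered)
    vertices = [line_intersection(ordered[i], ordered[(i + 1) % n]) for i in range(n)]
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    bbox = (min(xs), min(ys), max(xs), max(ys))
    return vertices, bbox


if __name__ == "__main__":
    # Four edge segments of a 100 x 50 rectangle, given in shuffled order.
    segments = [(90, 50, 10, 50), (10, 0, 90, 0), (0, 45, 0, 5), (100, 5, 100, 45)]
    order_values = [0.4, 0.9, 0.1, 0.7]
    vertices, bbox = vertices_and_bbox(segments, order_values)
    print(vertices)
    print("bbox:", bbox, "size:", (bbox[2] - bbox[0], bbox[3] - bbox[1]))
```

Note that the detected segments do not need to touch each other: the vertices come from extending each edge until it meets its neighbor.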

[0199] 3. Perspective Transformation-Based Stretching and Deformation of ID Card Images

[0200] This invention extracts the target area of the document from the image and stretches and deforms it to the size of the target box through perspective transformation to obtain an aligned rectangular document image.

[0201] Perspective transformation achieves the stretching and deformation of the document target through simple linear image distortion. It projects the original image onto the target image, preserving straight lines in the original image as straight lines after the transformation. The parallelism and angles between line segments may change after the transformation, but this does not distort the text or patterns on the document. The perspective transformation process is shown in Figure 9.

[0202] Perspective transformation is a three-dimensional spatial transformation. The calculation process of the perspective transformation of the coordinates of any pixel in three-dimensional space is shown in the following formula:

[0203]

[0204] where the four elements of the upper-left 2×2 block of the transformation matrix produce linear transformations such as rotation, scaling, and shearing; two further elements produce the translation; and two elements produce the projective transformation. The input vector holds the coordinates of a pixel in the original image, and the output vector holds the corresponding coordinates in the target image. Since the target image lies in a two-dimensional plane, the coordinates in three-dimensional space are divided by their third (homogeneous) component to obtain the two-dimensional coordinates:

[0205]

[0206] The interpolation method selected is bilinear interpolation, boundary mode. The perspective transformation is implemented using the `getPerspectiveTransform` function in OpenCV, which can be easily ported to embedded devices. The inputs are the image and the coordinates of the four predicted vertices of the document; the outputs are the four vertices of the target rectangle. This part only involves simple linear transformations and does not take part in the training of the deep learning model.
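To make the transformation concrete, the following pure-Python sketch solves for the 3×3 perspective matrix from four point correspondences and maps a point through it, including the division by the homogeneous component. It is a stand-in for illustration only, not the OpenCV implementation; in practice `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective` would perform the same role on whole images. The column-vector convention, the normalization of the last matrix element to 1, and the function names are assumptions.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting; returns x such that A x = b."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def perspective_matrix(src, dst):
    """3x3 matrix H mapping each source point (x, y) to its target point (u, v)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(H, x, y):
    """Apply H in homogeneous coordinates, then divide by the third component."""
    xp = H[0][0] * x + H[0][1] * y + H[0][2]
    yp = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xp / w, yp / w


if __name__ == "__main__":
    predicted_vertices = [(10, 20), (110, 30), (120, 90), (5, 80)]
    target_rectangle = [(0, 0), (100, 0), (100, 60), (0, 60)]
    H = perspective_matrix(predicted_vertices, target_rectangle)
    for point in predicted_vertices:
        print(point, "->", warp_point(H, *point))
```

Each predicted vertex lands on the corresponding corner of the target rectangle, and every interior pixel follows the same mapping.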

[0207] 3.1 Loss Function

[0208] The target detected by the network in this invention consists of two parts: line segments and borders. Therefore, the loss function also consists of two parts, the line segment loss and the line sequence prediction loss, and the loss of each image is:

[0209]

[0210] Please note that line segment detection differs from object detection tasks, where objects are represented by rectangles. A line segment only requires two endpoints to be represented.

[0211] Experiments showed that document objects in the document dataset are generally large, occupying a major area of the image. The detected line segments are therefore also long, which significantly increases the order of magnitude of the line segment loss. To keep the line segment loss and the line sequence prediction loss on the same order of magnitude, two weighting coefficients are used to balance them.

[0212] (1) Line segment loss

[0213] The line segment loss consists of two parts: the line loss of T predicted line segments that match S actual labeled line segments, and the confidence loss based on whether the line segment belongs to the edge of the document.

[0214]

[0215] where the first symbol denotes an indicator function and the second denotes the distance between the two endpoints of the predicted line segment and the two endpoints of the actual labeled line segment.

[0216] (2) Edge loss

[0217] The edge loss is derived from the loss due to the relative arrangement order between the predicted line segments:

[0218]

[0219] where the loss term is the cross-entropy loss, computed between the predicted order of arrangement and the actual order of arrangement.
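The structure of the loss can be sketched in Python under clearly stated assumptions, since the formulas themselves are not reproduced in this text. The sketch assumes an L1 endpoint distance for matched segments, a binary cross-entropy confidence term (matched predictions should score high, unmatched ones low), a cross-entropy order term, and two balancing weights with placeholder values. All function names and weights are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[target_index], 1e-12))


def segment_loss(preds, confs, reals, assignment):
    """assignment[j] is the index of the prediction matched to real segment j."""
    matched = set(assignment)
    distance = sum(
        sum(abs(p - r) for p, r in zip(preds[i], reals[j]))
        for j, i in enumerate(assignment)
    )
    confidence = 0.0
    for i, p in enumerate(confs):
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        confidence += -math.log(p) if i in matched else -math.log(1.0 - p)
    return distance + confidence


def order_loss(order_probs, true_orders):
    """Cross-entropy between predicted order distributions and true order indices."""
    return sum(cross_entropy(p, t) for p, t in zip(order_probs, true_orders))


def total_loss(l_segment, l_order, w_segment=1.0, w_order=1.0):
    """Weighted sum that keeps both terms on a comparable order of magnitude."""
    return w_segment * l_segment + w_order * l_order


if __name__ == "__main__":
    preds = [(0, 0, 10, 0), (3, 3, 4, 4)]
    confs = [0.9, 0.1]
    reals = [(0, 0, 10, 0)]
    l_seg = segment_loss(preds, confs, reals, assignment=(0,))
    l_ord = order_loss([[0.7, 0.1, 0.1, 0.1]], [0])
    print(total_loss(l_seg, l_ord, w_segment=0.5, w_order=2.0))
```

The weights passed to `total_loss` play the role of the balancing coefficients described above; their actual values are not given in this text.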

[0220] 3.2 Network Training

[0221] To achieve a better fit, a pre-trained ResNet network was used before training on the document dataset.

[0222] During training, the Adam optimizer was used with a learning rate of 0.0001 and a weight decay of 0.0001. The dropout rate of the FFN was 0.1. Four RTX 3090 GPUs were used for training throughout the experiment, with a batch size of 8 training samples and a maximum of 20 epochs. The framework used in the experiment was SenseTime's deep learning framework, OpenMMLab.
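The training settings above can be collected in a plain Python dictionary for reference. The key names are hypothetical and do not follow any particular OpenMMLab configuration schema; only the values come from the description. The helper assumes that the batch size of 8 is the total across all GPUs, which the text does not specify.

```python
import math

# Hypothetical key names; values taken from the training description.
train_config = {
    "optimizer": {"type": "Adam", "lr": 1e-4, "weight_decay": 1e-4},
    "ffn_dropout": 0.1,
    "num_gpus": 4,
    "batch_size": 8,
    "max_epochs": 20,
}


def iterations_per_epoch(num_samples, config=train_config):
    """Number of optimizer steps per epoch, assuming batch_size is the total batch."""
    return math.ceil(num_samples / config["batch_size"])


if __name__ == "__main__":
    print(train_config)
    print("steps per epoch for 1000 samples:", iterations_per_epoch(1000))
```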

[0223] The document target extraction system provided in this embodiment of the invention includes:

[0224] The image acquisition module is used to acquire multispectral images of documents via embedded devices;

[0225] The feature extraction module is used to extract multi-scale features from the global image using the ResNet network.

[0226] The feature encoding module is used to perform feature encoding using the Deformable DETR encoder;

[0227] The perspective transformation module is used to obtain the edges, vertices, and bounding boxes of an ID card image through perspective transformation.

[0228] The algorithm provided in this invention has been applied to a handheld document verification device. This device can capture document images of various spectra, such as white light, infrared, and ultraviolet light. When viewing the captured document images using the album function, the algorithm can be used to extract the images, save the extracted images to storage space, and then upload the extracted documents to a data center for archiving.

[0229] The embodiments of the present invention have achieved some positive results during the research and development or use process, and have indeed great advantages compared with the prior art. The following content describes them in conjunction with the data, charts and other information of the experimental process.

[0230] TOOD is currently the top-tier object detection algorithm, but it can only obtain bounding boxes and cannot accurately obtain the edges of the document. SOLOv2 is the current top-tier instance segmentation and detection algorithm, which can obtain a binary mask. The edges of the mask are jagged and the corners are rounded, making it difficult to accurately infer the true vertices of the document based on the mask.

[0231] Example: The selected camera will quickly autofocus and then capture a video stream, displaying it frame by frame on the screen. When the shutter button is pressed, the target frame in the middle of the image is saved to main memory. The album playback function allows playback of all captured photos and selective extraction of any ID image. In this case, the image is fed into the target extraction network proposed in this invention within the NPU, where the background is removed and the image is shaped into a standard rectangle before being saved back to main memory.

[0232] After the image is acquired, the deep learning algorithm (the function of the image processing software module) integrated in the main control board will be used to process and recognize the image; the card part in the image will be segmented out and then deformed to make it into a standard rectangular image.

[0233] It should be noted that embodiments of the present invention can be implemented in hardware, software, or a combination of both. The hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated-design hardware. Those skilled in the art will understand that the above-described devices and methods can be implemented using computer-executable instructions and / or included in processor control code, for example, such code provided on a carrier medium such as a disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and modules of the present invention can be implemented by hardware circuitry such as very large-scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field-programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of the above-described hardware circuitry and software, such as firmware.

[0234] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications, equivalent substitutions, and improvements made by those skilled in the art within the scope of the technology disclosed in the present invention, and within the spirit and principles of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A method for extracting target documents, characterized in that, The document target extraction method includes: acquiring multispectral images of the document using an embedded device; modeling the document image edge segments as straight lines using geometric properties, and inferring from the context information of the global image; extracting global image features using a ResNet network, and then encoding them using a Deformable DETR encoder; predicting the document image edge segments using an attention mechanism and learnable line segment and position queries during the first-stage decoding process; and comparing the correlation between line segment features and image features during the second-stage decoding process to predict the relative order between line segments, and then obtaining the complete edges, vertices, and bounding boxes of the document image through perspective transformation. The document image is input into a ResNet network. The ResNet network is used to extract features from the global image, obtaining multi-scale features. A Deformable DETR encoder is used for feature enhancement and aggregation, resulting in encoded features. The decoding process is divided into two stages. In the first stage, a multi-scale line segment decoder is used to decode and retrieve line segment features. The endpoints and confidence scores of each line segment are estimated using a line segment prediction module. Line segments with confidence scores higher than a threshold are considered edge segments of the document. In the second stage, an edge decoder compares the line segment features with guiding features to predict the relative order of edge segments. Intersections are calculated for adjacent edge segments to infer complete vertices and bounding boxes. Perspective transformation is performed based on the estimated document edges and bounding boxes to complete the image extraction of the document portion. 
Global image feature extraction includes: a ResNet network pre-trained in the Caffe framework is used as the backbone network to extract features from the image, and the standard convolution kernels in layers 3 to 5 of the ResNet network are replaced with DCNv2 convolution kernels. A three-channel RGB ID card image is obtained; its longer side (height or width) is scaled to 1024, and the other side is scaled by the same factor so that the aspect ratio is preserved, giving the network input. After the input passes through the ResNet network, feature maps at several different scales are extracted, the height and width of each feature layer being half those of the preceding layer. The features of the last 3 layers are extracted, and 1×1 convolutions are used to reduce their number of channels; a 3×3 convolution with a stride of 2 is then applied to downsample the last layer of features, giving 4 layers of features in total as input to the encoder.
Feature encoding based on the Deformable DETR structure includes: the encoder part of the Deformable Transformer structure is used as the neck network, which enhances and aggregates the multi-scale features extracted by the backbone network to obtain the encoded multi-scale features; the Deformable Transformer encoder learns and encodes global features so that the features of the document portion can be distinguished from the background features; Deformable DETR collects the feature maps of different scales extracted by the ResNet backbone network, then fuses and encodes them, thereby extending the single-layer features of the Transformer to multi-scale features. The Deformable DETR encoder consists of 6 Deformable DETR encoding modules, each comprising a multi-scale deformable attention mechanism and a feedforward neural network. During encoding, layer normalization is used to normalize the elements of the features.
The encoding calculation process is shown in the following formula: [formula not reproduced]. The input and output feature maps have the same resolution and are encoded position by position. The positional encoding has the same size as the feature map, and the features are flattened into a one-dimensional vector when they are input to the first layer.
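As a rough illustration of the input scaling and the multi-scale feature sizes described in claim 1, the following Python sketch computes the network input size and the sizes of the four feature levels. The function names, the rounding rule, and the backbone strides of 8, 16, and 32 for the last three ResNet layers are assumptions made for illustration; the claim states only that each level is half the size of the preceding one and that a stride-2 convolution produces the fourth level.

```python
import math


def scaled_input_size(height, width, longer_side=1024):
    """Scale the longer side to `longer_side`, preserving the aspect ratio."""
    longest = max(height, width)
    # Rounding to the nearest integer is an assumption; the claim does not specify it.
    return (round(height * longer_side / longest),
            round(width * longer_side / longest))


def feature_map_sizes(height, width, strides=(8, 16, 32)):
    """Sizes of the last three backbone levels plus one extra stride-2 level.

    The strides are assumed values typical of a ResNet backbone.
    """
    sizes = [(math.ceil(height / s), math.ceil(width / s)) for s in strides]
    last_h, last_w = sizes[-1]
    # A 3x3 convolution with stride 2 (padding 1 assumed) halves the last level.
    sizes.append((math.ceil(last_h / 2), math.ceil(last_w / 2)))
    return sizes


if __name__ == "__main__":
    h, w = scaled_input_size(1080, 1920)
    print("network input:", (h, w))
    print("feature levels:", feature_map_sizes(h, w))
```

The four sizes returned by `feature_map_sizes` correspond to the four feature layers that the claim feeds to the encoder.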

2. The document target extraction method as described in claim 1, characterized in that the multi-scale line segment decoder includes: a multi-scale decoder is used to decode the encoded features at three different scales; by comparing the line segment queries with the encoded features and analyzing the similarity between them, the line segment features in the document image are obtained through layer-by-layer decoding. The line segment decoder is a stack of 3 adapted DETR decoder layers. Given query elements and feature elements as input, the position query embedding and the line segment query embedding are superimposed and used as the input query Q. The position query embedding and the line segment query embedding are each obtained by embedding the T queries into the encoding dimension of 256, and both are trainable parameters that are randomly initialized. The input key K is obtained by superimposing the positional encoding on the i-th layer of encoded features, and the i-th layer of encoded features is used directly as the input value V; the encoded features at the three different scales are fed, respectively, to the cross-attention mechanisms of the 3 layers of the line segment decoding module. The output consists of the position feature query results and the line segment feature query results, which serve as the input to the next layer of the line segment decoder; these outputs are referred to as the position query and the line segment query. Their update process is expressed as follows: [formulas not reproduced], in which one operator denotes the cross-attention mechanism and another denotes the self-attention mechanism. Layer normalization is applied after each attention module. During line segment decoding, the number of line segment feature queries is fixed, and the attention mechanism decodes the corresponding number of line segment features.
The position query is implicitly updated during line segment decoding, specifically as follows:
(1) Number of line segment queries. There are T queries in total. Each query is equivalent to asking whether a corresponding line segment exists at a different location in the image; if a suitable answer exists, the result is returned, and a fixed-size set of T query results is output. Each complete document image contains 4 distinct edge line segments. T is set to be significantly larger than the number of document edges; T is set to 30.
(2) Line segment query and position query. Queries are used to extract, from the encoded features, the line segment features of interest to the model. The queried line segment features are represented jointly by the position query and the line segment query. These queries are updated layer by layer across the three layers of the model, and the finally queried line segment features are stored in embedding vectors. Both the line segment query and the position query are embedding vectors of the same dimension [dimension not reproduced]. During forward propagation, the line segment query is updated at each layer through the attention mechanism, while the position query is updated differently: a 3-layer feedforward network is applied to the predicted query results to obtain a vector that serves as an offset for the position query; this offset is superimposed on the original position query, and the result is used as the position query input to the next layer. The position query is thus updated as each layer of the network performs its forward computation, so that the attended positions are dynamically and implicitly updated during prediction.
(3) Attention mechanism. A multi-head attention mechanism is used to adaptively aggregate the content of the keys, allowing the network to attend simultaneously to information from multiple representation subspaces and different positions. By linearly splitting and concatenating the query vector Q, the key vector K, and the value vector V, an effect similar to using multiple convolution kernels in a CNN is achieved, and the model can compute in parallel. Q, K, and V are each linearly split into M vectors of the same dimension, $Q_i$, $K_i$, and $V_i$, where M is the number of attention heads and M is 8. The outputs of the heads are concatenated along the first dimension into a new matrix, which is finally multiplied by the projection matrix $W^{O}$ to give the multi-head attention features; $W^{O}$ is randomly initialized and its parameters are updated as the network trains. Each attention head is calculated as follows:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i, \quad i = 1, \dots, M,$$
where $d_k$ is the dimension of the key vectors of each head. The final multi-head attention is calculated as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_M)\, W^{O}.$$
In the self-attention mechanism, Q, K, and V are calculated from the input features $X$ as follows:
$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V},$$
where $W^{Q}$, $W^{K}$, and $W^{V}$ are three randomly initialized weight matrices whose parameters are updated as the network is trained.
(4) Line segment prediction module. The line segment decoder outputs T line segment features. The line segment prediction module predicts each query result in parallel, yielding T line segment predictions; each prediction contains two parts, a line segment and a confidence score for that line segment. The position query and the line segment query are concatenated along the last dimension, and the positions of the T line segments are obtained through a three-layer feedforward prediction network. The calculation of the feedforward network is as follows: [formula not reproduced]. The position of a line segment is represented by the coordinates of its two endpoints, and the prediction result is likewise the coordinates of the two endpoints.
The endpoints are written in a unified representation: [formula not reproduced]. The line segment query is passed directly through a two-layer feedforward prediction network and activated with the softmax function to obtain the confidence score: [formula not reproduced]. The confidence score represents the probability that a line segment belongs to an edge of the document. In the prediction phase, line segments with a confidence score below the threshold of 0.7 are filtered out directly and treated as no target found. In the training phase, the confidence score of each line segment is included in the matching cost: the degree to which each predicted line segment matches each real edge is determined, and an optimal matching algorithm, here the Hungarian algorithm, is used to find the S line segments that best match the real edge line segments; the loss is then calculated against the real edges and backpropagated. The set of real line segments is expanded from S to T, with non-existent line segments represented by empty line segments. The matching cost between predicted line segments and real line segments is calculated as follows: [formula not reproduced], where the formula contains an indicator function, a distance term, and balancing factors, so that both the segment distance and the confidence score enter the calculation. The cost is calculated pairwise between each predicted segment and each real segment, and the cost between an empty segment and a predicted segment is set to 0, which yields the cost matrix. The permutation with the minimum total matching cost is then searched for in the cost matrix: [formula not reproduced]. This permutation assigns to each predicted line segment its best-matching real line segment and is computed using the Hungarian algorithm, so that the T predicted line segments are matched one-to-one with the T target real line segments.
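The one-to-one matching between predicted and real line segments can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the cost is taken as a weighted L1 endpoint distance minus a weighted confidence score, the weights `w_dist` and `w_conf` are hypothetical, and the minimum-cost assignment is found by exhaustive search over permutations as a stand-in for the Hungarian algorithm named in the claim. Both approaches return a minimum-cost assignment, but exhaustive search is practical only for very small T. Empty segments are represented by `None` and cost 0, as stated in the claim.

```python
from itertools import permutations


def endpoint_distance(pred, real):
    """L1 distance between two segments given as (x1, y1, x2, y2)."""
    return sum(abs(p - r) for p, r in zip(pred, real))


def matching_cost(pred, conf, real, w_dist=1.0, w_conf=1.0):
    """Cost of assigning one prediction to one target; empty targets cost 0."""
    if real is None:
        return 0.0
    # Assumed form: closer segments and higher confidence give a lower cost.
    return w_dist * endpoint_distance(pred, real) - w_conf * conf


def match_segments(preds, confs, reals):
    """Pad the targets to len(preds) with empty segments, then pick the
    assignment with the minimum total cost (exhaustive search)."""
    t = len(preds)
    targets = list(reals) + [None] * (t - len(reals))
    cost = [[matching_cost(preds[i], confs[i], targets[j]) for j in range(t)]
            for i in range(t)]
    best = min(permutations(range(t)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(t)))
    # For each prediction: index of the matched real segment, or None if empty.
    return [j if j < len(reals) else None for j in best]


def keep_confident(preds, confs, threshold=0.7):
    """Prediction-time filtering: drop segments below the confidence threshold."""
    return [p for p, c in zip(preds, confs) if c >= threshold]


if __name__ == "__main__":
    preds = [(0.1, 0.1, 0.9, 0.1), (0.5, 0.5, 0.6, 0.6), (0.1, 0.9, 0.9, 0.9)]
    confs = [0.95, 0.20, 0.90]
    reals = [(0.1, 0.9, 0.9, 0.9), (0.1, 0.1, 0.9, 0.1)]
    print(match_segments(preds, confs, reals))
    print(keep_confident(preds, confs))
```

For the claimed setting of T = 30, an exhaustive search is infeasible, which is why a polynomial-time method such as the Hungarian algorithm is used in practice.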

3. The document target extraction method as described in claim 2, characterized in that the document edge decoder includes: the edge decoder uses a cross-attention mechanism to determine the similarity between the line segment queries and the guiding features, and uses a self-attention mechanism to decode autoregressively and aggregate the line sequence queries, thereby predicting the relative order of the document edge line segments. The main structure of the edge decoder is a 3-layer DETR decoding module. The input query Q is the sequence query, derived from the output of the line segment decoder. The input key K and the input value V share the same source, both being derived from the guiding features. The output is the line sequence query result, equivalent to order features, which is passed through a feedforward network to obtain the relative order of the line segments. The edge decoding process is shown in the following formulas: [formulas not reproduced]. The line sequence queries are used to retrieve the order features of the line segments; the encoded features output by the encoder are enhanced so as to guide the model toward the order features, and the enhanced features are called guiding features. The encoded features are enhanced using instance activation maps, and the guiding features are then obtained by element-wise addition with the order embedding: [formula not reproduced], where the order embedding has the same dimension as the line sequence query, is randomly initialized, and is updated as the network is trained, and the instance activation map is a target-aware weighting method. After edge decoding, the relative order features are obtained, and a feedforward network is used to predict the relative order between line segments. The four vertices of the document portion are obtained by calculating the intersection points between adjacent line segments. The bounding box of the document target is derived by taking the maximum and minimum values of the vertex coordinates.
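A minimal Python sketch of the vertex and bounding box step: adjacent edge segments are extended to infinite lines, their intersection points give the four vertices, and the bounding box follows from the extreme vertex coordinates. The function names, and the assumption that the segments are already sorted into their predicted relative order, are illustrative and not taken from the claim.

```python
def line_intersection(seg_a, seg_b, eps=1e-9):
    """Intersection of the infinite lines through two segments (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = seg_a
    x3, y3, x4, y4 = seg_b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < eps:
        raise ValueError("segments are parallel; no unique intersection")
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / denom
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / denom
    return px, py


def vertices_from_edges(ordered_edges):
    """Intersect each edge with the next one in the cyclic predicted order."""
    n = len(ordered_edges)
    return [line_intersection(ordered_edges[i], ordered_edges[(i + 1) % n])
            for i in range(n)]


def bounding_box(vertices):
    """Return (x_min, y_min, width, height) from the extreme vertex coordinates."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    return x_min, y_min, x_max - x_min, y_max - y_min


if __name__ == "__main__":
    # Four edge segments (top, right, bottom, left) that stop short of the corners.
    edges = [(20, 10, 80, 10), (100, 20, 100, 50), (80, 60, 20, 60), (0, 50, 0, 20)]
    corners = vertices_from_edges(edges)
    print(corners)
    print(bounding_box(corners))
```

Extending the segments to full lines means the vertices are recovered even when the predicted segments do not reach the corners of the document.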
The size of the bounding box is then obtained from the extreme values of the vertex coordinates:
$$w_{box} = x_{\max} - x_{\min}, \quad h_{box} = y_{\max} - y_{\min}.$$

4. The document target extraction method as described in claim 3, characterized in that the perspective-based stretching and deformation of the ID card image includes: after the target area of the document is extracted from the image, perspective transformation is applied to stretch and deform it to the size of the target bounding box, resulting in an aligned rectangular document image. Perspective transformation stretches and deforms the document target through a linear image deformation: it projects the original image onto the target image while keeping straight lines in the original image straight after the transformation. Perspective transformation is a three-dimensional spatial transformation, and the perspective transformation of the coordinates of any pixel in three-dimensional space is calculated as follows:
$$\begin{bmatrix} x' & y' & w' \end{bmatrix} = \begin{bmatrix} u & v & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix},$$
where $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$ produce linear transformations such as rotation, scaling, and shearing; $a_{31}$, $a_{32}$ produce a translation; $a_{13}$, $a_{23}$ produce a projective transformation; $(u, v)$ are the coordinates of a pixel in the original image; and $(x, y)$ are the corresponding coordinates in the target image. Since the target image lies in a two-dimensional plane, the coordinates in three-dimensional space are divided by $w'$ to obtain $(x, y)$:
$$x = \frac{x'}{w'} = \frac{a_{11}u + a_{21}v + a_{31}}{a_{13}u + a_{23}v + a_{33}}, \quad y = \frac{y'}{w'} = \frac{a_{12}u + a_{22}v + a_{32}}{a_{13}u + a_{23}v + a_{33}}.$$
Bilinear interpolation is selected as the interpolation method, together with a boundary mode [not reproduced]. The perspective transformation is implemented using the GetPerspectiveTransform function in OpenCV; the inputs are the image and the coordinates of the four predicted vertices of the document, and the outputs are the four vertices of the target rectangle.
(1) Loss function. The network's detection target consists of two parts: line segments and bounding boxes. Therefore, the loss function includes the line segment loss.
It also includes the line sequence prediction loss, so the loss function consists of two parts, and the loss of each image is: [formula not reproduced]. In line segment detection, a line segment is represented by its two endpoints; two balancing coefficients are used so that the line segment loss and the line sequence prediction loss are of the same order of magnitude.
1) Line segment loss. The line segment loss consists of two parts: the line loss of the T predicted line segments that are matched to the S actual labeled line segments, and the confidence loss of whether each line segment belongs to an edge of the document: [formula not reproduced], where the formula contains an indicator function and the distance between the two endpoints of a predicted line segment and the two endpoints of the actual labeled line segment.
2) Line sequence prediction loss. The line sequence prediction loss arises from the relative order between the predicted line segments: [formula not reproduced], which is the cross-entropy loss between the predicted order and the actual order.
(2) Network training. Before training on the document dataset, a pre-trained ResNet network was used. During training, the optimizer was Adam, the learning rate was set to 0.0001, and the weight decay was set to 0.0001. The dropout rate of the FFN neurons was 0.1. Four RTX3090 GPUs were used for training, the number of training samples per batch was 8, the maximum number of iterations was 20, and the deep learning framework openmmlab was used.
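The perspective transformation step of claim 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the claimed implementation: it uses the row-vector convention [x', y', w'] = [u, v, 1]·A with the last matrix entry fixed to 1, solves for the remaining eight entries from four point correspondences by Gauss-Jordan elimination, and maps points only, without image interpolation. The helper names and the example coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_matrix(src, dst):
    """3x3 matrix A with [x', y', w'] = [u, v, 1] A mapping 4 src points to dst."""
    rows, rhs = [], []
    for (u, v), (x, y) in zip(src, dst):
        # Unknown order: a11, a21, a31, a12, a22, a32, a13, a23 (a33 fixed to 1).
        rows.append([u, v, 1, 0, 0, 0, -x * u, -x * v])
        rhs.append(x)
        rows.append([0, 0, 0, u, v, 1, -y * u, -y * v])
        rhs.append(y)
    a11, a21, a31, a12, a22, a32, a13, a23 = solve_linear(rows, rhs)
    return [[a11, a12, a13], [a21, a22, a23], [a31, a32, 1.0]]


def warp_point(matrix, point):
    """Apply the transformation and divide by w' to return to the 2-D plane."""
    u, v = point
    xp, yp, wp = (u * matrix[0][k] + v * matrix[1][k] + matrix[2][k]
                  for k in range(3))
    return xp / wp, yp / wp


if __name__ == "__main__":
    # Hypothetical predicted document vertices and a 100 x 50 target rectangle.
    src = [(12, 8), (108, 15), (101, 70), (5, 62)]
    dst = [(0, 0), (100, 0), (100, 50), (0, 50)]
    A = perspective_matrix(src, dst)
    for p in src:
        print(p, "->", warp_point(A, p))
```

With OpenCV, the same step would typically use `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective` with `flags=cv2.INTER_LINEAR` for bilinear interpolation. OpenCV applies its matrix to column vectors, so its matrix is the transpose of `A` in this sketch.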

5. A document target extraction system applying the document target extraction method as described in any one of claims 1 to 4, characterized in that the document target extraction system includes: an image acquisition module, used to acquire multispectral images of documents using an embedded device; a feature extraction module, used to extract multi-scale features from the global image using the ResNet network; a feature encoding module, used to perform feature encoding using the Deformable DETR encoder; and a perspective transformation module, used to obtain the edges, vertices, and bounding box of the ID card image through perspective transformation.

6. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the document target extraction method as described in any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the document target extraction method as described in any one of claims 1 to 4.

8. An information data processing terminal, characterized in that the information data processing terminal is used to implement the document target extraction system as described in claim 5.