A document processing method fusing multi-stage image enhancement and cross-modal verification

By employing multi-stage image enhancement and cross-modal verification, the method addresses the low recognition accuracy and poor batch processing efficiency of existing technologies on blurry and edge-truncated documents, achieving efficient and accurate document processing suitable for the automated scanning and processing of manufacturing documents.

CN122200718A · Pending · Publication Date: 2026-06-12 · SHANGHAI ELECTRIC GRP DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI ELECTRIC GRP DIGITAL TECH CO LTD
Filing Date
2026-03-12
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies have low recognition accuracy when processing blurry or truncated documents, cannot effectively handle complex elements such as tables, columns, and seals, and have poor batch processing efficiency. This results in low data standardization and limits the automation of document processing workflows.

Method used

A multi-stage image enhancement and cross-modal verification approach is adopted, including image enhancement and restoration, cross-modal feature modeling, cross-modal verification, and dynamic parallel processing. Through ultra-wide edge filling, two-stage contrast enhancement, dynamic binarization, and multi-round morphological processing, combined with the multi-modal architecture of ResNet and BERT models, the collaborative extraction and verification of image, text, and layout features are achieved, and an adaptive dynamic thread pool is used for efficient processing.

Benefits of technology

It achieves higher efficiency and accuracy in document processing, improves the recognition accuracy of complex document elements, solves the problems of scanning defects and low batch processing efficiency, and provides a better document processing solution.

✦ Generated by Eureka AI based on patent content.

Figure CN122200718A_ABST

Abstract

The application discloses a document processing method that integrates multi-stage image enhancement and cross-modal verification. The method comprises a document image acquisition step, a multi-stage image enhancement step, a multi-modal feature alignment step, a cross-modal feature extraction step, a high-precision OCR recognition step, a cross-modal cross-verification step, a dynamic parallel processing step, and a data standardization and archiving step, with multi-modal technology running through the whole process of feature extraction and verification. By constructing a full-process technical framework of "image enhancement and repair – cross-modal feature modeling – cross-modal verification – dynamic parallel processing", the method relies on multi-stage image enhancement to repair scanning defects, cross-modal feature verification to correct OCR errors, and a CPU-adaptive dynamic thread pool to improve parallel efficiency. It thereby achieves efficient and accurate document scanning and processing and provides a better document processing solution for intelligent manufacturing.

Description

Technical Field

[0001] This invention relates to the fields of document processing and artificial intelligence technology, and in particular to a document processing method that integrates multi-stage image enhancement and cross-modal verification. It is applicable to the automated scanning and processing of documents in the manufacturing industry, and focuses on solving existing technical problems such as poor recognition of blurry documents, difficulty in processing complex elements, and low batch efficiency.

Background Technology

[0002] In the field of document scanning and processing, existing technologies have significant shortcomings and are unable to meet the core needs of manufacturing enterprises for document processing: (1) OCR (Optical Character Recognition) technology is weak at processing documents with scanning defects such as edge truncation and blurring, and it does not design a targeted enhancement scheme around the characteristics of the text in the document, so broken strokes in the text cannot be recognized effectively; (2) Relying solely on text features for recognition cannot accurately handle complex document elements such as table structures, cross-column content, and seal-related text, resulting in insufficient recognition accuracy; (3) The existing technology has not established a cross-modal verification mechanism, so OCR recognition errors are easily transmitted directly to the final output, resulting in low data standardization. In addition, the serial processing architecture restricts the overall process speed and limits the automation of document processing.

Summary of the Invention

[0003] The purpose of this invention is to overcome the shortcomings of existing document scanning and processing technologies, such as OCR's inability to handle blurry or truncated documents, low recognition accuracy for complex elements such as tables, columns, and seals, and poor batch processing efficiency. This invention provides a document processing method that integrates multi-stage image enhancement and cross-modal verification. By constructing a full-process technical framework of "image enhancement and repair – cross-modal feature modeling – cross-modal verification – dynamic parallel processing," it relies on multi-stage image enhancement to repair scanning defects, cross-modal feature verification to correct OCR errors, and a CPU adaptive dynamic thread pool to improve parallel efficiency. This achieves high efficiency and accuracy in document scanning and processing, providing a better document processing solution for intelligent manufacturing.

[0004] The technical solution to achieve the above objectives is: a document processing method integrating multi-stage image enhancement and cross-modal verification, comprising the following steps:
S1, Document Image Acquisition Steps: Receive scanned images of paper documents or electronic documents, and store the images using a lossless compression format;
S2, Multi-stage image enhancement steps: Design a four-stage enhancement scheme of "ultra-wide edge filling + two-stage contrast enhancement + dynamic binarization + multi-round morphological processing" to repair image defects and obtain the enhanced document image;
S3, Multimodal Feature Alignment Step: Based on the enhanced document image in step S2, integrate multimodal information and perform cross-modal feature alignment to obtain integrated multimodal data;
S4, Cross-modal feature extraction step: Based on the multimodal data integrated in step S3, a fusion architecture of ResNet and BERT language models is constructed to achieve collaborative extraction of features from image, text, and layout modalities;
S5, High-Precision OCR Recognition Steps: Employs a hybrid strategy of "direct extraction priority + OCR recognition" to improve recognition accuracy;
S6, Cross-modal cross-validation step: Establish a cross-validation mechanism for three types of modal features to verify the document data, correct OCR recognition errors, and avoid error propagation;
S7, Dynamic Parallel Processing Step: For the document data verified in step S6, efficient batch processing of documents is achieved through an adaptive dynamic thread pool;
S8, Data Standardization and Archiving Steps: Standardize and securely store the documents processed in step S7.

[0005] In the above-mentioned document processing method that integrates multi-stage image enhancement and cross-modal verification, in step S1, a scanning device or electronic document import module is used to collect scanned images of paper documents or electronic documents. The images may contain scanning defects such as edge truncation, blurring, and uneven lighting.

[0006] In the document processing method described above, which integrates multi-stage image enhancement and cross-modal verification, the four-stage enhancement scheme in step S2 specifically includes the following steps:
S21, Ultra-wide Edge Fill: Based on statistical analysis of manufacturing document samples, it was determined that a 25-pixel white edge should be added around the document image to cover more than 99% of edge truncation scenarios, fill in the truncated edge printing area, and avoid missing edge fields;
S22, Dual-stage contrast enhancement: The first stage normalizes grayscale values to 0-255 using a linear transformation formula, achieving global stretching and resolving text loss caused by excessively dark local areas. The linear transformation formula is:

$$G(x,y) = 255 \times \frac{L(x,y) - L_{\min}}{L_{\max} - L_{\min}} \tag{1}$$

In equation (1), $L(x,y)$ is the original gray value, $L_{\min}$ is the minimum gray value of the image, $L_{\max}$ is the maximum gray value of the image, and $G(x,y)$ is the normalized gray value. The second stage uses the CLAHE algorithm to divide the image into 36 local regions and specifically enhance the contrast of low-light areas;
S23, Dynamic Binarization: First, the document image is denoised using a 3×3 Gaussian blur. Then, the OTSU thresholding method is combined with inverse color binarization: when the pixel gray value is <T, it is set to 255 to represent the text foreground; when the pixel gray value is ≥T, it is set to 0 to represent the background. This automatically adapts to documents with different gray distributions;
S24, Morphological processing: Perform the following operations in sequence: dilate with a 2×2 rectangular structuring element to thicken thin strokes; erode with a 1×1 structuring element to remove tiny noise points with an area < 3 pixels; dilate with a 2×2 rectangular structuring element to repair broken pixels.

[0007] The document processing method described above, which integrates multi-stage image enhancement and cross-modal verification, includes the following process in step S3: Multi-modal feature alignment. S31, Multimodal Data Fusion: Image, structured text, and semi-structured text data are acquired through interfaces respectively; Principal component analysis is used to normalize the dimensions of the above multi-source data, mapping the pixel features of images, word vector features of text, and structural features of tables to the same 1024-dimensional feature space; outliers in the fused feature set are filtered out to remove invalid data with high ambiguity. S32, Cross-modal feature alignment: Based on the image coordinate system, establish coordinate mapping relationships for each modality of data, bind the table line edge coordinates extracted in S24 with the cell numbers of the table text, and generate the "Table Coordinate-Text Position Mapping Table"; Based on the semantic similarity calculation of the BERT language model, correct the semantic deviation between the text and image content, calculate the cosine similarity between the printed text identified in the image and the preset field names, set a threshold ≥0.8, and if it is lower than the threshold, it is corrected by mapping through the thesaurus; Perform consistency verification between the handwritten text and the preset semantics of the corresponding field, and mark the semantic conflict as an item to be verified.

[0008] The document processing method described above, which integrates multi-stage image enhancement and cross-modal verification, includes the following process in step S4: Cross-modal feature extraction. S41, Image and table feature extraction based on ResNet model: 512-dimensional feature vectors are output through convolutional layers to calculate the texture histogram of the seal area, which is used to distinguish the seal from the text; the gradient of the table line edges is extracted based on the Canny edge detection algorithm to determine the coordinates of the table column boundaries, with a fitting error ≤ 1 pixel; S42, Text feature extraction based on the BERT language model: The 768-dimensional text embedding vector is output through the Transformer encoder, and the semantic similarity between adjacent characters is calculated. The semantic similarity adopts cosine similarity and a threshold of ≥0.7 is set to correct semantic conflicts. For documents containing handwriting, the stroke width and tilt angle of the handwriting area are extracted to distinguish between handwriting and printed text and avoid mixed recognition errors. S43, Based on the layout feature extraction of the image processed in step S24: Use connected component analysis to determine the coordinates of the upper left and lower right corners of each column and establish a "column coordinate-semantic label mapping table"; calculate the overlapping area between the text region and adjacent columns. If the overlap rate is ≥30%, it is determined to be a cross-column. Record the original position association relationship of the cross-column text to correct the cross-column attribution error.

[0009] The document processing method described above, which integrates multi-stage image enhancement and cross-modal verification, includes the following process in step S5: High-precision OCR recognition. S51, Direct extraction of copyable text: Call the "Field Coordinates-Semantic Tag Mapping Table" output in step S4 to obtain the boundary coordinates of each text field and accurately locate the text area to be extracted; count the length of the text directly extracted from each field. If the text length is ≥1, it is determined to be a valid extraction, and the text is associated with the field label and stored; if the text length is <1, the field is marked as "to be identified" and the second stage is entered. S52, CRNN (Convolutional Recurrent Neural Network) + CTC (Connectionist Temporal Classification) recognition: A partial image of the "to be recognized" field is cropped to avoid interference from irrelevant areas; the cropped partial image is uniformly scaled to a height of 64 pixels with adaptive width, and the image channel input to the CRNN model is set to grayscale; CTC beam search decoding is used, with a beamwidth of 10, to select the character sequence with the highest probability from the character probability distribution output by the fully connected layer.

[0010] The document processing method described above, which integrates multi-stage image enhancement and cross-modal verification, includes the following process in step S6: S61, Image-Text Modal Verification: Match the image features from step S4 with the OCR text from step S5, calculate the matching degree, which is the Euclidean distance, with a threshold of 0.85. If the matching degree is <0.85, retrieve the template library and delete the misidentified text. S62, Text-to-Text Modality Validation: Based on the text features of S4, calculate the semantic coherence of the text. If the semantic similarity between adjacent characters is <0.7, call the terminology database and perform semantic prediction correction based on the BERT language model. S63, Layout-Text Modality Validation: Compare the layout features from step S4 with the column coordinates. If the text area coordinates exceed the column range, correct the attribution. If the cross-column text is not labeled with association, supplement the label according to the cross-column feature to correct the semantic break.

[0011] In the aforementioned document processing method that integrates multi-stage image enhancement and cross-modal verification, step S8, data standardization and archiving, specifically includes the following processes: S81, Extract Key Data Structures: Extract key fields from documents and convert them into JSON / XML format; S82, Secure Storage: Structured data is encrypted and stored using the AES-256 encryption algorithm; S83, Multi-dimensional Search: Build an index library to support searching by document type, date, and number.

[0012] The document processing method of this invention, which integrates multi-stage image enhancement and cross-modal verification, achieves a breakthrough improvement in document processing performance and accuracy through three major technological innovations: multi-stage image enhancement, cross-modal verification, and dynamic parallel processing. Compared with existing technologies, it has the following beneficial effects: (1) A multi-stage image enhancement scheme is proposed to specifically address the scanning defects of manufacturing documents; (2) A multimodal architecture that integrates ResNet (Residual Network) and BERT (Bidirectional Encoder Representations from Transformers) is constructed to achieve cross-modal feature extraction, alignment and verification; (3) A CPU adaptive thread pool is developed to balance processing efficiency and resource consumption.

Attached Figure Description

[0013] Figure 1 is a flowchart of the document processing method that integrates multi-stage image enhancement and cross-modal verification according to the present invention.

Detailed Implementation

[0014] To enable those skilled in the art to better understand the technical solution of the present invention, its specific embodiments will be described in detail below with reference to the accompanying drawings.

[0015] Referring to Figure 1, this invention provides a document processing method integrating multi-stage image enhancement and cross-modal verification. The method includes document image acquisition, multi-stage image enhancement, multi-modal feature alignment, cross-modal feature extraction, high-precision OCR recognition, cross-modal cross-verification, dynamic parallel processing, and data standardization and archiving. Multi-modal technology is integrated throughout the entire feature extraction and verification process. A detailed description follows.

[0016] S1, Document Image Acquisition Steps: Use a scanning device or electronic document import module to acquire scanned images of paper documents or electronic documents. The images may contain scanning defects such as edge truncation, blurring, and uneven lighting. The images are stored in a lossless compression format to ensure that the original features are not lost.

[0017] S2, Multi-stage Image Enhancement Steps: To address scanning defects, a four-stage enhancement scheme is designed: "ultra-wide edge filling + two-stage contrast enhancement + dynamic binarization + multi-round morphological processing." This scheme repairs image defects, yielding an enhanced document image that provides a high-quality image foundation for subsequent cross-modal feature extraction. The specific process includes the following steps: S21, Ultra-wide Edge Fill: Based on statistical analysis of manufacturing document samples (covering 12 types of documents including purchase orders, inbound orders, outbound orders, return orders, transfer orders, contract approval orders, payment application orders, sales orders, customs declarations, electricity bills, water bills, and expense reports), it was determined to add a 25-pixel white edge (pixel value 255) around the image, covering more than 99% of edge truncation scenarios, filling the edge printing truncation area, and avoiding missing edge fields.
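The edge fill in S21 can be illustrated with a minimal sketch in plain Python, assuming a grayscale image stored as a list of rows. The function name `pad_white` and the toy image are hypothetical; with OpenCV the same effect is usually obtained with `cv2.copyMakeBorder` and a constant border value.

```python
def pad_white(image, margin=25, white=255):
    """Return a copy of a grayscale image (list of rows) with a white border.

    margin=25 and white=255 are the values stated in S21; the list-of-rows
    representation is an assumption made for this sketch.
    """
    if not image or not image[0]:
        raise ValueError("image must contain at least one pixel")
    padded_width = len(image[0]) + 2 * margin
    # Rows of pure white above and below the original content.
    top = [[white] * padded_width for _ in range(margin)]
    bottom = [[white] * padded_width for _ in range(margin)]
    # Original rows, extended with white pixels on the left and right.
    middle = [[white] * margin + list(row) + [white] * margin for row in image]
    return top + middle + bottom


page = [[0, 128], [64, 255], [10, 20]]   # toy image: 3 rows, 2 columns
padded = pad_white(page)
print(len(padded), len(padded[0]))       # 53 52
```

The original pixels end up shifted by 25 positions in both directions, so any coordinates recorded before padding need the same offset.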

[0018] S22, Dual-stage contrast enhancement: The first stage normalizes grayscale values to 0-255 using a linear transformation formula, achieving global stretching and resolving text loss caused by excessively dark local areas. The linear transformation formula is:

$$G(x,y) = 255 \times \frac{L(x,y) - L_{\min}}{L_{\max} - L_{\min}} \tag{1}$$

In equation (1), $L(x,y)$ is the original gray value, $L_{\min}$ is the minimum gray value of the image, $L_{\max}$ is the maximum gray value of the image, and $G(x,y)$ is the normalized gray value. The second stage uses the CLAHE algorithm to divide the image into 36 local regions and specifically enhance the contrast of low-light areas.
S23, Dynamic Binarization: First, the image is denoised by applying a 3×3 Gaussian blur (standard deviation σ=1.0). Then, the OTSU thresholding method is combined with inverse color binarization (cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU): pixels with a gray value <T are set to 255 (foreground text) and pixels with a gray value ≥T are set to 0 (background). This automatically adapts to documents with different gray distributions, such as white and light gray backgrounds.
S24, Morphological processing: Perform the following operations in sequence: dilate with a 2×2 rectangular structuring element (1 iteration) to thicken thin strokes; erode with a 1×1 structuring element to remove minor noise (area < 3 pixels); dilate with a 2×2 rectangular structuring element to repair broken pixels.
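As a rough illustration of the first contrast stage in S22 and the thresholding rule in S23, the sketch below implements the min-max stretch and an Otsu threshold with inverse binarization in plain Python. The function names and the toy image are hypothetical, and the Gaussian blur, CLAHE and morphology stages are deliberately left out; in practice these steps would normally be done with OpenCV, as the text indicates.

```python
def stretch(gray):
    """Global stretch: map [Lmin, Lmax] linearly onto [0, 255]."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption for this sketch: a constant image maps to all zeros.
        return [[0 for _ in row] for row in gray]
    return [[round(255 * (p - lo) / (hi - lo)) for p in row] for row in gray]


def otsu_threshold(gray):
    """Return the threshold T that maximises the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value < t
    sum0 = 0    # sum of those pixel values
    for t in range(1, 256):
        w0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize_inverse(gray, t):
    """Rule from S23: value < T becomes 255 (text), value >= T becomes 0."""
    return [[255 if p < t else 0 for p in row] for row in gray]


gray = [[50, 60, 200], [55, 210, 205]]    # dark "ink" on a lighter background
stretched = stretch(gray)
t = otsu_threshold(stretched)
binary = binarize_inverse(stretched, t)
print(stretched)   # [[0, 16, 239], [8, 255, 247]]
print(t, binary)   # 17 [[255, 255, 0], [255, 0, 0]]
```

Because several thresholds separate the two clusters equally well in this toy image, the sketch keeps the smallest one; a library implementation may report a different but equivalent value.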

[0019] Compared with existing technologies, the advantages of multi-stage image enhancement steps are as follows: (1) The dual-stage enhancement method combines global stretching and local precision enhancement, which improves local contrast compared to existing technologies and effectively solves the problem of text recognition caused by uneven lighting. (2) The dynamic OTSU threshold method (maximum inter-class variance method) is combined with denoising processing to automatically adapt to different gray scale distributions, thereby reducing the text edge loss rate. (3) Multi-round combined morphological operations are used to improve the integrity of strokes and solve the problem of OCR misrecognition caused by stroke breakage.

[0020] S3, Multimodal Feature Alignment Step: Based on the enhanced document image from step S2, multimodal information is integrated and cross-modal feature alignment is performed to obtain integrated multimodal data. Specifically, the process includes the following steps: S31, Multimodal Data Fusion: Image, structured text, and semi-structured text data are acquired through interfaces respectively; Principal Component Analysis (PCA) is used to normalize the dimensions of the above multi-source data, mapping the pixel features of images, word vector features of text, and structural features of tables to the same 1024-dimensional feature space; outliers in the fused feature set (determined by the Z-score algorithm, |Z|>3) are filtered out, and invalid data with excessively high blur (sharpness score <0.6, score calculated based on the Laplacian operator) are removed; S32, Cross-modal feature alignment: Based on the image coordinate system (origin at the top left corner, x-axis to the right, y-axis down), establish coordinate mapping relationships for each modality of data. Bind the table line edge coordinates (x1, y1, x2, y2) extracted in step S24 to the cell numbers of the table text to generate the "Table Coordinates - Text Position Mapping Table". Based on the semantic similarity calculation of the BERT language model, correct the semantic deviation between the text and the image content. Calculate the cosine similarity (threshold ≥ 0.8) between the printed text identified in the image and the preset field names. If it is lower than the threshold, it is corrected by mapping through a thesaurus. Perform consistency verification between the handwritten text and the preset semantics of the corresponding field. If there is a semantic conflict, mark it as an item to be verified.
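The two numeric rules in S31 and S32 (discard samples with |Z| > 3, and flag field matches whose cosine similarity falls below 0.8) can be sketched as follows. This is only an illustration: the two-dimensional vectors stand in for BERT embeddings, and the names `zscore_keep`, `match_field` and the field labels are hypothetical.

```python
import math


def zscore_keep(values, limit=3.0):
    """Return the indices of values whose |Z| does not exceed the limit."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return list(range(n))   # all values identical: nothing is an outlier
    return [i for i, v in enumerate(values) if abs((v - mean) / std) <= limit]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_field(text_vec, field_vecs, threshold=0.8):
    """Return (best field name, similarity, needs_thesaurus_correction)."""
    name, sim = max(
        ((label, cosine(text_vec, vec)) for label, vec in field_vecs.items()),
        key=lambda item: item[1],
    )
    return name, sim, sim < threshold


scores = [1.0] * 19 + [100.0]             # one extreme value among 20 samples
kept = zscore_keep(scores)
print(len(kept))                          # 19

fields = {"invoice_no": [1.0, 0.0], "amount": [0.0, 1.0]}   # toy embeddings
print(match_field([0.9, 0.1], fields)[0::2])   # ('invoice_no', False)
print(match_field([0.6, 0.5], fields)[0::2])   # ('invoice_no', True)
```

With a population standard deviation, a single sample can only exceed |Z| = 3 when the batch holds at least eleven samples, so the filter has no effect on very small batches.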

[0021] S4, Cross-modal feature extraction step: Based on the multimodal data integrated in step S3, a fusion architecture of ResNet and BERT language models is constructed to achieve collaborative extraction of features from three modalities: image, text, and layout. The specific process includes the following steps:
S41, Image and table feature extraction based on ResNet model: 512-dimensional feature vectors are output through convolutional layers to calculate the texture histogram of the stamp area, which is used to distinguish the stamp from the text; the gradient of the table line edges (gradient values in the x and y directions) is extracted based on the Canny edge detection algorithm to determine the coordinates of the table column boundaries, with a fitting error ≤ 1 pixel;
S42, Text feature extraction based on BERT language model: Output 768-dimensional text embedding vector through Transformer encoder, calculate the semantic similarity of adjacent characters (using cosine similarity, threshold ≥0.7) to correct semantic conflicts; For documents containing handwritten characters, extract the stroke width and tilt angle of the handwritten area to distinguish between handwritten and printed characters and avoid mixed recognition errors;
S43, Based on the layout feature extraction of the image processed in step S24: Use connected component analysis to determine the coordinates of the upper left corner (x1, y1) and lower right corner (x2, y2) of each column, and establish the "Column Coordinates-Semantic Tag Mapping Table"; calculate the overlapping area between the text region and adjacent columns (overlap rate ≥ 30% is judged as cross-column), and record the original position association relationship of the cross-column text to correct cross-column attribution errors.
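The cross-column test in S43 can be sketched as below. The text does not define the overlap rate precisely, so this sketch assumes it is the intersection area divided by the area of the text box, and it treats a text region as cross-column when at least two columns reach the 30% rate. Boxes are (x1, y1, x2, y2) in image coordinates; all names and sample coordinates are hypothetical.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def box_area(box):
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def columns_spanned(text_box, columns, min_rate=0.30):
    """Labels of the columns that cover at least min_rate of the text box."""
    area = box_area(text_box)
    if area == 0:
        return []
    return [
        label
        for label, col in columns.items()
        if overlap_area(text_box, col) / area >= min_rate
    ]


def is_cross_column(text_box, columns, min_rate=0.30):
    # Assumed reading of S43: cross-column means two or more columns qualify.
    return len(columns_spanned(text_box, columns, min_rate)) >= 2


columns = {"item": (0, 0, 100, 500), "quantity": (100, 0, 200, 500)}
print(columns_spanned((60, 10, 140, 30), columns))   # ['item', 'quantity']
print(is_cross_column((60, 10, 140, 30), columns))   # True
print(is_cross_column((20, 10, 110, 30), columns))   # False
```

In the last call the box reaches only 10 pixels into the second column, about 11% of its area, so it stays attributed to a single column.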

[0022] S5, High-Precision OCR Recognition Steps: Employing a hybrid strategy of "direct extraction priority + OCR recognition" to improve recognition accuracy, specifically including the following process: S51, Direct extraction of copyable text: Call the "Field Coordinates-Semantic Tag Mapping Table" output in step S4 to obtain the boundary coordinates (x1, y1, x2, y2) of each text field, and accurately locate the text area to be extracted; count the length of the text directly extracted from each field. If the text length is ≥1, it is determined as a valid extraction, and the text is associated with the field label and stored; if the text length is <1, the field is marked as "to be identified" and proceed to the second stage. S52, CRNN+ CTC recognition: Crops a local image of the "to be recognized" field to avoid interference from irrelevant areas; uniformly scales the cropped local image to a height of 64 pixels with adaptive width, and sets the image channel input to the CRNN model to grayscale; uses CTC beam search decoding, sets the beamwidth to 10, and selects the character sequence with the highest probability from the character probability distribution output by the fully connected layer.
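The routing logic of S51 and S52 can be sketched as follows: try direct extraction first, fall back to OCR only when nothing was extracted, and compute the 64-pixel-high input size for the recognizer. The callables `extract_text` and `run_ocr` are hypothetical placeholders for a text-layer reader and a CRNN + CTC recognizer; neither is implemented here.

```python
TARGET_HEIGHT = 64  # input height stated in S52


def crnn_input_size(width, height, target_height=TARGET_HEIGHT):
    """Scale a crop to the target height, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("crop must have a positive size")
    return max(1, round(width * target_height / height)), target_height


def read_field(label, box, extract_text, run_ocr):
    """S51 first; S52 only when the directly extracted text is empty."""
    text = extract_text(box)
    if len(text) >= 1:
        return {"label": label, "text": text, "source": "direct"}
    # Field is "to be identified": hand the cropped region to the recognizer.
    return {"label": label, "text": run_ocr(box), "source": "ocr"}


# Toy stand-ins: one field has a copyable text layer, the other does not.
text_layer = {(0, 0, 200, 40): "PO-2026-001"}
extract = lambda box: text_layer.get(box, "")
ocr = lambda box: "recognized text"

first = read_field("order_no", (0, 0, 200, 40), extract, ocr)
second = read_field("remark", (0, 50, 200, 90), extract, ocr)
print(first["source"], second["source"])   # direct ocr
print(crnn_input_size(300, 40))            # (480, 64)
```

Keeping the recognizer behind a callable means the expensive model only runs for fields where direct extraction returned nothing.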

[0023] S6, Cross-modal cross-validation steps: Establish a cross-validation mechanism for three types of modal features to verify document data, correct OCR recognition errors, and avoid error propagation. Specifically, this includes the following process: S61, Image-Text Modal Verification: Match the image features from step S4 with the OCR text from S5, calculate the matching degree (Euclidean distance, threshold 0.85). If the matching degree < 0.85, retrieve the template library and delete the misidentified text.

[0024] S62, Text-to-Text Modality Validation: Based on the text features from step S4, calculate the semantic coherence of the text. If the semantic similarity between adjacent characters is <0.7, call the terminology database and perform semantic prediction correction based on the BERT language model. S63, Layout-Text Modality Validation: Compare the layout features from step S4 with the column coordinates. If the text area coordinates exceed the column range, correct the attribution. If the cross-column text is not labeled with association, supplement the label according to the cross-column feature to correct the semantic break.
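A minimal sketch of the attribution correction in S63 is shown below. The text only says that attribution is corrected when a text box exceeds its column range; reassigning the box to the column with the largest horizontal overlap is an assumption of this sketch, as are the function names and sample data.

```python
def horizontal_overlap(box, col):
    """Width of the horizontal intersection of two (x1, y1, x2, y2) boxes."""
    return max(min(box[2], col[2]) - max(box[0], col[0]), 0)


def inside(box, col):
    """True when the box lies completely within the column rectangle."""
    return (col[0] <= box[0] and box[2] <= col[2]
            and col[1] <= box[1] and box[3] <= col[3])


def correct_attribution(item, columns):
    """Return a copy of the item whose column label matches its coordinates."""
    box = item["box"]
    current = item["column"]
    if current in columns and inside(box, columns[current]):
        return dict(item, corrected=False)
    # Assumed rule: pick the column sharing the widest horizontal span.
    best = max(columns, key=lambda label: horizontal_overlap(box, columns[label]))
    return dict(item, column=best, corrected=best != current)


columns = {"item": (0, 0, 100, 500), "quantity": (100, 0, 200, 500)}
wrong = {"text": "12", "box": (120, 10, 160, 30), "column": "item"}
right = {"text": "bolt", "box": (10, 10, 60, 30), "column": "item"}

fixed = correct_attribution(wrong, columns)
print(fixed["column"], fixed["corrected"])                # quantity True
print(correct_attribution(right, columns)["corrected"])   # False
```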

[0025] S7, Dynamic Parallel Processing Step: For the document data verified in step S6, efficient batch processing is achieved through an adaptive dynamic thread pool. Thread pool configuration: max_workers = min(32, number of CPU cores × 2). Thread safety mechanism: The form page is loaded using the fitz.Document.load_page method under a mutex lock (threading.Lock). Result concatenation: The processing results of each thread are concatenated in document page number order (using a queue to store page numbers).
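The configuration in S7 can be sketched with the standard library as follows. `FakeDocument` is a hypothetical stand-in that mimics only the `load_page` call, so the example runs without PyMuPDF. Note one substitution: instead of the page-number queue mentioned in the text, the sketch relies on `ThreadPoolExecutor.map`, which returns results in submission order.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDocument:
    """Hypothetical stand-in for a fitz.Document; only load_page is mimicked."""

    def __init__(self, pages):
        self._pages = pages

    def load_page(self, number):
        return self._pages[number]


def pool_size():
    # Rule from the text: max_workers = min(32, CPU cores x 2).
    return min(32, (os.cpu_count() or 1) * 2)


def process_document(doc, page_count, handle_page):
    """Process every page in a thread pool and return results in page order."""
    lock = threading.Lock()

    def work(number):
        with lock:                      # page loading is serialised
            page = doc.load_page(number)
        return handle_page(page)        # per-page work runs concurrently

    with ThreadPoolExecutor(max_workers=pool_size()) as pool:
        # map() yields results in submission order, which is page order here.
        return list(pool.map(work, range(page_count)))


doc = FakeDocument(["page zero", "page one", "page two"])
results = process_document(doc, 3, str.upper)
print(results)   # ['PAGE ZERO', 'PAGE ONE', 'PAGE TWO']
```

Holding the lock only around the page load keeps the shared document object safe while leaving the heavier per-page work outside the critical section.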

[0026] S8, Data Standardization and Archiving Steps: Standardize and store the results after step S7, specifically including the following process: S81, Extract Key Data Structures: Extract key fields from documents and convert them into JSON / XML format; S82, Secure Storage: Structured data is encrypted and stored using the AES-256 encryption algorithm; S83, Multi-dimensional Search: Build an index library to support searching by document type, date, number, and other dimensions.
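A small sketch of S81 and S83 follows: key fields are serialised to JSON and indexed by document type, date and number. The field names (`doc_type`, `date`, `number`, `items`) and the in-memory index are assumptions made for illustration. The AES-256 encryption of S82 is not shown, because the Python standard library has no AES implementation.

```python
import json
from collections import defaultdict

KEYS = ("doc_type", "date", "number")   # hypothetical key fields


def to_record(fields):
    """Keep the key fields and serialise them as a JSON string."""
    record = {key: fields[key] for key in KEYS}
    record["items"] = fields.get("items", [])
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


class DocumentIndex:
    """In-memory index supporting search by type, date and number."""

    def __init__(self):
        self._records = []
        self._by = {key: defaultdict(set) for key in KEYS}

    def add(self, json_text):
        record = json.loads(json_text)
        position = len(self._records)
        self._records.append(record)
        for key, table in self._by.items():
            table[record[key]].add(position)

    def search(self, **criteria):
        hits = set(range(len(self._records)))
        for key, value in criteria.items():
            hits &= self._by[key].get(value, set())
        return [self._records[i] for i in sorted(hits)]


index = DocumentIndex()
index.add(to_record({"doc_type": "purchase_order", "date": "2026-03-12", "number": "PO-001"}))
index.add(to_record({"doc_type": "purchase_order", "date": "2026-03-13", "number": "PO-002"}))
index.add(to_record({"doc_type": "expense_report", "date": "2026-03-12", "number": "ER-001"}))

hits = index.search(doc_type="purchase_order", date="2026-03-12")
print([h["number"] for h in hits])   # ['PO-001']
```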

[0027] In summary, the document processing method of the present invention, which integrates multi-stage image enhancement and cross-modal verification, constructs a full-process technical framework of "image enhancement and repair – cross-modal feature modeling – cross-modal verification – dynamic parallel processing". Relying on multi-stage image enhancement to repair scanning defects, cross-modal feature verification to correct OCR errors, and CPU adaptive dynamic thread pool to improve parallel efficiency, it achieves high efficiency and accuracy in document scanning processing, providing a better document processing solution for intelligent manufacturing.

[0028] Those skilled in the art should recognize that the above embodiments are merely illustrative of the present invention and are not intended to limit the present invention. Any variations or modifications to the above embodiments that are within the spirit and essence of the present invention will fall within the scope of the claims of the present invention.

Claims

1. A document processing method integrating multi-stage image enhancement and cross-modal verification, characterized in that it includes the following steps:
S1, Document Image Acquisition Steps: Receive scanned images of paper documents or electronic documents, and store the images using a lossless compression format;
S2, Multi-stage image enhancement steps: Design a four-stage enhancement scheme of "ultra-wide edge filling + two-stage contrast enhancement + dynamic binarization + multi-round morphological processing" to repair image defects and obtain the enhanced document image;
S3, Multimodal Feature Alignment Step: Based on the enhanced document image in step S2, integrate multimodal information and perform cross-modal feature alignment to obtain integrated multimodal data;
S4, Cross-modal feature extraction step: Based on the multimodal data integrated in step S3, a fusion architecture of ResNet and BERT language models is constructed to achieve collaborative extraction of features from image, text, and layout modalities;
S5, High-Precision OCR Recognition Steps: Employs a hybrid strategy of "direct extraction priority + OCR recognition" to improve recognition accuracy;
S6, Cross-modal cross-validation step: Establish a cross-validation mechanism for three types of modal features to verify the document data, correct OCR recognition errors, and avoid error propagation;
S7, Dynamic Parallel Processing Step: For the document data verified in step S6, efficient batch processing of documents is achieved through an adaptive dynamic thread pool;
S8, Data Standardization and Archiving Steps: Standardize and securely store the documents processed in step S7.

2. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 1, characterized in that, in step S1, a scanning device or electronic document import module is used to collect scanned images of paper documents or electronic documents, wherein the images may contain scanning defects such as edge truncation, blurring, and uneven lighting.

3. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 1, characterized in that, in step S2, the four-stage enhancement scheme specifically includes the following process:
S21, Ultra-wide Edge Fill: Based on statistical analysis of manufacturing document samples, it was determined that a 25-pixel white edge should be added around the document image to cover more than 99% of edge truncation scenarios, fill in the truncated edge printing area, and avoid missing edge fields;
S22, Dual-stage contrast enhancement: The first stage normalizes the grayscale value to 0-255 through a linear transformation formula to achieve global stretching and solve the problem of text loss caused by local over-darkness. The formula for linear transformation is:

$$G(x,y) = 255 \times \frac{L(x,y) - L_{\min}}{L_{\max} - L_{\min}} \tag{1}$$

In equation (1), $L(x,y)$ is the original gray value, $L_{\min}$ is the minimum gray value of the image, $L_{\max}$ is the maximum gray value of the image, and $G(x,y)$ is the normalized gray value. The second stage uses the CLAHE algorithm to divide the image into 36 local regions and specifically enhance the contrast of low-light areas;
S23, Dynamic Binarization: First, the document image is denoised using a 3×3 Gaussian blur. Then, the OTSU thresholding method is combined with inverse color binarization: when the pixel gray value is <T, it is set to 255 to represent the text foreground; when the pixel gray value is ≥T, it is set to 0 to represent the background. This automatically adapts to documents with different gray distributions;
S24, Morphological processing: Perform the following operations in sequence: dilate with a 2×2 rectangular structuring element to thicken thin strokes; erode with a 1×1 structuring element to remove tiny noise points with an area of <3 pixels; dilate with a 2×2 rectangular structuring element to repair broken pixels.

4. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 3, characterized in that, in step S3, the multimodal feature alignment specifically includes the following process: S31, multimodal data fusion: image, structured text, and semi-structured text data are each acquired through interfaces; principal component analysis is used to normalize the dimensionality of this multi-source data, mapping the pixel features of images, the word vector features of text, and the structural features of tables into the same 1024-dimensional feature space; outliers in the fused feature set are filtered out to remove invalid data with high ambiguity; S32, cross-modal feature alignment: based on the image coordinate system, coordinate mapping relationships are established for each modality of data, the table line edge coordinates extracted in S24 are bound to the cell numbers of the table text, and a "table coordinate-text position mapping table" is generated; based on semantic similarity calculation with the BERT language model, semantic deviations between the text and the image content are corrected, wherein the cosine similarity between the printed text recognized in the image and the preset field names is calculated against a threshold of ≥0.8, and text falling below the threshold is corrected by mapping through the thesaurus; consistency verification is performed between the handwritten text and the preset semantics of the corresponding field, and semantic conflicts are marked as items to be verified.
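The threshold rule in S32 can be sketched as follows. This is an illustration under stated assumptions: the short vectors stand in for BERT embeddings, `match_field` and its return labels are hypothetical names, and the final "to_verify" fallback for text that is also absent from the thesaurus is an assumption not stated in the claim.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_field(text, text_vec, field_vecs, thesaurus, threshold=0.8):
    """Map recognized printed text to a preset field name.

    Returns (field_name, how), where `how` records which rule decided.
    """
    best_name, best_sim = None, -1.0
    for name, vec in field_vecs.items():
        sim = cosine(text_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name, "similarity"
    if text in thesaurus:  # below threshold: correct via thesaurus mapping
        return thesaurus[text], "thesaurus"
    return None, "to_verify"  # assumed fallback, not part of the claim
```

With field vectors `{"Order No.": [1.0, 0.0], "Date": [0.0, 1.0]}`, a text vector of `[0.9, 0.1]` clears the 0.8 threshold for "Order No.", whereas `[0.6, 0.6]` scores about 0.71 against both fields and is resolved through the thesaurus instead.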

5. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 4, characterized in that, in step S4, the cross-modal feature extraction specifically includes the following process: S41, image and table feature extraction based on the ResNet model: 512-dimensional feature vectors are output through the convolutional layers and used to calculate the texture histogram of the seal area, which distinguishes seals from text; the gradients of the table line edges are extracted based on the Canny edge detection algorithm to determine the coordinates of the table column boundaries, with a fitting error of ≤1 pixel; S42, text feature extraction based on the BERT language model: 768-dimensional text embedding vectors are output by the Transformer encoder, and the semantic similarity between adjacent characters is calculated, wherein the semantic similarity is the cosine similarity and a threshold of ≥0.7 is set to correct semantic conflicts; for documents containing handwritten characters, the stroke width and tilt angle of the handwritten area are extracted to distinguish handwritten from printed characters and avoid mixed recognition errors; S43, layout feature extraction based on the image processed in step S24: connected component analysis is used to determine the coordinates of the upper-left and lower-right corners of each column and to establish a "column coordinate-semantic label mapping table"; the overlapping area between a text region and the adjacent columns is calculated, and if the overlap rate is ≥30%, the text is determined to be cross-column; the original positional association of the cross-column text is recorded to correct cross-column attribution errors.
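The cross-column test in S43 reduces to rectangle intersection. The sketch below assumes boxes are `(x1, y1, x2, y2)` tuples giving the upper-left and lower-right corners, and assumes the overlap rate is the intersection area divided by the area of the text region; the claim does not state the denominator, and the function names are hypothetical.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def cross_column_neighbours(text_box, home, columns, min_rate=0.30):
    """Return [(label, rate)] for columns other than `home` that the text
    region overlaps by at least `min_rate` of its own area."""
    area = box_area(text_box)
    if area == 0:
        return []
    hits = []
    for label, box in columns.items():
        if label == home:
            continue
        rate = overlap_area(text_box, box) / area
        if rate >= min_rate:
            hits.append((label, rate))
    return hits
```

A non-empty result marks the text as cross-column, and the returned labels are the association that S43 asks to be recorded.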

6. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 5, characterized in that, in step S5, the high-precision OCR recognition specifically includes the following process: S51, direct extraction of copyable text: the "Field Coordinates-Semantic Tag Mapping Table" output in step S4 is called to obtain the boundary coordinates of each text field and accurately locate the text area to be extracted; the length of the text directly extracted from each field is counted, and if the text length is ≥1, the extraction is determined to be valid and the text is associated with the field label and stored; if the text length is <1, the field is marked as "to be recognized" and the second stage is entered; S52, CRNN + CTC recognition: a local image of each "to be recognized" field is cropped out to avoid interference from irrelevant areas; the cropped local image is uniformly scaled to a height of 64 pixels with adaptive width, and the image input to the CRNN model is set to a single grayscale channel; CTC beam search decoding is used with the beam width set to 10, selecting the character sequence with the highest probability from the character probability distribution output by the fully connected layer.
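The routing rule of S51, the height-64 scaling of S52, and CTC decoding with a beam width of 10 can be sketched as below. This is a generic CTC prefix beam search, not necessarily the decoder used by the applicant. The per-frame probability lists are hand-written stand-ins for the CRNN output, index 0 of `labels` is assumed to be the CTC blank, and all function names are hypothetical.

```python
from collections import defaultdict

TARGET_HEIGHT = 64  # S52: fixed input height, width follows the aspect ratio


def route_field(direct_text):
    """S51: keep directly extracted text, otherwise defer to the recognizer."""
    return "extracted" if len(direct_text) >= 1 else "to be recognized"


def scaled_size(width, height, target_height=TARGET_HEIGHT):
    """Return (new_width, new_height) that preserves the aspect ratio."""
    return max(1, round(width * target_height / height)), target_height


def ctc_beam_search(frames, labels, beam_width=10, blank=0):
    """Decode per-frame label probabilities with CTC prefix beam search.

    frames: list of probability lists, one per time step.
    labels: labels[i] is the character for index i; labels[blank] is unused.
    """
    # Each prefix tracks (prob. of paths ending in blank, ending in non-blank).
    beams = {(): (1.0, 0.0)}
    for frame in frames:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_char) in beams.items():
            for idx, p in enumerate(frame):
                if idx == blank:
                    candidates[prefix][0] += (p_blank + p_char) * p
                elif prefix and prefix[-1] == idx:
                    # A repeated label collapses unless a blank separates it.
                    candidates[prefix][1] += p_char * p
                    candidates[prefix + (idx,)][1] += p_blank * p
                else:
                    candidates[prefix + (idx,)][1] += (p_blank + p_char) * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(probs) for prefix, probs in ranked[:beam_width]}
    best = max(beams, key=lambda prefix: sum(beams[prefix]))
    return "".join(labels[idx] for idx in best)
```

Unlike greedy decoding, the search sums the probability of every alignment that collapses to the same prefix, so a sequence supported by many mediocre alignments can outrank one supported by a single strong alignment.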

7. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 6, characterized in that, in step S6, the cross-modal cross-validation specifically includes the following process: S61, image-text modality validation: the image features from step S4 are matched against the OCR text from step S5 and the matching degree, which is the Euclidean distance, is calculated with a threshold of 0.85; if the matching degree is <0.85, the template library is retrieved and the misrecognized text is deleted; S62, text-text modality validation: based on the text features from step S4, the semantic coherence of the text is calculated; if the semantic similarity between adjacent characters is <0.7, the terminology database is called and semantic prediction correction is performed based on the BERT language model; S63, layout-text modality validation: the layout features from step S4 are compared with the column coordinates; if the coordinates of a text area exceed the column range, the attribution is corrected; if cross-column text has not been labeled with an association, the label is supplemented according to the cross-column features to correct the semantic break.
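The attribution check in S63 can be sketched as a containment test followed by a reassignment. The claim does not say how the corrected column is chosen. Picking the column with the largest intersection is an assumption made here for illustration, as are the function names and the `(x1, y1, x2, y2)` box layout.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def intersection(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def validate_attribution(text_box, assigned, columns):
    """Return (column_label, corrected) for one text region.

    If the region exceeds its assigned column, it is reassigned to the
    column it overlaps most (assumed rule, not stated in the claim).
    """
    if contains(columns[assigned], text_box):
        return assigned, False
    best = max(columns, key=lambda label: intersection(text_box, columns[label]))
    return best, best != assigned
```

A region that straddles two columns equally keeps whichever column comes first in `columns`, so such cases are better handled by the cross-column association label than by reassignment.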

8. The document processing method integrating multi-stage image enhancement and cross-modal verification according to claim 1, characterized in that, in step S8, the data standardization and archiving specifically includes the following process: S81, structured extraction of key data: key fields are extracted from the documents and converted into JSON/XML format; S82, secure storage: the structured data is encrypted with the AES-256 encryption algorithm and stored; S83, multi-dimensional retrieval: an index library is built to support searching by document type, date, and number.
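S81 and S83 can be illustrated with a small in-memory sketch. The record layout (`type`, `date`, `number`, `fields`) and the `DocumentIndex` class are hypothetical, and the AES-256 encryption of S82 is deliberately left out: the Python standard library has no AES implementation, so a vetted cryptography library would be needed for that step.

```python
import json
from collections import defaultdict


def to_record(doc_type, date, number, fields):
    """S81: gather key fields into a JSON-serializable structure."""
    return {"type": doc_type, "date": date, "number": number, "fields": fields}


class DocumentIndex:
    """S83: index stored JSON payloads by document type, date, and number."""

    def __init__(self):
        self._payloads = []  # JSON strings (these would be encrypted under S82)
        self._by = {
            "type": defaultdict(set),
            "date": defaultdict(set),
            "number": defaultdict(set),
        }

    def add(self, record):
        doc_id = len(self._payloads)
        self._payloads.append(json.dumps(record, ensure_ascii=False, sort_keys=True))
        for key, index in self._by.items():
            index[record[key]].add(doc_id)
        return doc_id

    def search(self, **criteria):
        """Return records matching every given criterion (type, date, number)."""
        ids = None
        for key, value in criteria.items():
            matches = self._by[key].get(value, set())
            ids = matches if ids is None else ids & matches
        return [json.loads(self._payloads[i]) for i in sorted(ids or [])]
```

Calling `search(type="delivery note", date="2026-03-13")` intersects the two per-key sets, so adding a further search dimension only requires one more entry in `_by`.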