An automatic archive page number detection method and system
By calculating a weighted sum of visual recognition, continuity, and regularization confidence scores and combining it with a multi-round closed-loop correction strategy, the method addresses the low accuracy of existing OCR systems in recognizing handwritten document page numbers. This enables automatic recognition and error correction of document page numbers, improving the automation and accuracy of document quality inspection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGXI RUANYUN TECH CORP LTD
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-23
AI Technical Summary
Existing OCR systems have low accuracy in recognizing page numbers in documents with handwriting, slant, blurriness, diverse fonts, and text interference. They often produce logical errors such as reversed page number order, confusing similar numbers, repeated page numbers, skipped pages, or omissions.
An intelligent document page number detection framework is constructed by calculating the weighted sum of visual recognition confidence, continuity confidence, and regularized verification confidence with dynamically adjusted weights, and combining it with a multi-round closed-loop correction strategy that includes lightweight and enhanced correction processes, multi-source confidence fusion, and digit reversal detection.
It enables automatic identification and error correction of document page numbers, improving the automation and accuracy of document quality inspection. It can identify and correct page number errors, classify anomalies based on their reliability, and reduce manual intervention.
Figure CN121768002B_ABST
Description
Technical Field
[0001] This invention belongs to the field of archival digitization and computer vision recognition technology, and specifically relates to an automatic archival page number detection method and system.
Background Technology
[0002] In the process of digitizing and quality inspection of archives, each page of the archive image usually needs to be labeled with a page number to ensure the order and integrity of the archives. Traditional methods often rely on manual verification or automatic detection based on OCR templates. However, in practice, page numbers are often handwritten, written in inconsistent positions (often the lower left or lower right corner), and may be slanted, blurry, written in diverse fonts, or subject to interference from the main text. Existing OCR systems have low accuracy in recognizing these types of page numbers, often resulting in the following issues:
[0003] Page number digits are recognized in reverse order (for example, 009 is recognized as 900);
[0004] Numbers that look similar can be confused (3 is recognized as 8, 2 as 7, and 5 as 6);
[0005] Logical errors such as duplicate page numbers, skipped pages, or omissions.
Summary of the Invention
[0006] Based on this, the present invention provides an automatic page number detection method and system for archives, which aims to automatically detect and correct page number errors, thereby improving the automation and accuracy of archive quality inspection.
[0007] A first aspect of this invention provides a method for automatically detecting document page numbers, the method comprising:
[0008] Each page of the document is identified, the corresponding page number is extracted, and the overall confidence score of each page is calculated. The overall confidence score is obtained by weighted summation of visual recognition confidence score, continuity confidence score, and regularity verification confidence score.
[0009] The overall confidence level is compared with the overall confidence level range, different correction strategies are dynamically triggered, and a temporary page number sequence and the corrected overall confidence level are output.
[0010] Based on the corrected overall confidence level, the temporary page number sequence is subjected to multiple rounds of closed-loop correction;
[0011] Based on the correction results, the credibility level of the abnormal page numbers in the archives is classified.
[0012] Furthermore, in the step of identifying each page of the archive, extracting the corresponding page number, and calculating the comprehensive confidence of each page, the weights of visual recognition confidence, continuity confidence, and regularity verification confidence are adjusted according to the preprocessing features of the archive page number. The preprocessing features include at least ambiguity, format uniformity, and font type.
[0013] If the document page number is printed and the ambiguity is less than the preset ambiguity threshold, then increase the weight corresponding to the visual recognition confidence level.
[0014] If the file page number is handwritten and the ambiguity is greater than or equal to the preset ambiguity threshold, then increase the weights corresponding to the continuity confidence and regularity validation confidence.
[0015] If the page number format is consistent, increase the weight corresponding to the confidence score of the regular expression validation.
[0016] Furthermore, in the step of comparing the overall confidence level with the overall confidence level interval, dynamically triggering different correction strategies, and outputting the temporary page number sequence and the corrected overall confidence level, three overall confidence level intervals are preset, delimited by a high confidence threshold $T_{high}$, a medium confidence threshold $T_{mid}$, and a low confidence threshold $T_{low}$, and a corresponding correction strategy is triggered based on the interval containing the overall confidence level. Specifically, this includes:
[0017] When the overall confidence level is greater than or equal to the high confidence threshold $T_{high}$, the recognition result is deemed reliable, no correction is triggered, and the page number is directly included in the temporary page number sequence;
[0018] When the overall confidence level is between the low confidence threshold $T_{low}$ and the high confidence threshold $T_{high}$, a lightweight correction process is executed. If the page number has two or more digits, the digits are reversed first, and the corrected first overall confidence level is calculated;
[0019] Determine whether the corrected first overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0020] If so, output the temporary page number sequence and terminate the lightweight correction;
[0021] If not, then based on the preset confused number set S={(3,8),(8,3),(2,7),(7,2),(5,6),(6,5),(9,0),(0,9)}, perform single character replacement and recalculate the corrected second overall confidence level;
[0022] Determine whether the corrected second overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0023] If so, output the temporary page number sequence;
[0024] When the overall confidence level is less than or equal to the low confidence threshold $T_{low}$, the enhanced correction process is executed, in which number reversal correction and confused number set replacement are applied in combination, in sequence, to obtain the first candidate page number;
[0025] A plausible value is predicted from the trend of adjacent page numbers to obtain the second candidate page number;
[0026] Select the page number with the higher overall confidence level from the first candidate page number and the second candidate page number as the corrected page number, recalculate the corrected third overall confidence level, and output a temporary page number sequence.
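The interval routing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Strategy` and `select_strategy` are hypothetical, and the default threshold values are illustrative assumptions only.

```python
from enum import Enum


class Strategy(Enum):
    NONE = "no correction"
    LIGHT = "lightweight correction"
    ENHANCED = "enhanced correction"


def select_strategy(confidence, t_high=0.8, t_low=0.5):
    """Pick a correction strategy from the overall confidence level.

    t_high and t_low stand for the high and low confidence thresholds;
    the defaults are illustrative assumptions.
    """
    if confidence >= t_high:
        return Strategy.NONE      # result deemed reliable, keep as recognized
    if confidence <= t_low:
        return Strategy.ENHANCED  # combined correction plus trend prediction
    return Strategy.LIGHT         # reversal first, then single-digit replacement
```

Note that the medium threshold plays no part in this routing; in the text it is used only by the later closed-loop correction and the anomaly grading.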
[0027] Furthermore, the step of performing multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence level includes:
[0028] Obtain the corrected overall confidence level, and determine whether the corrected overall confidence level is less than the medium confidence threshold $T_{mid}$;
[0029] If so, the range of confused number replacement in the temporary page number sequence is expanded, and the overall confidence level is recalculated to obtain the first target confidence level;
[0030] Determine whether the first target confidence level is less than the medium confidence threshold $T_{mid}$;
[0031] If so, the page number is corrected again in conjunction with the page number format rules, and the overall confidence level is recalculated to obtain the second target confidence level;
[0032] Determine whether the second target confidence level is greater than or equal to the medium confidence threshold $T_{mid}$;
[0033] If so, output the target temporary page number sequence.
[0034] Furthermore, in the step of classifying the anomaly confidence level of the archive page numbers based on the correction results, the page number difference is calculated for the target temporary page number sequence, and the anomaly classification is performed by combining the comprehensive confidence levels before and after correction, which includes three anomaly levels.
[0035] Level 1 anomaly: The overall confidence level before correction is less than or equal to the low confidence threshold $T_{low}$, the corrected overall confidence level is greater than or equal to the high confidence threshold $T_{high}$, and the absolute value of the difference between adjacent page numbers is less than or equal to 1. The error is automatically corrected and output without manual intervention.
[0036] Level 2 anomaly: The overall confidence level before correction is greater than the low confidence threshold $T_{low}$ and less than the high confidence threshold $T_{high}$, and the corrected overall confidence level is greater than or equal to the medium confidence threshold $T_{mid}$ and less than the high confidence threshold $T_{high}$; alternatively, the corrected overall confidence level is greater than or equal to the high confidence threshold $T_{high}$ but the absolute value of the difference between adjacent page numbers is greater than or equal to 1. In either case, the abnormal position and a correction suggestion are marked, triggering manual review;
[0037] Level 3 anomaly: The overall confidence level before correction is greater than or equal to the high confidence threshold $T_{high}$ and the absolute value of the difference between adjacent page numbers is greater than or equal to 1. The page is marked as a physical anomaly in the archive and output to the archive processing stage.
[0038] Furthermore, the expression for the corrected overall confidence level is:
[0039] ;
[0040] where the symbols of the expression denote, in order, the overall confidence level before correction, the corrected overall confidence level, the fusion coefficient λ, the upper limit of the deviation, the corrected page number of page i, and the corrected page number of page i-1.
[0041] A second aspect of the present invention provides an automatic document page number detection system for implementing an automatic document page number detection method provided in the first aspect of the present invention, the system comprising:
[0042] The recognition module is used to recognize each page of the document, extract the corresponding page number, and calculate the overall confidence of each page. The overall confidence is obtained by weighted summation of visual recognition confidence, continuity confidence, and regularity verification confidence.
[0043] The triggering module is used to compare the overall confidence level with the overall confidence level range, dynamically trigger different correction strategies, and output a temporary page number sequence and the corrected overall confidence level;
[0044] The correction module is used to perform multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence level.
[0045] The grading module is used to grade the credibility of abnormal page numbers in archives based on the correction results.
[0046] A third aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the automatic document page number detection method provided in the first aspect.
[0047] A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the program to implement the automatic document page number detection method provided in the first aspect.
[0048] This invention provides an automatic page number detection method and system for archives. It identifies each page of an archive, extracts the corresponding page number, and calculates the overall confidence score for each page. The overall confidence score is obtained by weighted summation of visual recognition confidence score, continuity confidence score, and regularization verification confidence score. The overall confidence score is compared with an overall confidence score interval, dynamically triggering different correction strategies and outputting a temporary page number sequence and the corrected overall confidence score. Based on the corrected overall confidence score, the temporary page number sequence undergoes multiple rounds of closed-loop correction. Based on the correction results, the anomaly confidence level of the archive page numbers is graded. Specifically, an intelligent page number detection algorithm framework is constructed through multi-source confidence score fusion, number reversal detection, and similar number confusion restoration, achieving full automation of the entire process of automatic page number recognition, error correction, and continuity anomaly detection.
Attached Figure Description
[0049] Figure 1 is a flowchart of the automatic page number detection method for archives provided in Embodiment 1 of the present invention;
[0050] Figure 2 is a structural block diagram of an automatic document page number detection system provided in Embodiment 2 of the present invention;
[0051] Figure 3 is a structural block diagram of an electronic device provided in Embodiment 3 of the present invention.
Detailed Implementation
[0052] To facilitate understanding of the present invention, a more complete description will be given below with reference to the accompanying drawings. Several embodiments of the invention are illustrated in the drawings. However, the invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
[0053] It should be noted that when a component is said to be "fixed to" another component, it can be directly on the other component or there may be an intervening component. When a component is said to be "connected to" another component, it can be directly connected to the other component or there may be an intervening component. The terms "vertical," "horizontal," "left," "right," and similar expressions used in this document are for illustrative purposes only.
[0054] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0055] Embodiment 1
[0056] Referring to Figure 1, which illustrates the automatic document page number detection method according to Embodiment 1 of the present invention, the method specifically includes steps S01 to S04.
[0057] Step S01: Identify each page of the document, extract the corresponding page number, and calculate the overall confidence level of each page.
[0058] Specifically, the overall confidence level is obtained by weighted summation of visual recognition confidence level, continuity confidence level, and regularization verification confidence level, and can be expressed as:
[0059] $C = \alpha C_v + \beta C_s + \gamma C_r \quad (\alpha + \beta + \gamma = 1)$;
[0060] Here, $C_v$ is the visual recognition confidence, $C_s$ is the continuity confidence, $C_r$ is the regularization verification confidence, and α, β, and γ are the weights of the corresponding confidence levels. The visual recognition confidence is an indicator (range [0,1]) of how reliably the OCR or visual language model has visually matched the document page number. To compute it, character-level raw scores (such as matching similarity, prediction probability, or converted logarithmic probability values) are first extracted from traditional OCR, deep learning OCR, or a visual language model. The scores are normalized to a common scale using Min-Max or Sigmoid; for multi-digit page numbers they are then fused by arithmetic mean, geometric mean, or position-weighted fusion (giving more weight to high-order digits). A penalty coefficient is then applied in conjunction with the document page number format constraints. Finally, abnormal characters are removed and a clarity-related correction is applied to handle handwriting, blurriness, tilt, and other special scenarios, yielding a robust visual confidence result. The continuity confidence is calculated as:
[0061] ;
[0062] where $p_i$ is the page number of page i, $p_{i-1}$ is the page number of page i-1, and $\Delta_{max}$ is the preset maximum permissible page number deviation, i.e., the upper limit of the deviation.
[0063] The regularization verification confidence is used to verify the reasonableness of the page number format (such as the number of digits and the range of values). When the format conforms to the preset rules, $C_r = 1$; otherwise a value $0 < C_r < 1$ is assigned according to the degree of deviation.
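As a minimal sketch of the weighted fusion and the format check, the following assumes a fixed-width, digits-only page number format and a linear penalty per deviation; both are illustrative assumptions, as are the names `format_confidence` and `fuse`. The continuity confidence is taken as an input because its formula is not reproduced in this text.

```python
import re


def format_confidence(page_text, digits=3):
    """Hypothetical C_r: 1.0 for exactly `digits` ASCII digits, else a value in (0, 1)."""
    if re.fullmatch(rf"[0-9]{{{digits}}}", page_text):
        return 1.0
    # Count non-digit characters and the gap from the expected length.
    non_digits = sum(ch not in "0123456789" for ch in page_text)
    length_gap = abs(len(page_text) - digits)
    deviation = non_digits + length_gap
    # Linear penalty with a small floor so the result stays above 0 (assumption).
    return max(0.05, 1.0 - 0.3 * deviation)


def fuse(c_v, c_s, c_r, alpha, beta, gamma):
    """Overall confidence C = alpha*C_v + beta*C_s + gamma*C_r, weights summing to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * c_v + beta * c_s + gamma * c_r
```

For example, `fuse(0.9, 1.0, format_confidence("012"), 0.55, 0.225, 0.225)` combines a visual score of 0.9 with perfect continuity and format scores.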
[0064] It should be noted that the weights of visual recognition confidence, continuity confidence, and regularization verification confidence are adjusted based on the preprocessing features of the archival page numbers. These preprocessing features include at least ambiguity, format uniformity, and font type. Ambiguity, that is, the degree of blur, is a preprocessing index characterizing the clarity of the archival page number image (value range [0,1], with higher values indicating a more blurred image). To compute it, the ROI (Region of Interest) containing the page number is first located through region detection to eliminate irrelevant interference such as body text. The region is then converted to a grayscale image and lightly denoised using Gaussian filtering. Subsequently, the variance of the second derivative of the image pixels is calculated using the Laplacian operator (or the edge gradient is extracted using the Sobel operator and its mean amplitude is calculated, along with the image grayscale entropy). Finally, the calculation results are normalized using Min-Max and mapped to the [0,1] interval. The smaller the edge gradient amplitude, the smaller the Laplacian variance, or the lower the grayscale entropy, the higher the corresponding ambiguity value. Font types include handwritten and printed fonts.
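The Laplacian-variance variant of this measure can be sketched in pure Python as below. It assumes the ROI is already cropped and grayscale (a list of pixel rows), skips the Gaussian denoising step, and uses hypothetical Min-Max bounds `var_min` and `var_max` that would need calibrating on real scans.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a list of equal-length rows of intensities, at least 3x3.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def blurriness(gray, var_min=0.0, var_max=1000.0):
    """Map Laplacian variance to [0, 1]; higher means more blurred.

    var_min and var_max are assumed normalization bounds, not source values.
    """
    v = laplacian_variance(gray)
    clipped = min(max(v, var_min), var_max)
    sharpness = (clipped - var_min) / (var_max - var_min)
    return 1.0 - sharpness
```

A uniformly gray patch has zero Laplacian variance and therefore maps to the maximum value, while a high-contrast pattern maps toward 0.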
[0065] If the document page number is printed and the ambiguity is less than the preset ambiguity threshold, the weight corresponding to the visual recognition confidence is increased. In this embodiment of the invention, α∈[0.5,0.6], β∈[0.2,0.25], and γ∈[0.2,0.25] are set to prioritize the trust of the visual recognition result.
[0066] If the file page number is handwritten and the ambiguity is greater than or equal to the preset ambiguity threshold, then the weights corresponding to the continuity confidence and regularity verification confidence are increased. In this embodiment of the invention, α∈[0.3,0.4], β∈[0.35,0.4], and γ∈[0.25,0.3] are set, prioritizing logical continuity and format rationality.
[0067] If the page number format is uniform (e.g., fixed as 2 or 3 digits), the weight corresponding to the regular expression validation confidence is increased. In this embodiment of the invention, α∈[0.4,0.45], β∈[0.25,0.3], and γ∈[0.25,0.3] are set to strengthen the regular expression validation weight.
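The three weighting rules can be sketched as a simple lookup. The specific values are picked from inside the stated ranges so that they sum to 1; the rule precedence, the fallback weights, the default threshold, and the names `Weights` and `select_weights` are assumptions, since the text does not specify them.

```python
from typing import NamedTuple


class Weights(NamedTuple):
    alpha: float  # weight of visual recognition confidence
    beta: float   # weight of continuity confidence
    gamma: float  # weight of regularization verification confidence


def select_weights(font_type, blur, uniform_format, blur_threshold=0.5):
    """Choose (alpha, beta, gamma) from preprocessing features.

    blur_threshold is an assumed value; the rules are checked in the
    order in which the text lists them.
    """
    if font_type == "printed" and blur < blur_threshold:
        return Weights(0.55, 0.225, 0.225)  # trust the visual result first
    if font_type == "handwritten" and blur >= blur_threshold:
        return Weights(0.35, 0.375, 0.275)  # lean on continuity and format
    if uniform_format:
        return Weights(0.45, 0.275, 0.275)  # uniform-format range from the text
    return Weights(0.4, 0.3, 0.3)           # fallback, not taken from the source
```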
[0068] Step S02: Compare the overall confidence level with the overall confidence level range, dynamically trigger different correction strategies, and output the temporary page number sequence and the corrected overall confidence level.
[0069] Specifically, three overall confidence intervals are preset, delimited by a high confidence threshold $T_{high}$, a medium confidence threshold $T_{mid}$, and a low confidence threshold $T_{low}$. In this embodiment of the invention, $T_{high} \in [0.75, 0.85]$, $T_{mid} \in [0.6, 0.7]$, and $T_{low} \in [0.45, 0.55]$. A corresponding correction strategy is triggered based on the interval containing the overall confidence level, specifically including:
[0070] When the overall confidence level is greater than or equal to the high confidence threshold $T_{high}$, the recognition result is deemed reliable, no correction is triggered, and the page number is directly included in the temporary page number sequence;
[0071] When the overall confidence level is between the low confidence threshold $T_{low}$ and the high confidence threshold $T_{high}$, a lightweight correction process is executed. If the page number has two or more digits, the digits are first reversed, i.e., $\mathrm{rev}(p_i) = \mathrm{ReverseDigits}(p_i)$, and the corrected first overall confidence level is calculated;
[0072] Determine whether the corrected first overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0073] If so, output the temporary page number sequence and terminate the lightweight correction;
[0074] If not, then based on the preset confused number set S={(3,8),(8,3),(2,7),(7,2),(5,6),(6,5),(9,0),(0,9)}, single character replacement is performed, and the corrected second overall confidence level is recalculated;
[0075] Determine whether the corrected second overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0076] If so, output the temporary page number sequence;
[0077] When the overall confidence level is less than or equal to the low confidence threshold $T_{low}$, the enhanced correction process is executed, in which number reversal correction and confused number set replacement are applied in combination, in sequence, to obtain the first candidate page number;
[0078] A plausible value is predicted from the trend of adjacent page numbers to determine the second candidate page number. Specifically, a linear sequence $p_{pred} = k \times i + b$ (k and b are fitting coefficients) is fitted to $p_{i-2}$, $p_{i-1}$, $p_{i+1}$, and $p_{i+2}$ to obtain the second candidate page number;
[0079] Select the page number with the higher overall confidence level from the first candidate page number and the second candidate page number as the corrected page number, recalculate the corrected third overall confidence level, and output a temporary page number sequence.
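The candidate generation in step S02 can be sketched as follows. Page numbers are handled as digit strings, and `score` is a hypothetical callback returning the overall confidence of a candidate. Reading "in sequence" as reversal followed by single-digit replacement, and zero-padding the predicted value to the original width, are assumptions.

```python
CONFUSABLE = {"3": "8", "8": "3", "2": "7", "7": "2",
              "5": "6", "6": "5", "9": "0", "0": "9"}  # set S from the text


def reverse_digits(page):
    """Reverse the digit string; single-digit page numbers are left unchanged."""
    return page[::-1] if len(page) >= 2 else page


def single_replacements(page):
    """All strings obtained by swapping exactly one digit for its confusable partner."""
    return [page[:i] + CONFUSABLE[ch] + page[i + 1:]
            for i, ch in enumerate(page) if ch in CONFUSABLE]


def predict_from_neighbours(points, i):
    """Least-squares fit of p = k*i + b over (index, page number) pairs.

    Requires at least two distinct indices; returns the rounded value at i.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    k = sxy / sxx
    b = mean_y - k * mean_x
    return round(k * i + b)


def enhanced_correction(page, points, i, score):
    """Return the better of the combined-correction and trend-prediction candidates."""
    reversed_page = reverse_digits(page)
    first = max([reversed_page] + single_replacements(reversed_page), key=score)
    second = str(predict_from_neighbours(points, i)).zfill(len(page))
    return max((first, second), key=score)
```

The lightweight process would use the same helpers: try `reverse_digits` first, then `single_replacements`, stopping as soon as a candidate's confidence reaches the high threshold.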
[0080] Step S03: Perform multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected overall confidence level.
[0081] Specifically, the corrected overall confidence level is obtained, and it is determined whether the corrected overall confidence level is less than the medium confidence threshold $T_{mid}$;
[0082] If so, the range of confused number replacement in the temporary page number sequence is expanded (allowing replacement of 2 characters, such as "89" → "30"), and the overall confidence level is recalculated to obtain the first target confidence level;
[0083] Determine whether the first target confidence level is less than the medium confidence threshold $T_{mid}$;
[0084] If so, the page number is corrected again by applying page number format rules (such as padding with zeros and removing redundant characters), and the overall confidence level is recalculated to obtain the second target confidence level;
[0085] Determine whether the second target confidence level is greater than or equal to the medium confidence threshold $T_{mid}$;
[0086] If yes, output the target temporary page number sequence; otherwise, mark it as a page number to be manually reviewed.
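The two rounds of step S03 can be sketched as below. `score` is again a hypothetical callback for the recalculated overall confidence, and the fixed `width` used for zero padding, the default medium threshold, and the choice to return the last candidate together with a manual-review flag are assumptions.

```python
from itertools import combinations

CONFUSABLE = {"3": "8", "8": "3", "2": "7", "7": "2",
              "5": "6", "6": "5", "9": "0", "0": "9"}  # set S from the text


def double_replacements(page):
    """All strings obtained by swapping exactly two confusable digits."""
    out = []
    for i, j in combinations(range(len(page)), 2):
        if page[i] in CONFUSABLE and page[j] in CONFUSABLE:
            out.append(page[:i] + CONFUSABLE[page[i]] + page[i + 1:j]
                       + CONFUSABLE[page[j]] + page[j + 1:])
    return out


def normalise_format(page, width):
    """Drop non-digit characters and left-pad with zeros to `width`."""
    digits = "".join(ch for ch in page if ch in "0123456789")
    return digits.zfill(width)


def closed_loop(page, confidence, score, t_mid=0.65, width=3):
    """Return (page, confidence, needs_manual_review)."""
    if confidence >= t_mid:
        return page, confidence, False
    # Round 1: widen confusable replacement to two characters.
    best = max(double_replacements(page) + [page], key=score)
    first_target = score(best)
    if first_target >= t_mid:
        return best, first_target, False
    # Round 2: apply page number format rules.
    formatted = normalise_format(best, width)
    second_target = score(formatted)
    return formatted, second_target, second_target < t_mid
```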
[0087] It should be noted that the expression for the corrected overall confidence level is:
[0088] ;
[0089] where the symbols of the expression denote, in order, the overall confidence level before correction, the corrected overall confidence level, the fusion coefficient λ, the upper limit of the deviation, the corrected page number of page i, and the corrected page number of page i-1. Whenever the overall confidence level is recalculated after a page number has been corrected, it is computed using this expression for the corrected overall confidence level.
[0090] Step S04: Based on the correction results, classify the credibility of the abnormality of the archive page numbers.
[0091] It should be noted that the page number difference is calculated for the target temporary page number sequence, and the anomaly classification is performed by combining the comprehensive confidence before and after the correction, which includes three anomaly levels;
[0092] Level 1 anomaly: The overall confidence level before correction is less than or equal to the low confidence threshold $T_{low}$, the corrected overall confidence level is greater than or equal to the high confidence threshold $T_{high}$, and the absolute value of the difference between adjacent page numbers is less than or equal to 1. The error is automatically corrected and output without manual intervention.
[0093] Level 2 anomaly: The overall confidence level before correction is greater than the low confidence threshold $T_{low}$ and less than the high confidence threshold $T_{high}$, and the corrected overall confidence level is greater than or equal to the medium confidence threshold $T_{mid}$ and less than the high confidence threshold $T_{high}$; alternatively, the corrected overall confidence level is greater than or equal to the high confidence threshold $T_{high}$ but the absolute value of the difference between adjacent page numbers is greater than or equal to 1. In either case, the abnormal position and a correction suggestion (such as "originally identified as 18, corrected to 13; it is recommended to check whether the file is missing pages") are marked, triggering manual review. Level 2 anomalies are mainly used to focus attention on high-risk anomalies and improve review efficiency.
[0094] Level 3 anomaly: The overall confidence level before correction is greater than or equal to the high confidence threshold $T_{high}$ and the absolute value of the difference between adjacent page numbers is greater than or equal to 1. The page is marked as a physical anomaly in the archive (such as skipped pages, duplicate pages, or missing page numbers) and output to the archive processing stage. Level 3 anomalies are mainly used to distinguish "identification errors" from "issues with the archive itself" to avoid invalid corrections.
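The three-level grading can be sketched as a single decision function. As written, the Level 1 condition (difference less than or equal to 1) and the Level 2 and Level 3 conditions (difference greater than or equal to 1) overlap when the difference is exactly 1; the sketch assumes the strict reading (greater than 1) for Levels 2 and 3 so that ordinary consecutive pages are not flagged. The function name, the default thresholds, the check order, and the `None` result for unflagged pages are likewise assumptions.

```python
from typing import Optional


def grade_anomaly(pre, post, page_gap,
                  t_high=0.8, t_mid=0.65, t_low=0.5) -> Optional[int]:
    """Return anomaly level 1, 2, or 3, or None when nothing is flagged.

    pre / post: overall confidence before / after correction.
    page_gap:   absolute difference between adjacent page numbers.
    """
    if pre >= t_high:
        # Recognition was reliable, so a gap points at the archive itself.
        return 3 if page_gap > 1 else None
    if pre <= t_low and post >= t_high and page_gap <= 1:
        return 1  # machine-reliable correction, no manual step
    mid_band = t_low < pre < t_high and t_mid <= post < t_high
    if mid_band or (post >= t_high and page_gap > 1):
        return 2  # mark position and suggestion, trigger manual review
    return None   # combinations the text does not classify
```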
[0095] The above-mentioned automatic page number detection method for archives can be summarized as follows: First, the page numbers of the archives are extracted using OCR or a visual language model to form an initial array. The confidence scores for visual recognition, continuity, and regularization are calculated and then weighted and fused to obtain the comprehensive confidence score. Next, the weight coefficients are dynamically adjusted based on the ambiguity, font type, and format uniformity of the archive page numbers. Based on preset thresholds, the comprehensive confidence score is divided into high, medium, and low intervals, triggering no correction, light correction (reversal + single character similarity replacement), or enhanced correction (combined correction + trend prediction + format optimization), respectively. Multi-round closed-loop verification is combined to ensure the reliability of the correction. Finally, the page number difference and the confidence scores before and after correction are used to achieve a three-level classification of anomalies (machine-reliable correction, manual review required, physical anomalies of the archives).
[0096] In summary, the automatic page number detection method for archives in the above embodiments of the present invention identifies each page of the archive, extracts the corresponding page number, and calculates the comprehensive confidence score of each page. The comprehensive confidence score is obtained by weighted summation of visual recognition confidence score, continuity confidence score, and regularization verification confidence score. The method compares the comprehensive confidence score with a comprehensive confidence score interval, dynamically triggers different correction strategies, and outputs a temporary page number sequence and the corrected comprehensive confidence score. Based on the corrected comprehensive confidence score, the temporary page number sequence undergoes multiple rounds of closed-loop correction. Based on the correction results, the anomaly confidence level of the archive page numbers is graded. Specifically, through multi-source confidence score fusion, number reversal detection, and similar number confusion restoration, an intelligent page number detection algorithm framework is constructed, realizing full-process automation of page number automatic recognition, error correction, and continuity anomaly detection.
[0097] Embodiment 2
[0098] Referring to Figure 2, which is a structural block diagram of an automatic document page number detection system 200 provided in Embodiment 2 of the present invention, the automatic document page number detection system 200 specifically includes a recognition module 21, a triggering module 22, a correction module 23, and a grading module 24, wherein:
[0099] The recognition module 21 is used to recognize each page of the document, extract the corresponding page number, and calculate the comprehensive confidence score of each page. The comprehensive confidence score is obtained by weighted summation of visual recognition confidence score, continuity confidence score, and regular expression verification confidence score. The weights of visual recognition confidence score, continuity confidence score, and regular expression verification confidence score are adjusted according to the preprocessing features of the document page number. The preprocessing features include at least ambiguity, format uniformity, and font type.
[0100] If the document page number is printed and the ambiguity is less than the preset ambiguity threshold, then increase the weight corresponding to the visual recognition confidence level.
[0101] If the file page number is handwritten and the ambiguity is greater than or equal to the preset ambiguity threshold, then increase the weights corresponding to the continuity confidence and regularity validation confidence.
[0102] If the page number format is consistent, increase the weight corresponding to the confidence score of the regular expression validation.
[0103] Trigger module 22 is used to compare the overall confidence level with the overall confidence level interval, dynamically trigger different correction strategies, and output a temporary page number sequence and the corrected overall confidence level. Three overall confidence level intervals are preset, delimited by a high confidence threshold $T_{high}$, a medium confidence threshold $T_{mid}$, and a low confidence threshold $T_{low}$, and a corresponding correction strategy is triggered based on the interval containing the overall confidence level. Specifically, this includes:
[0104] When the overall confidence level is greater than or equal to the high confidence threshold $T_{high}$, the recognition result is deemed reliable, no correction is triggered, and the page number is directly included in the temporary page number sequence;
[0105] When the overall confidence level is between the low confidence threshold $T_{low}$ and the high confidence threshold $T_{high}$, a lightweight correction process is executed. If the page number has two or more digits, the digits are reversed first, and the corrected first overall confidence level is calculated;
[0106] Determine whether the corrected first overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0107] If so, output the temporary page number sequence and terminate the lightweight correction;
[0108] If not, then based on the preset confused number set S={(3,8),(8,3),(2,7),(7,2),(5,6),(6,5),(9,0),(0,9)}, perform single character replacement and recalculate the corrected second overall confidence level;
[0109] Determine whether the corrected second overall confidence level is greater than or equal to the high confidence threshold $T_{high}$;
[0110] If so, output the temporary page number sequence;
[0111] When the overall confidence level is less than or equal to the low confidence threshold $T_{low}$, the enhanced correction process is executed, in which number reversal correction and confused number set replacement are applied in combination, in sequence, to obtain the first candidate page number;
[0112] A plausible value is predicted from the trend of adjacent page numbers to obtain the second candidate page number;
[0113] Select the page number with the higher overall confidence level from the first candidate page number and the second candidate page number as the corrected page number, recalculate the corrected third overall confidence level, and output a temporary page number sequence;
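The three correction strategies described above might be sketched in Python as follows. This is only an illustration under stated assumptions: `score` is a hypothetical callable that returns the comprehensive confidence of a candidate page number string, the default thresholds are placeholders, single-character replacement is applied to the originally recognized number, and the trend prediction is simplified to "previous page number plus one". None of these details is fixed by the text.

```python
# Confusable digit set S from the text, stored as a lookup table.
CONFUSABLE = {"3": "8", "8": "3", "2": "7", "7": "2",
              "5": "6", "6": "5", "9": "0", "0": "9"}


def reversed_digits(page):
    """Digit-order reversal, applied only to numbers with two or more digits."""
    return page[::-1] if len(page) >= 2 else page


def single_replacements(page):
    """All strings obtained by swapping exactly one confusable digit."""
    return [page[:i] + CONFUSABLE[ch] + page[i + 1:]
            for i, ch in enumerate(page) if ch in CONFUSABLE]


def correct_page(page, conf, score, prev_page, t_high=0.85, t_low=0.5):
    """Return (page, confidence) after the strategy selected by `conf`."""
    if conf >= t_high:                       # reliable: no correction triggered
        return page, conf
    if conf > t_low:                         # lightweight correction
        rev = reversed_digits(page)
        if score(rev) >= t_high:             # corrected first confidence
            return rev, score(rev)
        for cand in single_replacements(page):
            if score(cand) >= t_high:        # corrected second confidence
                return cand, score(cand)
        return page, conf                    # left for later closed-loop rounds
    # Enhanced correction: reversal combined with replacement (first candidate)
    rev = reversed_digits(page)
    first = max([rev] + single_replacements(rev), key=score)
    # Assumed trend prediction: the previous page number plus one (second candidate)
    second = str(int(prev_page) + 1)
    best = max((first, second), key=score)
    return best, score(best)                 # corrected third confidence
```

With a scorer that strongly prefers "13" after page "12", a medium-confidence "31" is repaired by reversal, a medium-confidence "18" by replacing 8 with 3, and a low-confidence "99" by the trend-predicted candidate.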
[0114] The correction module 23 is used to perform multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence;
[0115] The grading module 24 is used to grade the anomaly credibility of the archive page numbers according to the correction result. The page number difference is calculated for the target temporary page number sequence, and anomaly grading is performed in combination with the comprehensive confidence before and after correction. There are three anomaly levels:
[0116] Level 1 anomaly: the comprehensive confidence before correction is less than or equal to the low confidence threshold T_low, the corrected comprehensive confidence is greater than or equal to the high confidence threshold T_high, and the absolute value of the difference between adjacent page numbers is less than or equal to 1; the error is automatically corrected and output without manual intervention;
[0117] Level 2 anomaly: the comprehensive confidence before correction is greater than the low confidence threshold T_low and less than the high confidence threshold T_high, and the corrected comprehensive confidence is greater than or equal to the medium confidence threshold T_mid and less than the high confidence threshold T_high; or the corrected comprehensive confidence is greater than or equal to the high confidence threshold T_high but the absolute value of the difference between adjacent page numbers is greater than or equal to 1; the abnormal position and a correction suggestion are marked, and manual review is triggered;
[0118] Level 3 anomaly: the comprehensive confidence before correction is greater than or equal to the high confidence threshold T_high and the absolute value of the difference between adjacent page numbers is greater than or equal to 1; the page is marked as a physical anomaly of the archive and output to the archive processing stage.
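The three anomaly levels above can be expressed as a small Python function. This is a sketch, not the patented implementation: the default thresholds are placeholders, the conditions are checked in the order Level 1, Level 3, Level 2 so that overlapping cases resolve deterministically, and the "or" in the Level 2 condition is read as a separate alternative. All of these choices are assumptions. The comparisons follow the wording of the text literally, including "greater than or equal to 1".

```python
def grade_anomaly(conf_before, conf_after, page_diff,
                  t_high=0.85, t_mid=0.7, t_low=0.5):
    """Return the anomaly level (1, 2, or 3), or None if no level applies.

    page_diff is the difference between adjacent page numbers in the
    target temporary page number sequence.
    """
    gap = abs(page_diff)
    # Level 1: low confidence before, high confidence after, small gap.
    if conf_before <= t_low and conf_after >= t_high and gap <= 1:
        return 1  # corrected automatically, no manual intervention
    # Level 3: recognition was already reliable, yet a gap remains.
    if conf_before >= t_high and gap >= 1:
        return 3  # marked as a physical anomaly of the archive
    # Level 2: medium confidence before and after, or high confidence
    # after correction with a remaining gap.
    medium_case = (t_low < conf_before < t_high
                   and t_mid <= conf_after < t_high)
    gap_case = conf_after >= t_high and gap >= 1
    if medium_case or gap_case:
        return 2  # mark position and suggestion, trigger manual review
    return None
```

For example, `grade_anomaly(0.3, 0.9, 1)` returns 1, `grade_anomaly(0.6, 0.75, 1)` returns 2, and `grade_anomaly(0.9, 0.9, 3)` returns 3.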
[0119] The expression for the corrected comprehensive confidence is:
[0120] ;
[0121] wherein the expression involves the comprehensive confidence before correction, the corrected comprehensive confidence, the fusion coefficient λ, the upper limit of the deviation, the corrected page number of page i, and the corrected page number of page i-1.
[0122] Furthermore, in some optional embodiments of the present invention, the correction module 23 includes:
[0123] The first judgment unit is used to obtain the corrected comprehensive confidence and determine whether it is less than the medium confidence threshold T_mid;
[0124] The first calculation unit is used to, when the corrected comprehensive confidence is less than the medium confidence threshold T_mid, expand the range of confusable digit replacement for the temporary page number sequence and recalculate the comprehensive confidence to obtain a first target confidence;
[0125] The second judgment unit is used to determine whether the first target confidence is less than the medium confidence threshold T_mid;
[0126] The second calculation unit is used to, when the first target confidence is less than the medium confidence threshold T_mid, perform a further correction in combination with the page number format rules and recalculate the comprehensive confidence to obtain a second target confidence;
[0127] The third judgment unit is used to determine whether the second target confidence is greater than or equal to the medium confidence threshold T_mid;
[0128] The output unit is used to output the target temporary page number sequence when the second target confidence is greater than or equal to the medium confidence threshold T_mid.
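The judgment and calculation units above describe a two-round loop, which might look like the following Python sketch for a single page number. Several details are assumptions: `score` is a hypothetical callable returning the comprehensive confidence of a candidate string, "expanding the replacement range" is interpreted as allowing any subset of confusable digits to be swapped, the format rule is reduced to keeping digits and dropping leading zeros, and the returned flag for a result that still falls below T_mid is an illustrative addition.

```python
import re
from itertools import product

# Confusable digit set S from the text, stored as a lookup table.
CONFUSABLE = {"3": "8", "8": "3", "2": "7", "7": "2",
              "5": "6", "6": "5", "9": "0", "0": "9"}


def expanded_replacements(page):
    """Variants with any subset of confusable digits swapped (original excluded)."""
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in page]
    variants = ("".join(p) for p in product(*options))
    return [v for v in variants if v != page]


def apply_format_rules(page):
    """Assumed format rule: keep digits only and drop leading zeros."""
    digits = re.sub(r"\D", "", page).lstrip("0")
    return digits or "0"


def closed_loop_correct(page, conf, score, t_mid=0.7):
    """Return (page, confidence, accepted) after at most two further rounds."""
    if conf >= t_mid:                        # first judgment unit
        return page, conf, True
    # Round 1: expanded confusable-digit replacement (first target confidence).
    cand = max([page] + expanded_replacements(page), key=score)
    first_target = score(cand)
    if first_target >= t_mid:                # second judgment unit
        return cand, first_target, True
    # Round 2: correction with format rules (second target confidence).
    cand = apply_format_rules(cand)
    second_target = score(cand)
    return cand, second_target, second_target >= t_mid
```

Including the original string among the round 1 candidates means a replacement is kept only when it scores strictly higher, so round 2 starts from the best string seen so far.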
[0129] Embodiment 3
[0130] In another aspect, the present invention also proposes an electronic device. Please refer to Figure 3, which shows an electronic device according to Embodiment 3 of the present invention, including a memory 20, a processor 10, and a computer program 30 stored in the memory and executable on the processor. When the processor 10 executes the computer program 30, it implements the automatic archive page number detection method described above.
[0131] In some embodiments, the processor 10 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run program code stored in memory 20 or process data, such as executing access restriction programs.
[0132] The memory 20 includes at least one type of readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 20 can be an internal storage unit of an electronic device, such as the hard disk of the electronic device. In other embodiments, the memory 20 can also be an external storage device of the electronic device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. Furthermore, the memory 20 can include both internal and external storage units of the electronic device. The memory 20 can be used not only to store application software and various types of data of the electronic device, but also to temporarily store data that has been output or will be output.
[0133] It should be pointed out that the structure shown in Figure 3 does not constitute a limitation on the electronic device. In other embodiments, the electronic device may include fewer or more components than shown, or combine certain components, or have different component arrangements.
[0134] This invention also proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the automatic document page number detection method described above.
[0135] Those skilled in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein can, for example, be considered an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, "computer-readable medium" can mean any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.
[0136] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.
[0137] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0138] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0139] The above embodiments merely illustrate several implementation methods of the present invention, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be determined by the appended claims.
Claims
1. A method for automatically detecting archive page numbers, characterized in that the method includes: recognizing each page of the archive, extracting the corresponding page number, and calculating the comprehensive confidence of each page, wherein the comprehensive confidence is obtained as a weighted sum of the visual recognition confidence, the continuity confidence, and the regular expression verification confidence; comparing the comprehensive confidence with the comprehensive confidence intervals, dynamically triggering different correction strategies, and outputting a temporary page number sequence and the corrected comprehensive confidence; performing multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence; and grading the anomaly credibility of the archive page numbers according to the correction result; wherein, in the step of comparing the comprehensive confidence with the comprehensive confidence intervals, dynamically triggering different correction strategies, and outputting a temporary page number sequence and the corrected comprehensive confidence, three comprehensive confidence intervals are preset, delimited by a high confidence threshold T_high, a medium confidence threshold T_mid, and a low confidence threshold T_low, and the corresponding correction strategy is triggered according to the interval in which the comprehensive confidence falls, specifically including: when the comprehensive confidence is greater than or equal to the high confidence threshold T_high, the recognition result is deemed reliable, no correction is triggered, and the page number is directly included in the temporary page number sequence; when the comprehensive confidence is between the low confidence threshold T_low and the high confidence threshold T_high, a lightweight correction process is executed;
if the page number has two or more digits, digit-order reversal correction is performed first, and the corrected first comprehensive confidence is calculated; determining whether the corrected first comprehensive confidence is greater than or equal to the high confidence threshold T_high; if so, outputting the temporary page number sequence and terminating the lightweight correction; if not, performing single-character replacement based on the preset confusable digit set S={(3,8),(8,3),(2,7),(7,2),(5,6),(6,5),(9,0),(0,9)}, and recalculating to obtain the corrected second comprehensive confidence; determining whether the corrected second comprehensive confidence is greater than or equal to the high confidence threshold T_high; if so, outputting the temporary page number sequence; when the comprehensive confidence is less than or equal to the low confidence threshold T_low, an enhanced correction process is executed, in which digit-order reversal correction and confusable digit set replacement are performed in combination, in sequence, to obtain a first candidate page number; a reasonable value is obtained by trend prediction from the adjacent page numbers, yielding a second candidate page number; and the candidate with the higher comprehensive confidence is selected from the first candidate page number and the second candidate page number as the corrected page number, the corrected third comprehensive confidence is recalculated, and the temporary page number sequence is output.
2. The automatic page number detection method for archives according to claim 1, characterized in that, in the step of recognizing each page of the archive, extracting the corresponding page number, and calculating the comprehensive confidence of each page, the weights of the visual recognition confidence, the continuity confidence, and the regular expression verification confidence are adjusted according to the preprocessing features of the archive page number, the preprocessing features including at least ambiguity, format uniformity, and font type; if the archive page number is printed and the ambiguity is less than the preset ambiguity threshold, the weight of the visual recognition confidence is increased; if the archive page number is handwritten and the ambiguity is greater than or equal to the preset ambiguity threshold, the weights of the continuity confidence and the regular expression verification confidence are increased; and if the page number format is uniform, the weight of the regular expression verification confidence is increased.
3. The automatic page number detection method for archives according to claim 2, characterized in that the step of performing multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence includes: obtaining the corrected comprehensive confidence, and determining whether the corrected comprehensive confidence is less than the medium confidence threshold T_mid; if so, expanding the range of confusable digit replacement for the temporary page number sequence, and recalculating the comprehensive confidence to obtain a first target confidence; determining whether the first target confidence is less than the medium confidence threshold T_mid; if so, performing a further correction in combination with the page number format rules, and recalculating the comprehensive confidence to obtain a second target confidence; determining whether the second target confidence is greater than or equal to the medium confidence threshold T_mid; and if so, outputting the target temporary page number sequence.
4. The automatic page number detection method for archives according to claim 3, characterized in that, in the step of grading the anomaly credibility of the archive page numbers according to the correction result, the page number difference is calculated for the target temporary page number sequence, and anomaly grading is performed in combination with the comprehensive confidence before and after correction, with three anomaly levels: Level 1 anomaly: the comprehensive confidence before correction is less than or equal to the low confidence threshold T_low, the corrected comprehensive confidence is greater than or equal to the high confidence threshold T_high, and the absolute value of the difference between adjacent page numbers is less than or equal to 1; the error is automatically corrected and output without manual intervention; Level 2 anomaly: the comprehensive confidence before correction is greater than the low confidence threshold T_low and less than the high confidence threshold T_high, and the corrected comprehensive confidence is greater than or equal to the medium confidence threshold T_mid and less than the high confidence threshold T_high; or the corrected comprehensive confidence is greater than or equal to the high confidence threshold T_high but the absolute value of the difference between adjacent page numbers is greater than or equal to 1; the abnormal position and a correction suggestion are marked, and manual review is triggered; Level 3 anomaly: the comprehensive confidence before correction is greater than or equal to the high confidence threshold T_high and the absolute value of the difference between adjacent page numbers is greater than or equal to 1; the page is marked as a physical anomaly of the archive and output to the archive processing stage.
5. The automatic page number detection method for archives according to claim 4, characterized in that the expression for the corrected comprehensive confidence is: ; wherein the expression involves the comprehensive confidence before correction, the corrected comprehensive confidence, the fusion coefficient λ, the upper limit of the deviation, the corrected page number of page i, and the corrected page number of page i-1.
6. An automatic page number detection system for archives, characterized in that the system is used to implement the automatic page number detection method for archives according to any one of claims 1-5, and comprises: a recognition module, used to recognize each page of the archive, extract the corresponding page number, and calculate the comprehensive confidence of each page, wherein the comprehensive confidence is obtained as a weighted sum of the visual recognition confidence, the continuity confidence, and the regular expression verification confidence; a triggering module, used to compare the comprehensive confidence with the comprehensive confidence intervals, dynamically trigger different correction strategies, and output a temporary page number sequence and the corrected comprehensive confidence; a correction module, used to perform multiple rounds of closed-loop correction on the temporary page number sequence based on the corrected comprehensive confidence; and a grading module, used to grade the anomaly credibility of the archive page numbers according to the correction result.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the automatic page number detection method for archives according to any one of claims 1-5.
8. An electronic device, characterized in that the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the automatic page number detection method for archives according to any one of claims 1-5.