Method and system for character recognition of business expansion attachment images based on OCR

By combining optical character recognition with confidence analysis and a coding rule set, candidate character sets are screened and verified. This addresses the low recognition accuracy for the unified social credit code caused by seal occlusion or image blur in business license images, and enables accurate restoration in complex environments.

CN122244877A · Pending · Publication Date: 2026-06-19 · GUANGDONG POWER GRID CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGDONG POWER GRID CO LTD
Filing Date
2026-03-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In power business systems, the accuracy of automatically recognizing the unified social credit code is hard to guarantee when a seal obscures the code or the business license image is blurred. Existing technology cannot accurately restore the correct characters in such complex environments.

Method used

By combining optical character recognition technology with confidence analysis and encoding rule sets, candidate character sets are screened and verified, and verification algorithms are used to ensure accurate restoration of the unified social credit code.

Benefits of technology

Even when the seal is obscured or the image is blurred, the robustness and accuracy of the identification are significantly improved, ensuring the uniqueness and accuracy of the unified social credit code.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for character recognition in business expansion attachment images based on OCR, relating to natural language processing, and particularly to the field of character recognition. The method involves: performing optical character recognition on the business expansion attachment image containing the Unified Social Credit Code to obtain a recognition result; determining at least one fuzzy character position based on the recognition result, and determining a candidate character set for each fuzzy character position; filtering the candidate character set of each fuzzy character position against the legal character set of that position to determine the legal target candidate character set for each fuzzy character position; substituting each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and constructing candidate complete code sequences based on the character sequence; and verifying all candidate complete code sequences using a verification algorithm to determine the restored Unified Social Credit Code. In this way, the invention can accurately restore the correct Unified Social Credit Code.

Description

Technical Field

[0001] This invention relates to the field of character recognition, and more particularly to a method and system for character recognition of business expansion attachment images based on OCR.

Background Technology

[0002] In the power company's business expansion application process, customers need to submit a business license as proof of their legal status. The "Unified Social Credit Code" is the unique and unchanging identifier of the legal entity, used throughout the entire process, including user registration, contract signing, electricity account binding, and subsequent billing and auditing. Automatically and accurately identifying this code avoids the inefficiency and error risks associated with manual data entry, enabling automated verification of business information and cross-system data synchronization. This is a crucial step in improving the service efficiency of power company service halls and ensuring the consistency of customer information and business compliance.

[0003] Currently, power business systems commonly use optical character recognition (OCR) technology to automatically extract the Unified Social Credit Code from business license images. However, relying solely on the output confidence level for simple filtering is insufficient to guarantee accuracy. Specifically, in real-world business scenarios, original or scanned business licenses often bear the company's red official seal, and the seal's position frequently overlaps with the Unified Social Credit Code area, resulting in some characters being obscured, strokes overlapping, or complex background textures. Furthermore, the scanning process may introduce blurring factors such as uneven lighting, insufficient resolution, and image tilt. When key characters are obscured or blurred, misidentification, omissions, or inability to determine the correct character are highly likely, making it difficult to guarantee the uniqueness and accuracy of the restored results.

Summary of the Invention

[0004] This invention provides a character recognition method and system for business expansion attachment images based on OCR, which can accurately restore the correct unified social credit code when the seal is obscured or the image is blurred.

[0005] The first aspect of this invention provides a character recognition method for business expansion attachment images based on OCR, comprising: performing optical character recognition on the business expansion attachment image containing the unified social credit code to obtain a recognition result; determining at least one fuzzy character position based on the recognition result, and determining a candidate character set for each fuzzy character position; obtaining the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm; filtering, based on the legal character set of each character position, the candidate character set corresponding to each fuzzy character position to determine the legal target candidate character set for each fuzzy character position; substituting each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and constructing candidate complete code sequences based on the character sequence; and verifying all the candidate complete code sequences using the verification algorithm, and determining the restored unified social credit code based on the verification results.

[0006] This invention, through confidence analysis of optical character recognition results, determines fuzzy character positions and their candidate character sets. This effectively identifies "suspicious points" in the recognition process and preserves multiple possible correction directions, providing foundational data for subsequent accurate reconstruction. Secondly, by introducing the inherent encoding rule set of the unified social credit code, especially the legal character set for each character position, a first-level strict screening of candidate characters at fuzzy character positions can be performed, eliminating candidate characters that do not conform to the character composition rules, thereby reducing the scale of candidate combinations at the source. Thirdly, by substituting the filtered legal target candidate character set into the fuzzy character position to construct a complete candidate code sequence, the transformation from "character-level candidate" to "code-level candidate" is realized, creating conditions for verification using deeper encoding rules. Finally, a verification algorithm is used to perform a second verification on all candidate complete code sequences. Through mathematical logic consistency checks, all erroneous combinations that do not meet the verification rules are eliminated. When the verification result is unique, the final result is locked. The entire solution deeply integrates the fuzzy recognition results of OCR with the strong verification rules of the unified social credit code, enabling the unique and accurate reconstruction of the correct unified social credit code from a variety of uncertain candidate character combinations, significantly improving the robustness and accuracy of recognition in complex image environments.

[0007] Further, the step of performing optical character recognition on the business expansion attachment image containing the unified social credit code to obtain a recognition result, and determining at least one fuzzy character position based on the recognition result, includes: Image preprocessing is performed on the business expansion attachment image containing the unified social credit code to obtain optimized image data; The optimized image data is identified using optical character recognition technology to obtain a first character sequence and a confidence sequence corresponding to each character; Each confidence level in the confidence sequence is compared with a preset confidence threshold, and the character positions with confidence levels lower than the confidence threshold are marked as fuzzy character positions.

[0008] This improves image quality through image preprocessing, providing clearer input for subsequent recognition. The introduction of a confidence threshold mechanism can accurately locate uncertain character positions in the OCR recognition results caused by seal obstruction or image blurring, thus clearly distinguishing between "suspicious areas" and "reliable areas," laying the foundation for subsequent targeted restoration.

[0009] Further, the image preprocessing of the business expansion attachment image containing the unified social credit code to obtain optimized image data includes: The image of the business expansion attachment is subjected to denoising processing to obtain the first processing result; The first processing result is subjected to contrast enhancement processing to obtain the second processing result; The second processing result is then normalized to obtain optimized image data.

[0010] Further, determining the candidate character set in each of the fuzzy character positions includes: From the recognition results, obtain the initial candidate character list corresponding to each of the fuzzy character positions and the recognition confidence level corresponding to each candidate character; Based on the recognition confidence level, the candidate characters in the initial candidate character list are sorted, and a preset number of candidate characters at the top of the sort are selected to form the candidate character set in the fuzzy character position.

[0011] This approach, for each ambiguous character position, not only retains the preferred recognition result but also extracts multiple candidate characters and their confidence levels, preserving various possible correction directions. It provides ample candidate space for subsequent filtering and restoration using encoding rules, avoiding situations where restoration is impossible due to a single recognition error.

[0012] Further, the step of filtering the candidate character sets corresponding to each fuzzy character position based on the legal character set of each character position to determine the legal target candidate character set for each fuzzy character position includes: The candidate characters in the candidate character set of each fuzzy character position that belong to the corresponding legal character set are determined as the legal candidate characters corresponding to each fuzzy character position. The legal candidate characters for each of the fuzzy character positions are sorted in descending order of confidence level to obtain the target candidate character set.

[0013] By leveraging the inherent character validity constraints of the unified social credit code, candidate characters for each fuzzy character position are pre-screened, and illegal characters that do not meet the character set requirements for that character position are eliminated. This effectively reduces the size of candidate combinations and improves the efficiency and accuracy of subsequent verification.
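The character-level screening described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `filter_candidates`, the dictionary layout, and the 0-based positions are assumptions, and the fallback alphabet simply follows the rule stated in this document (digits plus uppercase letters without I, O, Z, S, V).

```python
# Hypothetical helper names; only the alphabet rule comes from the document.
LEGAL_CHARS = set("0123456789ABCDEFGHJKLMNPQRTUWXY")  # 31 legal characters


def filter_candidates(candidates, legal_sets=None):
    """Keep only legal candidates per fuzzy position, best confidence first.

    candidates: {position: [(char, confidence), ...]}
    legal_sets: optional {position: set_of_chars}; positions without an
                entry fall back to the general 31-character alphabet.
    """
    legal_sets = legal_sets or {}
    targets = {}
    for pos, cands in candidates.items():
        legal = legal_sets.get(pos, LEGAL_CHARS)
        kept = [(ch, conf) for ch, conf in cands if ch in legal]
        # Descending confidence, as the filtering step requires.
        kept.sort(key=lambda item: item[1], reverse=True)
        targets[pos] = kept
    return targets


# OCR often confuses the letter O with the digit 0; O is not a legal character.
ocr_candidates = {
    8: [("O", 0.48), ("0", 0.41), ("Q", 0.07)],
    15: [("X", 0.10), ("Y", 0.55), ("V", 0.30)],
}
targets = filter_candidates(ocr_candidates)
print(targets)
```

A position whose candidates are all illegal ends up with an empty list, which a caller would need to treat as "no restoration possible" for that image.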

[0014] Further, substituting each of the target candidate character sets into the corresponding fuzzy character positions to obtain at least one character sequence includes: using the content of the non-fuzzy character positions in the recognition result as the base character sequence; traversing the fuzzy character positions in order and, at the currently traversed fuzzy character position, substituting each candidate character from that position's target candidate character set to generate the intermediate character sequences of the current round, which then serve as the base character sequences for the traversal of the next fuzzy character position; and continuing until all fuzzy character positions have been traversed and replaced, at which point the final intermediate character sequences are taken as the character sequences.

[0015] This method of traversing and replacing systematically combines candidate characters for each ambiguous character position with high-confidence characters, which can exhaustively generate all possible complete code sequences, ensuring that no potentially correct candidate combinations are missed, and providing a complete candidate space for final verification.
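The round-by-round substitution can be illustrated with a short sketch. The function name `build_candidate_codes` and the use of 0-based positions are assumptions made for illustration; the sample code string is a commonly circulated example value and is used here only as test data.

```python
def build_candidate_codes(recognized, targets):
    """Expand one OCR string into every combination of target candidates.

    recognized: the OCR character sequence (reliable characters kept as-is)
    targets:    {fuzzy_position: [candidate_char, ...]}, 0-based positions
    """
    sequences = [recognized]  # base sequence for the first round
    for pos in sorted(targets):
        # Each round replaces one fuzzy position in every sequence produced
        # so far; the result becomes the base for the next round.
        sequences = [
            seq[:pos] + ch + seq[pos + 1:]
            for seq in sequences
            for ch in targets[pos]
        ]
    return sequences


codes = build_candidate_codes(
    "91350100M000100Y43",
    {8: ["M", "N"], 15: ["Y", "X"]},
)
print(codes)
```

The number of generated sequences is the product of the candidate counts per fuzzy position, which is why the earlier legality filter matters; if any position has no legal candidate, the result is an empty list.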

[0016] Further, the step of verifying all the candidate complete code sequences using the verification algorithm to determine the restored unified social credit code based on the verification results includes: traversing the candidate complete code sequences, computing a theoretical check character from the first 17 characters of the currently traversed candidate complete code sequence according to the verification algorithm, and comparing the theoretical check character with the 18th character of that sequence; if the two are consistent, adding the candidate complete code sequence to the first-round pass set; determining the number of sequences in the first-round pass set; if the first-round pass set contains only one candidate complete code sequence, determining that sequence as the restored unified social credit code; and if the first-round pass set contains multiple candidate complete code sequences, calculating, for each sequence in the first-round pass set, the sum of the confidence scores of the candidate characters selected at each fuzzy character position, and selecting the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.

[0017] In this way, the candidate sequences are logically verified by the verification algorithm, and the strong verification rules of the unified social credit code are used to eliminate mathematically invalid incorrect combinations. When there are multiple sequences that pass the verification, the best one is selected based on the sum of confidence scores, so as to ensure that the most likely correct restoration result can be output even in the case of multiple solutions.
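A compact sketch of the two-stage selection follows. The function names and the `(code, confidence_sum)` input layout are hypothetical. The character-to-value mapping, the weights (3^(i-1) mod 31) and the "31 minus remainder" rule are taken from the published coding standard GB 32100-2015 rather than from this patent text, so treat them as an assumption about which verification algorithm is meant.

```python
# Value of a character = its index in this string (assumed per GB 32100-2015).
ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"
WEIGHTS = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]


def check_char(first17):
    """Theoretical 18th character computed from the first 17 characters."""
    total = sum(ALPHABET.index(ch) * w for ch, w in zip(first17, WEIGHTS))
    return ALPHABET[(31 - total % 31) % 31]


def passes_check(code):
    return (
        len(code) == 18
        and all(ch in ALPHABET for ch in code)
        and check_char(code[:17]) == code[17]
    )


def select_code(candidates):
    """candidates: [(code, sum_of_confidences_at_fuzzy_positions), ...]."""
    passed = [(code, score) for code, score in candidates if passes_check(code)]
    if not passed:
        return None                      # nothing survives verification
    if len(passed) == 1:
        return passed[0][0]              # unique result
    return max(passed, key=lambda item: item[1])[0]  # tie-break on confidence


restored = select_code([
    ("91350100N000100Y43", 1.10),
    ("91350100M000100Y43", 0.96),
    ("91350100M000100X43", 0.70),
])
print(restored)
```

Note that the top-confidence candidate in this example fails the check and is discarded, which is the situation the verification stage is designed to catch. Returning `None` when nothing passes is a design choice of this sketch; the text does not specify that case.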

[0018] Furthermore, the encoding rule set also includes character length constraints; before filtering the candidate character sets corresponding to each fuzzy character position based on the legal character set of each character position, it also includes: Obtain the sequence length of the character sequence corresponding to the recognition result, and compare the sequence length with the character length constraint; If the sequence length is inconsistent with the character length constraint, the restoration process will be terminated and a length error message will be output.

[0019] Before performing complex candidate character screening and combination construction, the overall length of the recognition result is pre-checked. The fixed length constraint of the unified social credit code is used to quickly eliminate invalid inputs with inconsistent lengths, avoid subsequent invalid calculations, and improve the system processing efficiency.
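As a small illustration of this pre-check, the sketch below rejects inputs of the wrong length before any candidate expansion is attempted. The function name and the choice to raise `ValueError` as the "length error message" are assumptions.

```python
USCC_LENGTH = 18  # fixed code length stated in this document


def precheck_length(recognized: str) -> str:
    """Return the sequence unchanged, or stop early with a length error."""
    if len(recognized) != USCC_LENGTH:
        raise ValueError(
            f"length error: expected {USCC_LENGTH} characters, "
            f"got {len(recognized)}"
        )
    return recognized


try:
    precheck_length("91350100M000100Y4")  # one character was dropped by OCR
except ValueError as exc:
    print(exc)
```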

[0020] Another embodiment of the present invention provides a character recognition system for business expansion attachment images based on OCR, comprising: a first module, used to perform optical character recognition on the business expansion attachment image containing the unified social credit code, obtain the recognition result, determine at least one fuzzy character position based on the recognition result, and determine the candidate character set for each of the fuzzy character positions; a second module, used to obtain the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm; a third module, used to filter the candidate character set corresponding to each fuzzy character position according to the legal character set of each character position, determine the legal target candidate character set for each fuzzy character position, substitute each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and construct candidate complete code sequences based on the character sequence; and a fourth module, used to verify all the candidate complete code sequences using the verification algorithm, so as to determine the restored unified social credit code based on the verification results.

[0021] Furthermore, the fourth module includes: a traversal unit, used to traverse the candidate complete code sequences, compute a theoretical check character from the first 17 characters of the currently traversed candidate complete code sequence according to the verification algorithm, and compare the theoretical check character with the 18th character of that sequence, wherein the candidate complete code sequence is added to the first-round pass set if the two are consistent; a judgment unit, used to determine the number of sequences in the first-round pass set and, if the first-round pass set contains only one candidate complete code sequence, determine that sequence as the restored unified social credit code; and a calculation unit, configured to, if the first-round pass set contains multiple candidate complete code sequences, calculate for each sequence in the first-round pass set the sum of the confidence scores of the candidate characters selected at each fuzzy character position, and select the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.

Attached Figure Description

[0022] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0023] Figure 1 is a flowchart illustrating an embodiment of the character recognition method for business expansion attachment images based on OCR provided in this application; Figure 2 is a flowchart illustrating one embodiment of steps S201 to S203 provided in this application; Figure 3 is a flowchart illustrating one embodiment of steps S301 to S303 provided in this application; Figure 4 is a flowchart illustrating one embodiment of steps S401 to S403 provided in this application; Figure 5 is a schematic diagram of the structure of an embodiment of the OCR-based character recognition system for business expansion attachment images provided in this application.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0025] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.

[0026] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.

[0027] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0028] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.

[0029] In the description of the embodiments of this application, the term "multiple" refers to two or more (including two), similarly, "multiple sets" refers to two or more (including two sets), and "multiple pieces" refers to two or more (including two pieces).

[0030] In the description of the embodiments of this application, unless otherwise expressly specified and limited, technical terms such as "installation," "connection," "joining," and "fixing" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. For those skilled in the art, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.

[0031] Referring to Figure 1, to accurately restore the correct Unified Social Credit Code when the seal obscures the code or the image is blurred, an embodiment of the present invention provides a character recognition method for business expansion attachment images based on OCR, including steps S101 to S104:

Step S101: Perform optical character recognition on the business expansion attachment image containing the unified social credit code to obtain the recognition result. Based on the recognition result, determine at least one fuzzy character position and determine the candidate character set for each fuzzy character position.

Referring to Figure 2, in some embodiments, the step of performing optical character recognition on the business expansion attachment image containing the unified social credit code, obtaining a recognition result, and determining at least one fuzzy character position based on the recognition result includes steps S201 to S203:

Step S201: Perform image preprocessing on the business expansion attachment image containing the unified social credit code to obtain optimized image data.

In some embodiments, step S201 includes: acquiring the business expansion attachment image containing the unified social credit code; performing denoising processing on the business expansion attachment image to obtain a first processing result; performing contrast enhancement processing on the first processing result to obtain a second processing result; and performing size normalization processing on the second processing result to obtain optimized image data.

Specifically, first, after the business expansion attachment image containing the unified social credit code is acquired, a median filtering algorithm is applied to it for denoising. A sliding window traverses the image, the grayscale values of all pixels within the window are sorted, and the median value is taken as the new grayscale value of the window's center point. This effectively filters out impulse noise while better preserving high-frequency details such as character edges, avoiding blurring of character outlines due to excessive smoothing, and yields the first processing result. Next, histogram equalization is applied to the first processing result to enhance contrast. By statistically analyzing the distribution of gray levels in the image, pixel values concentrated in a narrow gray range in the original image are mapped to the entire gray-level dynamic range. This stretches the gray-level difference between the characters and the background, making character areas that were previously difficult to distinguish due to insufficient lighting or low contrast clearer and more prominent, and yields the second processing result. Finally, the second processing result undergoes size normalization. The image is uniformly scaled to a fixed height and width according to a preset standard size, eliminating character size differences caused by inconsistent shooting distances or scanning resolutions, while maintaining the aspect ratio of the area containing the unified social credit code. This yields optimized image data with a uniform format, clear texture, and distinct character-background boundaries.
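The three preprocessing stages can be sketched in dependency-free Python on a grayscale image stored as a list of rows. All function names, the 3x3 window, the clamped border handling, and the nearest-neighbour scaling are illustrative assumptions; a production system would more likely call an image library (for example OpenCV's `medianBlur`, `equalizeHist` and `resize`), and this sketch does not attempt the aspect-ratio handling mentioned in the text.

```python
from statistics import median


def median_filter(img, k=3):
    """Replace each pixel by the median of its k x k window (borders clamped)."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = int(median(window))
    return out


def equalize_histogram(img):
    """Stretch gray levels over 0..255 using the cumulative histogram."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    cdf, running = [0] * 256, 0
    for level in range(256):
        running += hist[level]
        cdf[level] = running
    total = running
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # single gray level: nothing to stretch
        return [row[:] for row in img]
    scale = 255 / (total - cdf_min)
    return [[round((cdf[v] - cdf_min) * scale) for v in row] for row in img]


def resize_nearest(img, new_h, new_w):
    """Scale to a fixed size by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [
        [img[y * h // new_h][x * w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


def preprocess(img, size=(4, 8)):
    return resize_nearest(equalize_histogram(median_filter(img)), *size)


# A flat gray patch with one impulse-noise pixel in the middle.
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255
denoised = median_filter(noisy)
optimized = preprocess(noisy)
```

The median filter removes the isolated bright pixel without averaging it into its neighbours, which is the edge-preserving behaviour the paragraph describes.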

[0032] Step S202: Optical character recognition technology is used to recognize the optimized image data to obtain a first character sequence and a confidence sequence corresponding to each character. In some embodiments, the optimized image data is input into an OCR model. A convolutional neural network extracts character region features from the image, and then a recurrent neural network combined with an attention mechanism decodes the character sequence character by character, outputting a preliminary character recognition result. During this decoding process, the model not only outputs the specific character content for each character position but also generates a confidence score for that recognition result based on its internal probability distribution, representing the model's degree of certainty regarding the recognition result at that position. The recognition results for all character positions are then concatenated in order to form the first character sequence, and the confidence scores corresponding to each character position are arranged in the same order to form the confidence sequence. For example, when recognizing the unified social credit code in a business license image, the model might output an 18-character sequence, along with a confidence score between 0 and 1 for each character; the higher the score, the more reliable the recognition result for that position.

[0033] It should be noted that the training process of the OCR recognition model is as follows. First, a training dataset covering various business-related image scenarios is constructed. This dataset includes original images of various certificates such as business licenses and identity documents. Data augmentation techniques such as random rotation, affine transformation, Gaussian noise simulation, random contrast adjustment, and stamp texture overlay are applied to the original images to generate a large number of training samples simulating uneven lighting, partial occlusion, and complex backgrounds in real business environments. All samples are precisely labeled at the character level, forming a mapping relationship between image regions and corresponding character sequences. The model adopts a CRNN+CTC (Convolutional Recurrent Neural Network and Connectionist Temporal Classification) architecture. The convolutional neural network part is responsible for extracting high-dimensional feature maps from the input image, while the recurrent neural network part uses a bidirectional long short-term memory network to capture the temporal dependencies in the feature maps. The CTC module serves as the loss function, achieving end-to-end sequence learning without requiring alignment of character and image positions. During training, the model continuously optimizes network parameters through the backpropagation algorithm, minimizing the CTC loss and thereby reducing the edit distance between the predicted character sequence and the labeled ground truth, until the loss value converges on the validation set and the recognition accuracy stabilizes, ultimately obtaining an OCR model that can be used for character recognition in actual business expansion attachment images.
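To make the link between a CTC-trained model and the per-character confidence sequence concrete, the sketch below shows standard greedy CTC decoding: take the most probable class per frame, merge consecutive repeats, and drop blanks. The function name, the convention that class 0 is the blank, and the choice of the maximum frame probability as a character's confidence are assumptions; the patent does not state how its model derives confidence values.

```python
def ctc_greedy_decode(frame_probs, alphabet):
    """Greedy CTC decoding with a simple per-character confidence.

    frame_probs: list of per-frame probability lists; index 0 is the CTC
                 blank, index i >= 1 corresponds to alphabet[i - 1].
    Returns (text, confidences) with one confidence per decoded character.
    """
    chars, confs = [], []
    prev = 0  # start as if the previous frame was blank
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != 0:
            if best == prev:
                # Same character continuing across frames: keep the
                # highest frame probability as its confidence.
                confs[-1] = max(confs[-1], probs[best])
            else:
                chars.append(alphabet[best - 1])
                confs.append(probs[best])
        prev = best
    return "".join(chars), confs


frames = [
    [0.1, 0.8, 0.1],  # '9'
    [0.1, 0.9, 0.0],  # '9' again (same character, merged)
    [0.7, 0.2, 0.1],  # blank
    [0.2, 0.1, 0.7],  # 'A'
]
text, confidences = ctc_greedy_decode(frames, alphabet="9A")
print(text, confidences)
```

A blank between two identical classes separates them, so the frames "9, blank, 9" decode to two characters rather than one.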

[0034] Step S203: Compare each confidence level in the confidence level sequence with a preset confidence level threshold, and mark the character positions with confidence levels lower than the confidence level threshold as fuzzy character positions.

[0035] In some embodiments, a confidence threshold is pre-set based on the accuracy requirements of the business scenario, for example, it can be set to 0.90. Then, the system iterates through the confidence sequence, comparing the confidence value of each character position with the preset threshold. If the confidence value of a character position is greater than or equal to the threshold, the recognition result of that character position is determined to be reliable, and it is considered a high-confidence character position, directly adopting its character content in subsequent processing. Conversely, if the confidence value of a character position is lower than the threshold, the character position is determined to have a recognition anomaly, and it is marked as an ambiguous character position.
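The threshold comparison amounts to a single pass over the confidence sequence. In this sketch the function name and 0-based indexing are assumptions; the 0.90 default mirrors the example value given above.

```python
def mark_fuzzy_positions(confidences, threshold=0.90):
    """Return the 0-based positions whose confidence is below the threshold."""
    return [i for i, conf in enumerate(confidences) if conf < threshold]


confidence_sequence = [0.99, 0.42, 0.90, 0.89, 0.97]
fuzzy_positions = mark_fuzzy_positions(confidence_sequence)
print(fuzzy_positions)
```

A value exactly equal to the threshold counts as reliable, matching the "greater than or equal to" wording of the paragraph.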

[0036] This improves image quality through image preprocessing, providing clearer input for subsequent recognition. The introduction of a confidence threshold mechanism can accurately locate uncertain character positions in the OCR recognition results caused by seal obstruction or image blurring, thus clearly distinguishing between "suspicious areas" and "reliable areas," laying the foundation for subsequent targeted restoration.

[0037] In some embodiments, determining the candidate character set for each fuzzy character position includes: obtaining an initial candidate character list corresponding to each fuzzy character position and a recognition confidence score for each candidate character from the recognition results; sorting the candidate characters in the initial candidate character list based on the recognition confidence scores, and selecting a preset number of candidate characters at the top of the sorted list to form the candidate character set for the fuzzy character position. Specifically, when a character position is marked as a fuzzy character position, the system obtains an initial candidate character list corresponding to that fuzzy character position from the recognition results. Each candidate character in the list is accompanied by its recognition confidence score. After obtaining the list, the system sorts the candidate characters in descending order of confidence score, placing the candidate character with the highest confidence score at the top, and so on, forming an ordered candidate sequence. Subsequently, the system selects a preset number of candidate characters, such as the top 3, to form the candidate character set for that fuzzy character position, which is used to subsequently construct a complete candidate code sequence.
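Selecting the candidate character set is a sort followed by a cut. The sketch assumes the OCR engine exposes, for a fuzzy position, a list of `(character, confidence)` pairs; the function name and the default of three candidates (taken from the example above) are illustrative.

```python
def top_k_candidates(initial_candidates, k=3):
    """Sort by descending confidence and keep the first k candidates."""
    ranked = sorted(initial_candidates, key=lambda item: item[1], reverse=True)
    return ranked[:k]


initial = [("8", 0.21), ("B", 0.46), ("3", 0.05), ("6", 0.02), ("E", 0.19)]
candidate_set = top_k_candidates(initial)
print(candidate_set)
```

A larger `k` keeps more correction options at the cost of more combinations to verify later.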

[0038] For each fuzzy character position, this approach retains not only the top-ranked recognition result but also several alternative candidate characters and their confidence levels, preserving multiple possible correction directions. It provides ample candidate space for subsequent filtering and restoration using the encoding rules, avoiding situations where restoration is impossible because of a single recognition error.

[0039] Step S102: Obtain the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm. In some embodiments, the encoding rule set is obtained by calling a pre-built unified social credit code encoding standard database. This database is constructed from the unified social credit code encoding rules issued by the state, which define the 18-character length of the code, the legal character range for each character position, and the verification algorithm for calculating the 18th character (the check character) from the first 17 characters. Specifically, the legal character set specifies the characters allowed at each position. The unified social credit code can only use the Arabic numerals 0 to 9 and uppercase English letters A to Z; to avoid confusion with numerals, the letters I, O, Z, S, and V are explicitly excluded from the legal character set. The verification algorithm is a mathematical rule for calculating the 18th check character from the first 17 characters. The algorithm first converts each character into its corresponding numerical value according to a preset mapping table, then multiplies the numerical value of each character by the weighting factor of its position, sums the products, and takes the sum modulo 31. The result is the numerical value corresponding to the 18th check character. Finally, this numerical value is converted back into the corresponding character through a reverse mapping.

[0040] It should be noted that, except for a few specific positions such as the first registration management department code and the second organization category code which have additional restrictions, the legal character set for the remaining character positions is the above-mentioned 0-9 plus uppercase letters excluding I, O, Z, S, and V.
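
The general legal character set and the check calculation can be sketched as below. The mapping table and weights are not given in the text; the sketch assumes the values commonly attributed to the national standard GB 32100-2015, namely the 31 legal symbols mapped in order to 0 to 30 and position weights 3^(i-1) mod 31. It also assumes that the standard subtracts the remainder from 31 before the reverse mapping (with 31 wrapping to 0), a step the description does not spell out. Position-specific restrictions for positions 1 and 2 are omitted, and all names are hypothetical.

```python
# Assumption: 0-9 plus A-Z without I, O, Z, S, V, mapped in order to 0..30.
LEGAL_CHARS = "0123456789ABCDEFGHJKLMNPQRTUWXY"
CHAR_TO_VALUE = {ch: value for value, ch in enumerate(LEGAL_CHARS)}

# Assumption: weight of position i (1-based) is 3**(i-1) mod 31.
WEIGHTS = [pow(3, i, 31) for i in range(17)]


def check_character(body: str) -> str:
    """Compute the 18th character from the first 17 characters."""
    if len(body) != 17:
        raise ValueError("body must contain exactly 17 characters")
    total = sum(CHAR_TO_VALUE[ch] * w for ch, w in zip(body, WEIGHTS))
    # Assumed standard form: 31 minus the remainder, with 31 wrapping to 0.
    return LEGAL_CHARS[(31 - total % 31) % 31]


# Sample 17-character body used only to exercise the arithmetic.
print(check_character("91350100M000100Y4"))  # 3
```

For the sample body the weighted sum is 1640, the remainder modulo 31 is 28, and 31 - 28 = 3, so the check character is "3".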

[0041] Step S103: Based on the legal character set of each character position, filter the candidate character set corresponding to each fuzzy character position, determine the legal target candidate character set at each fuzzy character position, substitute each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and construct a candidate complete code sequence based on the character sequence. In some embodiments, the encoding rule set further includes a character length constraint; before filtering the candidate character sets corresponding to each fuzzy character position according to the legal character set of each character position, the method further includes: obtaining the sequence length of the character sequence corresponding to the recognition result, and comparing the sequence length with the character length constraint; if the sequence length is inconsistent with the character length constraint, the restoration process is terminated and a length abnormality prompt is output. Specifically, according to the national standard for the unified social credit code, its length is fixed at 18 characters, so the system uses 18 characters as the benchmark for the length constraint. If the measured sequence length is exactly 18 characters, the length check passes and the subsequent candidate character set filtering is allowed to proceed. If the sequence length is not 18 characters, for example 17 or 19 characters because a character was omitted during OCR recognition, seal texture was misidentified as extra characters, or characters were lost to image edge cropping, the system determines that the recognition result is clearly abnormal. At this point, the system immediately terminates the subsequent restoration process, stops filtering candidate character sets and constructing and verifying candidate complete code sequences, and directly outputs a length error message to the user interface or upper-level business system.
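
The length pre-check can be sketched as a guard that runs before any filtering. Raising an exception is only one possible way to "terminate the restoration process"; the exception class `CodeLengthError` and the function `check_length` are hypothetical names introduced for illustration.

```python
EXPECTED_LENGTH = 18  # fixed length of the unified social credit code


class CodeLengthError(ValueError):
    """Raised when the recognized sequence does not have the expected length."""


def check_length(recognized: str, expected: int = EXPECTED_LENGTH) -> None:
    """Stop processing early if the recognized sequence has the wrong length."""
    if len(recognized) != expected:
        raise CodeLengthError(
            f"length abnormality: expected {expected} characters, got {len(recognized)}"
        )


check_length("91350100M000100Y43")  # 18 characters, passes silently
try:
    check_length("91350100M000100Y4")  # 17 characters, e.g. one character dropped
except CodeLengthError as err:
    print(err)
```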

[0042] Before performing complex candidate character screening and combination construction, the overall length of the recognition result is pre-checked. The fixed length constraint of the unified social credit code is used to quickly eliminate invalid inputs with inconsistent lengths, avoid subsequent invalid calculations, and improve the system processing efficiency.

[0043] In some embodiments, the step of filtering the candidate character sets corresponding to each fuzzy character position according to the legal character set of each character position to determine the legal target candidate character set for each fuzzy character position includes: determining the candidate characters in the candidate character set of each fuzzy character position that belong to the corresponding legal character set as the legal candidate characters corresponding to each fuzzy character position; and sorting the legal candidate characters of each fuzzy character position in descending order of confidence level to obtain the target candidate character set. Specifically, the character ranges allowed at different character positions of the unified social credit code differ in the coding specification: for example, the registration management department code position only allows specific numerals or uppercase letters, while the main identifier code positions allow the numerals 0-9 and the uppercase letters A-Z excluding I, O, Z, S, and V. It is therefore necessary to select, for each fuzzy character position, the legal character set that the encoding rule set defines for that specific position. Subsequently, for each fuzzy character position, the system traverses each candidate character in its candidate character set and checks whether the candidate character belongs to the legal character set corresponding to that character position. Candidate characters that belong to the legal character set are identified as legal candidate characters for that fuzzy character position. Candidate characters that do not belong to the legal character set are removed and excluded from subsequent processing; an example is the letter "Z" appearing as a candidate at a main identifier code position, since it has been explicitly excluded from the legal character set. After completing the legality screening, the system sorts the remaining legal candidate characters for each fuzzy character position from high to low by their original recognition confidence in the candidate character list, placing the legal candidate character with the highest confidence first, the second highest second, and so on, thus forming the target candidate character set for that fuzzy character position.
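
A minimal sketch of the legality screening for one fuzzy character position follows. It uses only the general legal character set; position-specific sets for the first two positions would be passed in through the `legal` parameter. The function name `filter_legal` and the sample candidates are assumptions for illustration.

```python
from typing import AbstractSet, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, recognition confidence)

# General legal set: 0-9 plus A-Z without I, O, Z, S, V.
GENERAL_LEGAL = frozenset("0123456789ABCDEFGHJKLMNPQRTUWXY")


def filter_legal(candidates: List[Candidate],
                 legal: AbstractSet[str] = GENERAL_LEGAL) -> List[Candidate]:
    """Drop illegal candidates, then sort the rest by confidence, highest first."""
    kept = [(ch, conf) for ch, conf in candidates if ch in legal]
    return sorted(kept, key=lambda cand: cand[1], reverse=True)


# "Z" has the highest confidence but is excluded from the legal set.
print(filter_legal([("Z", 0.5), ("2", 0.31), ("7", 0.45)]))  # [('7', 0.45), ('2', 0.31)]
```

An empty result means no candidate at that position is legal; how to handle that case is not specified in the text and is left to the caller here.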

[0044] By leveraging the inherent character validity constraints of the unified social credit code, candidate characters for each fuzzy character position are pre-screened, and illegal characters that do not meet the character set requirements for that character position are eliminated. This effectively reduces the size of candidate combinations and improves the efficiency and accuracy of subsequent verification.

[0045] Referring to Figure 3, in some embodiments, substituting each of the target candidate character sets into the corresponding fuzzy character positions to obtain at least one character sequence includes steps S301 to S303. Step S301: Use the content of the non-fuzzy character positions in the recognition result as the basic part sequence. In some embodiments, all character positions not marked as fuzzy character positions are identified from the first character sequence. These are high-confidence character positions, and their recognition results are considered accurate and reliable. The system extracts the contents of these high-confidence character positions in their original order to form a basic part sequence. For example, assuming that in an 18-character unified social credit code sequence positions 1 to 8 and positions 11 to 18 are high-confidence character positions, while positions 9 and 10 are marked as fuzzy character positions, the system concatenates the characters of positions 1 to 8 with the characters of positions 11 to 18 in order to form a basic part sequence containing 16 definite characters, with positions 9 and 10 temporarily left vacant as the targets of the subsequent substitution operations.

[0046] Step S302: Iterate through each fuzzy character position in sequence, replace the currently traversed fuzzy character position with each candidate character in the target candidate character set corresponding to that fuzzy character position, generate the intermediate character sequence of the current round, and use the intermediate character sequence of the current round as the base character sequence for the next fuzzy character position traversal. In some embodiments, for the currently traversed fuzzy character position, the system obtains the target candidate character set corresponding to that position, then selects each candidate character from the set in turn and substitutes the selected candidate character into the corresponding vacant position in the basic part sequence to generate the intermediate character sequence for the current round. Subsequently, the system uses this intermediate character sequence as the base character sequence for the next fuzzy character position traversal. If there are multiple candidate characters for the current fuzzy character position, the system generates an independent intermediate character sequence branch for each candidate character, and each branch independently enters the processing flow of the next fuzzy character position.

[0047] For example, when processing the 9th fuzzy character position, if the target candidate character set contains two candidate characters, "8" and "3", the system first selects "8" to substitute into the 9th position. At this time, an intermediate character sequence is generated that includes the filling result of the 9th position, while the other positions are still high-confidence characters or not yet filled. When processing the 10th fuzzy character position, the sequence of the already filled 9th position will be used as a new starting point, and then the candidate characters "0" and "6" for the 10th position will be substituted in sequentially to generate a more complete intermediate character sequence.

[0048] Step S303: Continue until all ambiguous character positions have been traversed and replaced, and finally determine all the intermediate character sequences generated as the character sequence.

[0049] In some embodiments, after the substitution operation for the last fuzzy character position is completed, all vacant positions in the intermediate character sequences have been filled, forming complete 18-character sequences. The system then collects all the intermediate character sequences generated at this point to form the final character sequence set. For example, if there are two fuzzy character positions, with two candidate characters for the 9th position and two candidate characters for the 10th position, the traversal produces 2 × 2 = 4 combinations, ultimately generating 4 complete character sequences.
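
The branch-by-branch substitution of steps S301 to S303 is equivalent to taking the Cartesian product of the target candidate sets. The sketch below uses `itertools.product` for that rather than explicit branching; the function name `expand_candidates`, the 0-based indexing, and the "?" placeholder for vacant positions are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List


def expand_candidates(recognized: str, targets: Dict[int, List[str]]) -> List[str]:
    """Fill every fuzzy position (0-based index) with every combination of candidates."""
    positions = sorted(targets)
    sequences = []
    for combo in product(*(targets[p] for p in positions)):
        chars = list(recognized)  # start from the high-confidence characters
        for pos, ch in zip(positions, combo):
            chars[pos] = ch
        sequences.append("".join(chars))
    return sequences


# Positions 9 and 10 (1-based) are fuzzy, as in the example above.
recognized = "91350100??00100Y43"
result = expand_candidates(recognized, {8: ["8", "3"], 9: ["0", "6"]})
print(len(result))  # 4
print(result[0])    # 913501008000100Y43
```

The number of sequences is the product of the candidate set sizes, so it grows quickly with the number of fuzzy positions; if any position has an empty target set, no sequence is produced.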

[0050] This method of traversing and replacing systematically combines candidate characters for each ambiguous character position with high-confidence characters, which can exhaustively generate all possible complete code sequences, ensuring that no potentially correct candidate combinations are missed, and providing a complete candidate space for final verification.

[0051] In some embodiments, a candidate complete code sequence is constructed based on the character sequence. Specifically, the 18-character sequences formed by each combination of candidate characters across all fuzzy character positions are collected one by one, each forming a candidate complete code sequence.

[0052] Step S104: Verify all the candidate complete code sequences using the verification algorithm, and determine the restored unified social credit code based on the verification results.

[0053] Referring to Figure 4, the step of using the verification algorithm to verify all the candidate complete code sequences, and determining the restored unified social credit code based on the verification results, includes steps S401 to S403. Step S401: Traverse the candidate complete code sequences, apply the verification algorithm to the first 17 characters of the currently traversed candidate complete code sequence to generate a theoretical check code, and compare the theoretical check code with the 18th character of the current candidate complete code sequence. If they are consistent, the candidate complete code sequence is included in the first round pass set. In some embodiments, the first 17 characters of the currently traversed candidate complete code sequence are extracted. Following the verification algorithm rules of the unified social credit code, each character is first converted into a corresponding numerical value through a preset mapping table. Then, the numerical value of each character is multiplied by its corresponding weighting factor. The products of the first 17 characters are summed, and the sum is taken modulo 31 to obtain a value between 0 and 30. Finally, this value is converted into the corresponding check code character through reverse mapping, thus generating the theoretical check code. Subsequently, the system compares this theoretical check code with the 18th character in the current candidate complete code sequence. If they match, the sequence mathematically conforms to the encoding rules of the unified social credit code, and the system includes it in the first round pass set. In some embodiments, if the two are inconsistent, the fuzzy character filling scheme in the sequence contains an error, and the system removes the sequence and excludes it from subsequent processing.

[0054] Step S402: Determine the number of sequences in the first round pass set. If the first round pass set contains only one candidate complete code sequence, then determine the corresponding candidate complete code sequence as the restored unified social credit code. In some embodiments, after verifying and comparing all candidate complete code sequences, the number of candidate complete code sequences in the first-round pass set is counted. If the first-round pass set contains only one candidate complete code sequence, it means that among all possible combinations of fuzzy character padding, only one combination can pass the verification algorithm. At this time, the system determines this unique sequence as the restored unified social credit code.

[0055] Step S403: If the first round pass set contains multiple candidate complete code sequences, calculate the sum of the confidence scores of the candidate characters selected for each fuzzy character position in each sequence in the first round pass set, and select the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.

[0056] In some embodiments, if the system determines that the first-round pass set contains multiple candidate complete code sequences, it indicates that multiple different fuzzy character filling schemes can pass the verification algorithm. In this case, a further filtering mechanism is needed to determine the final restoration result. For each candidate complete code sequence in the first-round pass set, the system traces back the fuzzy character candidate character information associated with it during the construction process in step S303, extracts the recognition confidence level corresponding to the candidate character used for each fuzzy character position in the sequence, and sums the confidence levels of these fuzzy character positions to calculate the sum of the confidence levels of the sequence. Since the confidence level of each candidate character reflects the degree of certainty of the OCR model regarding the character recognition result, the higher the sum of confidence levels, the closer the sequence as a whole is to the model's original recognition judgment. The system iterates through all sequences in the first-round pass set, calculates the sum of confidence levels for each sequence, and then selects the candidate complete code sequence with the highest sum of confidence levels, identifying it as the restored unified social credit code.
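
Steps S401 to S403 can be sketched as a filter followed by a tie-break. The check calculation here assumes the mapping, the weights 3^(i-1) mod 31, and the "31 minus remainder" form commonly attributed to GB 32100-2015, none of which are spelled out in the text. Returning `None` when no sequence passes is also an assumption, since the description does not cover that case. All names and confidence sums are hypothetical.

```python
from typing import List, Optional, Tuple

LEGAL_CHARS = "0123456789ABCDEFGHJKLMNPQRTUWXY"  # assumed value mapping 0..30
WEIGHTS = [pow(3, i, 31) for i in range(17)]     # assumed position weights


def check_character(body: str) -> str:
    """Theoretical check character for 17 already-legal characters."""
    total = sum(LEGAL_CHARS.index(ch) * w for ch, w in zip(body, WEIGHTS))
    return LEGAL_CHARS[(31 - total % 31) % 31]


def restore(candidates: List[Tuple[str, float]]) -> Optional[str]:
    """Pick the restored code from (sequence, summed fuzzy-position confidence) pairs."""
    # S401: keep sequences whose 18th character matches the theoretical check code.
    passed = [(seq, score) for seq, score in candidates
              if check_character(seq[:17]) == seq[17]]
    if not passed:
        return None  # assumption: no restoration is possible
    if len(passed) == 1:
        return passed[0][0]  # S402: unique pass
    # S403: several sequences pass, so take the highest confidence sum.
    return max(passed, key=lambda item: item[1])[0]


candidates = [
    ("91350100M000100Y43", 1.10),  # passes the check
    ("91350100J100100Y43", 0.75),  # also passes the check
    ("91350100M100100Y43", 0.90),  # fails the check
]
print(restore(candidates))  # 91350100M000100Y43
```

Because 31 is prime and no weight is a multiple of 31, changing a single character always changes the check value, so multiple passing sequences can only arise when two or more positions are fuzzy.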

[0057] In this way, the candidate sequences are logically verified by the verification algorithm, and the strong verification rules of the unified social credit code are used to eliminate mathematically invalid incorrect combinations. When there are multiple sequences that pass the verification, the best one is selected based on the sum of confidence scores, so as to ensure that the most likely correct restoration result can be output even in the case of multiple solutions.

[0058] This invention, through confidence analysis of optical character recognition results, determines fuzzy character positions and their candidate character sets. This effectively identifies "suspicious points" in the recognition process and preserves multiple possible correction directions, providing foundational data for subsequent accurate reconstruction. Secondly, by introducing the inherent encoding rule set of the unified social credit code, especially the legal character set for each character position, a first-level strict screening of candidate characters at fuzzy character positions can be performed, eliminating candidate characters that do not conform to the character composition rules, thereby reducing the scale of candidate combinations at the source. Thirdly, by substituting the filtered legal target candidate character set into the fuzzy character position to construct a complete candidate code sequence, the transformation from "character-level candidate" to "code-level candidate" is realized, creating conditions for verification using deeper encoding rules. Finally, a verification algorithm is used to perform a second verification on all candidate complete code sequences. Through mathematical logic consistency checks, all erroneous combinations that do not meet the verification rules are eliminated. When the verification result is unique, the final result is locked. The entire solution deeply integrates the fuzzy recognition results of OCR with the strong verification rules of the unified social credit code, enabling the unique and accurate reconstruction of the correct unified social credit code from a variety of uncertain candidate character combinations, significantly improving the robustness and accuracy of recognition in complex image environments.

[0059] As shown in Figure 5, corresponding apparatus embodiments are provided based on the above method embodiments. An embodiment of the present invention provides a character recognition system for business expansion attachment images based on OCR, comprising: a first module 100, used to perform optical character recognition on the business expansion attachment image containing the unified social credit code, obtain the recognition result, determine at least one fuzzy character position based on the recognition result, and determine the candidate character set at each of the fuzzy character positions; a second module 200, used to obtain the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm; a third module 300, used to filter the candidate character set corresponding to each fuzzy character position according to the legal character set of each character position, determine the legal target candidate character set at each fuzzy character position, substitute each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and construct a candidate complete code sequence based on the character sequence; and a fourth module 400, used to verify all the candidate complete code sequences using the verification algorithm, so as to determine the restored unified social credit code based on the verification results.

[0060] In some embodiments, the fourth module 400 includes: a traversal unit, used to traverse the candidate complete code sequences, apply the verification algorithm to the first 17 characters of the currently traversed candidate complete code sequence to generate a theoretical check code, and compare the theoretical check code with the 18th character of the current candidate complete code sequence, wherein if they are consistent, the candidate complete code sequence is included in the first round pass set; a judgment unit, used to judge the number of sequences in the first round pass set, wherein if the first round pass set contains only one candidate complete code sequence, that candidate complete code sequence is determined as the restored unified social credit code; and a calculation unit, configured to, if the first round pass set contains multiple candidate complete code sequences, calculate the sum of the confidence scores of the candidate characters selected for each fuzzy character position in each sequence in the first round pass set, and select the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.

[0061] It is understood that the above-described device embodiments correspond to the method embodiments of the present invention, and can implement the character recognition method for business expansion attachment images based on OCR provided by any of the above-described method embodiments of the present invention.

[0062] It should be noted that the device embodiments described above are merely illustrative, and some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can specifically be implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.

[0063] Based on the above embodiments of the character recognition method for business expansion attachment images based on OCR, another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements the character recognition method for business expansion attachment images based on OCR of any embodiment of the present invention.

[0064] For example, in this embodiment, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the terminal device.

[0065] The terminal device may be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor and a memory.

[0066] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the terminal device, connecting all parts of the terminal device via various interfaces and lines.

[0067] Based on the above-described method embodiments, another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the character recognition method for business expansion attachment images based on OCR described in any of the above-described method embodiments of the present invention.

[0068] The modules / units integrated in the device / terminal equipment, if implemented as software functional units and sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.

[0069] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.

Claims

1. A character recognition method for business expansion attachment images based on OCR, characterized in that it comprises: performing optical character recognition on the business expansion attachment image containing the unified social credit code to obtain the recognition result, determining at least one fuzzy character position based on the recognition result, and determining a candidate character set for each fuzzy character position; obtaining the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm; filtering, based on the legal character set of each character position, the candidate character set corresponding to each fuzzy character position to determine the legal target candidate character set at each fuzzy character position, substituting each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and constructing a candidate complete code sequence based on the character sequence; and verifying all the candidate complete code sequences using the verification algorithm, and determining the restored unified social credit code based on the verification results.

2. The character recognition method for business expansion attachment images based on OCR according to claim 1, characterized in that the step of performing optical character recognition on the business expansion attachment image containing the unified social credit code, obtaining the recognition result, and determining at least one fuzzy character position based on the recognition result includes: performing image preprocessing on the business expansion attachment image containing the unified social credit code to obtain optimized image data; recognizing the optimized image data using optical character recognition technology to obtain a first character sequence and a confidence sequence corresponding to each character; and comparing each confidence level in the confidence sequence with a preset confidence threshold, and marking the character positions with confidence levels lower than the confidence threshold as fuzzy character positions.

3. The character recognition method for business expansion attachment images based on OCR according to claim 2, characterized in that the step of performing image preprocessing on the business expansion attachment image containing the unified social credit code to obtain optimized image data includes: performing denoising processing on the business expansion attachment image to obtain a first processing result; performing contrast enhancement processing on the first processing result to obtain a second processing result; and normalizing the second processing result to obtain the optimized image data.

4. The character recognition method for business expansion attachment images based on OCR according to claim 1, characterized in that the determination of the candidate character set at each of the fuzzy character positions includes: obtaining, from the recognition results, the initial candidate character list corresponding to each of the fuzzy character positions and the recognition confidence level corresponding to each candidate character; and sorting the candidate characters in the initial candidate character list based on the recognition confidence level, and selecting a preset number of candidate characters at the top of the sorted list to form the candidate character set at the fuzzy character position.

5. The character recognition method for business expansion attachment images based on OCR according to claim 1, characterized in that the step of filtering the candidate character sets corresponding to each fuzzy character position based on the legal character set for each character position, and determining the legal target candidate character set for each fuzzy character position, includes: determining the candidate characters in the candidate character set of each fuzzy character position that belong to the corresponding legal character set as the legal candidate characters corresponding to each fuzzy character position; and sorting the legal candidate characters for each of the fuzzy character positions in descending order of confidence level to obtain the target candidate character set.

6. The character recognition method for business expansion attachment images based on OCR according to claim 1, characterized in that the step of substituting each of the target candidate character sets into the corresponding fuzzy character positions to obtain at least one character sequence includes: using the content of the non-fuzzy character positions in the recognition result as the basic part sequence; traversing each fuzzy character position sequentially, replacing the currently traversed fuzzy character position with each candidate character in the target candidate character set corresponding to that fuzzy character position, generating the intermediate character sequence of the current round, and using the intermediate character sequence of the current round as the base character sequence for the next fuzzy character position traversal; and continuing until all fuzzy character positions have been traversed and replaced, at which point all the intermediate character sequences finally generated are determined as the character sequence.

7. The character recognition method for business expansion attachment images based on OCR according to claim 1, characterized in that the step of verifying all the candidate complete code sequences using the verification algorithm, and determining the restored unified social credit code based on the verification results, includes: traversing the candidate complete code sequences, computing a theoretical check code from the first 17 characters of the currently traversed candidate complete code sequence according to the verification algorithm, and comparing the theoretical check code with the 18th character of that candidate complete code sequence; if the two are consistent, adding the candidate complete code sequence to the first-round pass set; determining the number of sequences in the first-round pass set; if the first-round pass set contains only one candidate complete code sequence, determining that candidate complete code sequence as the restored unified social credit code; and if the first-round pass set contains multiple candidate complete code sequences, calculating, for each sequence in the first-round pass set, the sum of the confidence scores of the candidate characters selected at each fuzzy character position, and selecting the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.
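
As an illustration only, the verification and tie-breaking in claim 7 can be sketched as follows. The claim does not spell out the verification algorithm; this sketch assumes the modulo-31 weighted check defined for the unified social credit code in GB 32100-2015. The names `theoretical_check_char` and `restore`, the pairing of each sequence with its list of confidences, and the choice to return `None` when no sequence passes are assumptions of the example.

```python
from typing import List, Optional, Tuple

# Assumed check scheme (GB 32100-2015): 31-character alphabet and weights.
ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"
WEIGHTS = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]

# (candidate complete code sequence, confidences of the characters
#  chosen at its fuzzy positions)
Scored = Tuple[str, List[float]]


def theoretical_check_char(first17: str) -> str:
    """Compute the check character from the first 17 characters."""
    total = sum(ALPHABET.index(ch) * w for ch, w in zip(first17, WEIGHTS))
    return ALPHABET[(31 - total % 31) % 31]


def restore(candidates: List[Scored]) -> Optional[str]:
    """Return the restored code, or None if no candidate passes the check."""
    passed = [
        (seq, confs) for seq, confs in candidates
        if len(seq) == 18
        and all(ch in ALPHABET for ch in seq)
        and theoretical_check_char(seq[:17]) == seq[17]
    ]  # first-round pass set
    if not passed:
        return None  # assumption: the claim does not cover this case
    if len(passed) == 1:
        return passed[0][0]
    # Several sequences pass: pick the one with the highest confidence sum.
    return max(passed, key=lambda item: sum(item[1]))[0]


# Made-up candidates; only the first one satisfies the assumed check.
restored = restore([
    ("91440000A12345678U", [0.45, 0.60]),
    ("91448000A12345678U", [0.50, 0.60]),
    ("91440000A12346678U", [0.45, 0.30]),
])
print(restored)
```

Running the check before comparing confidences means a high-confidence but wrong character cannot win unless the whole sequence is also self-consistent.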

8. The character recognition method for business expansion attachment images based on OCR according to any one of claims 1-7, characterized in that the encoding rule set further includes a character length constraint; and before filtering the candidate character set corresponding to each fuzzy character position according to the legal character set for each character position, the method further includes: obtaining the sequence length of the character sequence corresponding to the recognition result, and comparing the sequence length with the character length constraint; and if the sequence length is inconsistent with the character length constraint, terminating the restoration process and outputting a length error message.
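
As an illustration only, the length pre-check in claim 8 can be sketched as an early exit. The constraint value of 18 characters is taken from the 18th-character check position mentioned in claim 7; the exception type `LengthError`, the function name `check_length`, and the message wording are assumptions of the example.

```python
EXPECTED_LENGTH = 18  # assumed character length constraint


class LengthError(ValueError):
    """Raised when the recognized sequence violates the length constraint."""


def check_length(recognized: str, expected: int = EXPECTED_LENGTH) -> None:
    """Stop the restoration early if the sequence length is wrong."""
    if len(recognized) != expected:
        raise LengthError(
            f"length error: expected {expected} characters, "
            f"got {len(recognized)}"
        )


try:
    check_length("9144?000A1234?678")  # one character was lost by the OCR
except LengthError as err:
    print(err)
```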

9. A character recognition system for business expansion attachment images based on OCR, characterized by comprising: a first module, used to perform optical character recognition on the business expansion attachment image containing the unified social credit code to obtain the recognition result, determine at least one fuzzy character position based on the recognition result, and determine the candidate character set for each fuzzy character position; a second module, used to obtain the encoding rule set of the unified social credit code, wherein the encoding rule set includes the legal character set for each character position and the verification algorithm; a third module, used to filter the candidate character set corresponding to each fuzzy character position according to the legal character set for each character position, determine the legal target candidate character set for each fuzzy character position, substitute each target candidate character set into the corresponding fuzzy character position to obtain at least one character sequence, and construct a candidate complete code sequence based on the character sequence; and a fourth module, used to verify all the candidate complete code sequences using the verification algorithm, so as to determine the restored unified social credit code based on the verification results.

10. The character recognition system for business expansion attachment images based on OCR according to claim 9, characterized in that the fourth module includes: a traversal unit, used to traverse the candidate complete code sequences, compute a theoretical check code from the first 17 characters of the currently traversed candidate complete code sequence according to the verification algorithm, compare the theoretical check code with the 18th character of that candidate complete code sequence, and, if the two are consistent, add the candidate complete code sequence to the first-round pass set; a judgment unit, used to determine the number of sequences in the first-round pass set and, if the first-round pass set contains only one candidate complete code sequence, determine that candidate complete code sequence as the restored unified social credit code; and a calculation unit, configured to, if the first-round pass set contains multiple candidate complete code sequences, calculate, for each sequence in the first-round pass set, the sum of the confidence scores of the candidate characters selected at each fuzzy character position, and select the candidate complete code sequence with the highest sum of confidence scores as the restored unified social credit code.