[0033] The present invention is a character-feature-based method for recovering documents broken by a paper shredder. First, the broken Chinese document is converted into digital images; the fragments are then extracted on a computer by image preprocessing; finally, the fragments are spliced together according to the structural characteristics of Chinese characters, achieving automatic recovery of the broken Chinese document. The image digitization part converts the paper fragments into digital images with a scanner or similar equipment, so that the information is undistorted and easy to compute on. The image preprocessing part uses histogram equalization, image filtering, and fragment extraction to enhance the fragment images and extract them. The image splicing part establishes correlations between fragments from the stroke structure of the Chinese characters and splices the fragments accordingly, restoring the original content of the Chinese document.
[0034] Seen in its written form, a Chinese character is a flat block: all of its strokes are laid out in an orderly way inside a planar box. This is the most obvious outward feature of Chinese characters, the "square" structure often described as unique to them. The basic strokes of Chinese characters are generally taken to be the horizontal, vertical, left-falling, dot, and turning strokes. As for how often each stroke occurs, Zhang Xingchu et al., in the 1965 "Estimation of the Frequency of Use of Various Strokes of Chinese Characters", reported that horizontal strokes account for 31%, vertical strokes for 16%, left-falling strokes for 15%, and dot strokes for 12%; Zhang Jingxian's 2004 "Chinese Characters Tutorial" reported 27.68% for horizontal strokes, 17.60% for vertical strokes, 15.95% for left-falling strokes, and 13.62% for dot strokes. Both sources show that the horizontal stroke is the most frequent stroke in Chinese characters. According to the statistics in the "GB13000.1 Character Set Chinese Character Order (Stroke Order) Specification", 20,902 Chinese characters are currently in use, with an average of 12.8 strokes per character; 12-stroke characters are the most numerous, 1,957 in total. The "Commonly Used Characters List of Modern Chinese" contains 3,500 commonly used characters, with an average of 9.7 strokes per character; 9-stroke characters are the most numerous, 415 in total. Taking the 27.68% share of horizontal strokes, it follows that each character in the GB13000.1 character set contains 3.54 horizontal strokes on average, and each commonly used character contains 2.68 on average. The horizontal stroke thus has the highest frequency in the overall structure of Chinese characters and occupies an important position.
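The two averages quoted above can be checked by multiplying the mean stroke count per character by the 27.68% share of horizontal strokes. The short Python check below is only an illustration of that arithmetic; the function and variable names are chosen here and do not come from the source.

```python
# Share of horizontal strokes among all strokes (2004 figure quoted in the text).
HORIZONTAL_SHARE = 0.2768

def mean_horizontal_strokes(mean_strokes_per_char, horizontal_share=HORIZONTAL_SHARE):
    """Expected number of horizontal strokes in one character."""
    return mean_strokes_per_char * horizontal_share

full_set = mean_horizontal_strokes(12.8)    # full character set: 12.8 strokes on average
common_set = mean_horizontal_strokes(9.7)   # commonly used characters: 9.7 strokes on average

print(round(full_set, 2), round(common_set, 2))  # prints: 3.54 2.68
```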
[0035] Given these statistics and the actual condition of shredder fragments, each fragment contains a considerable number of complete or incomplete Chinese character structures, and, statistically, each of those structures is likely to contain several "horizontal stroke" structures. The method of the present invention exploits the "horizontal stroke" structure of Chinese characters: it emphasizes this structural feature, establishes connections between the structures inside Chinese characters, matches the Chinese character structures found in different fragments, and then splices the fragments into the entire document image, thereby restoring the entire Chinese document.
[0036] The method for restoring documents broken by a paper shredder based on text characteristics of the present invention specifically includes the following steps:
[0037] Step 1. Image digitization:
[0038] Image digitization describes image information in digital form without distortion. Because a broken paper Chinese document cannot be processed by a computer directly, it is digitized with a scanner or similar equipment and converted into digital images that image processing algorithms can operate on. During scanning, the paper fragments are fixed on an acquisition template; they should be flattened when fixed, to avoid tilt, wrinkles, and similar defects. The acquisition template can hold several paper fragments at once and can be scanned and reused repeatedly. The fragments are scanned with a general-purpose scanner, and the original output image f(x,y) is saved on the computer in BMP format. Because BMP image data is not compressed, the original data is preserved to the greatest possible extent for the next step.
[0039] Step 2. Image preprocessing:
[0040] The original image f(x, y) is processed in sequence with histogram equalization and image filtering, and then all fragment images are extracted from the background template using the 8-neighbor direction chain code method for subsequent stitching.
[0041] Step 2.1, histogram equalization:
[0042] Because the image is scanned in the scanner's basic mode, the resulting fragment image is often too dark or overexposed, and light and dark details are lost. Histogram equalization stretches the gray levels that contain many pixels and compresses those that contain few, so that the image becomes clear and evenly exposed.
[0043] Let the original image be f(x,y) and the image after histogram equalization be g(x,y), both of size m×n; the gray-level range of g(x,y) is 0~255.
[0044] First, obtain the gray-level histogram of the original image f(x,y), represented by a 256-dimensional vector H(k). H(k) is the probability of gray level k:
[0045] $H(k) = P(f_k) = n_k / N, \quad k = 0, 1, 2, \ldots, 255,$
[0046] where k is the gray level, ranging over 0~255; $f_k$ is the k-th gray value in the original image f(x,y); $P(f_k)$ is the proportion of pixels with the k-th gray value in f(x,y); $n_k$ is the number of pixels with gray value k in f(x,y); and N is the total number of pixels in f(x,y), N = m×n.
[0047] Second, the original image f(x,y) is equalized by mapping each gray level through the cumulative sum of H(k). When f(x,y) = s:
[0048] when f(x,y) ≠ 0, g(x,y) is obtained from the cumulative probability up to level s, s = 0, 1, 2, ..., 255,
[0049] when f(x,y) = 0, g(x,y) = 0,
[0050] where s denotes the gray level, ranging over 0~255.
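The equalization step can be sketched in a few lines of Python. The text does not reproduce the mapping formula itself, so the sketch assumes the common textbook form, in which the cumulative sum of H(k) up to level s is scaled to the range 0~255; that scaling, the rounding, and the function name `equalize` are assumptions made here, not details taken from the source. Gray level 0 is mapped to 0, as stated above.

```python
def equalize(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    total = sum(len(row) for row in image)        # N = m * n
    counts = [0] * levels
    for row in image:
        for value in row:
            counts[value] += 1                    # n_k
    hist = [c / total for c in counts]            # H(k) = n_k / N

    cdf, running = [], 0.0
    for h in hist:                                # cumulative sum of H(k)
        running += h
        cdf.append(running)

    # Assumed mapping: scale the cumulative sum to 0..levels-1; level 0 stays 0.
    lut = [0 if s == 0 else round((levels - 1) * cdf[s]) for s in range(levels)]
    return [[lut[v] for v in row] for row in image]

f = [[0, 50, 50, 200],
     [50, 50, 200, 200]]
g = equalize(f)
print(g)  # [[0, 159, 159, 255], [159, 159, 255, 255]]
```

In this small example, the levels 50 and 200 are spread out to 159 and 255, which is the stretching of heavily populated gray levels that the step is meant to achieve.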
[0051] Step 2.2, image filtering processing:
[0052] The edges of the fragments and of the template carry a great deal of noise, which strongly affects further processing, so the image must be filtered to remove the scattered noise.
[0053] First, the image g(x,y) is binarized. An appropriate threshold is chosen so that the fragments are clearly distinguished from the background template, giving the binarized image w(x,y):
[0054] $w(x,y) = \begin{cases} 1, & g(x,y) \ge Th \\ 0, & g(x,y) < Th \end{cases}$
[0055] Among them, Th is the threshold of the image.
[0056] Second, after binarization the noise is concentrated at the left, right, upper, and lower edges of the image, where it would interfere with the subsequent fragment extraction. Given where the noise lies, it is eliminated by horizontal and vertical projection of the image, yielding a denoised image. In horizontal projection, the image is projected column by column onto the X axis and the number of black points at each X position is counted; positions with fewer black points than a set threshold are treated as noise and set to white, which removes the noise at the left and right edges. In vertical projection, the image is projected row by row onto the Y axis and the number of black points at each Y position is counted; positions with fewer black points than the threshold are treated as noise and set to white, which removes the noise at the upper and lower edges. That is, with the noise signal denoted n(x,y) and the denoised image denoted e(x,y):
[0057] e(x,y)=w(x,y)-n(x,y).
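A minimal Python sketch of the binarization and projection-based denoising follows. It assumes that black is stored as 0 and white as 1, that one threshold `min_black` serves both projections, and that both projections are computed on w(x,y) before any pixel is changed; none of these details is fixed by the text, and the function names are illustrative only.

```python
BLACK, WHITE = 0, 1  # assumed encoding of the binary image

def binarize(g, th):
    """w(x,y) = 1 where g(x,y) >= Th, otherwise 0."""
    return [[WHITE if v >= th else BLACK for v in row] for row in g]

def remove_sparse_lines(w, min_black):
    """Whiten every column and row that holds fewer than min_black black pixels."""
    rows, cols = len(w), len(w[0])
    # Projection onto the X axis: black pixels per column.
    col_black = [sum(1 for y in range(rows) if w[y][x] == BLACK) for x in range(cols)]
    # Projection onto the Y axis: black pixels per row.
    row_black = [sum(1 for v in row if v == BLACK) for row in w]

    e = [row[:] for row in w]
    for x in range(cols):
        if col_black[x] < min_black:
            for y in range(rows):
                e[y][x] = WHITE
    for y in range(rows):
        if row_black[y] < min_black:
            e[y] = [WHITE] * cols
    return e

g = [[255, 255, 255, 255, 255],
     [10,  255, 20,  20,  255],   # the 10 at the left edge is an isolated noise pixel
     [255, 255, 20,  20,  255],
     [255, 255, 20,  20,  255]]
w = binarize(g, th=128)
e = remove_sparse_lines(w, min_black=2)
```

Here the lone black pixel in the first column is whitened because its column contains only one black pixel, while the two dense columns are kept unchanged.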
[0058] Step 2.3, fragment image extraction:
[0059] Because the subsequent splicing requires the fragments to be separated from the background template, the extraction is done with chain codes. A chain code encodes the edge of a fragment as a sequence of connected straight line segments of fixed length and direction. Each pixel in the image has exactly 8 adjacent pixels, which are assigned the directions 0 to 7. Given the position of a pixel P and the direction code of one of its neighbors, the position of that neighbor is known, so the 8-neighborhood direction chain code describes the edge of the target image exactly. The 8-neighborhood direction chain code is numbered as follows:
[0060] (Diagram: direction numbers 0 to 7 assigned to the 8 neighbors of pixel P.)
[0061] The coding process is as follows: scan the denoised image e(x,y) column by column from left to right; take the first white point encountered as the starting point; follow the boundary clockwise and, using the direction numbering rule, record the direction number of the line segment between each pair of consecutive boundary pixels; concatenating the direction numbers in order gives the chain code of the fragment edge. The fragment is then cut from the background image along its chain code and normalized to a standard fragment. This process is repeated until all fragment images have been obtained.
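The encoding of a boundary as direction numbers can be illustrated as follows. The sketch assumes the widely used Freeman numbering (0 pointing right, numbers increasing counter-clockwise, with y growing downward as in image coordinates); the numbering actually used by the method may differ. It also assumes the boundary pixels have already been traced in order, so it shows only the encoding and decoding, not the boundary following itself.

```python
# Assumed direction numbering around pixel P (y grows downward):
#   3 2 1
#   4 P 0
#   5 6 7
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(boundary):
    """Encode a closed boundary, given as ordered (x, y) pixels, as direction numbers."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes

def decode(start, codes):
    """Rebuild the boundary pixels from a start pixel and its chain code."""
    steps = {code: step for step, code in DIRECTIONS.items()}
    points = [start]
    for code in codes[:-1]:          # the last code only closes the loop
        dx, dy = steps[code]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points

# Boundary of a 3x3 square, listed clockwise from its top-left pixel.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
codes = chain_code(square)
print(codes)  # [0, 0, 6, 6, 4, 4, 2, 2]
```

Because the start pixel and the code sequence determine every boundary pixel, `decode(square[0], codes)` returns the original list, which is what allows a fragment to be cut out along its chain code.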
[0062] Step 3. Image stitching:
[0063] Image stitching is the core of the recovery of the entire broken Chinese document. First of all, through the method of image enhancement, the structure of Chinese characters is strengthened. Then, based on the statistical characteristics of Chinese strokes, the interconnection between different fragments is established through strokes. Finally, according to the principle of maximum correlation, the fragmented images are spliced together to recover the entire original Chinese document.
[0064] Step 3.1. After image preprocessing, and binarization in particular, the extracted Chinese characters are incomplete, which hinders the subsequent information recovery. Image enhancement is therefore applied, that is, the image is eroded and dilated. To strengthen the text structure, the binary image is processed with the opening operation: the image is first eroded and then dilated with the same structuring element. Opening removes small objects from the image, separates targets, and smooths them without significantly changing their area or shape.
[0065] Let the binary image be F and the structuring element be S. The opening is defined as:
[0066] $F \circ S = (F \ominus S) \oplus S$
[0067] where $F \circ S$ denotes the opening; $\ominus$ denotes erosion, i.e. the erosion of F by S is the set of all translations of S for which S is contained in F; and $\oplus$ denotes dilation, i.e. the dilation of F by S is the set of all displacement points at which the translated S overlaps F.
[0068] In practice, the erosion in the opening operation removes small burrs from the Chinese characters, and the subsequent dilation strengthens the text structure again without bringing the burrs back.
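The opening operation can be sketched in plain Python as erosion followed by dilation with the same structuring element. The sketch assumes text pixels are stored as 1 and background as 0, anchors the structuring element at its center, and uses a 1×3 horizontal element for the example; these choices are illustrative, since the text does not specify the structuring element.

```python
def erode(image, se):
    """Keep a pixel only where the structuring element, centred there, fits in the foreground."""
    rows, cols = len(image), len(image[0])
    cy, cx = len(se) // 2, len(se[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            out[y][x] = int(all(
                0 <= y + j - cy < rows and 0 <= x + i - cx < cols
                and image[y + j - cy][x + i - cx]
                for j, se_row in enumerate(se)
                for i, on in enumerate(se_row) if on))
    return out

def dilate(image, se):
    """Stamp the structuring element onto the output at every foreground pixel."""
    rows, cols = len(image), len(image[0])
    cy, cx = len(se) // 2, len(se[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if image[y][x]:
                for j, se_row in enumerate(se):
                    for i, on in enumerate(se_row):
                        yy, xx = y + j - cy, x + i - cx
                        if on and 0 <= yy < rows and 0 <= xx < cols:
                            out[yy][xx] = 1
    return out

def opening(image, se):
    """Opening: erosion followed by dilation with the same structuring element."""
    return dilate(erode(image, se), se)

image = [[0, 0, 0, 0, 0, 1, 0],   # isolated burr pixel
         [0, 0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 1, 1, 0],   # horizontal bar, five pixels long
         [0, 0, 0, 0, 0, 0, 0]]
opened = opening(image, [[1, 1, 1]])
```

In this example the isolated pixel disappears while the five-pixel bar is restored to its full length, matching the behavior described above.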
[0069] Step 3.2. For fragments of a Chinese document in scrambled order, the "horizontal stroke" feature is used to join them. Horizontal strokes play an important role in the structure of Chinese characters: on average each Chinese character contains about three "horizontal stroke" structures, and a horizontal stroke has good directional and linear properties. We therefore use the "horizontal stroke" feature to emphasize the content at the edges of the document fragments and, relying on its horizontal invariance, match the Chinese characters across fragments and then splice the entire Chinese document.
[0070] Because the image preprocessing is imprecise, the stroke structure of the text is incomplete and loose; in particular, some horizontal strokes are broken, which hinders the subsequent text matching. The "horizontal stroke" structures therefore need to be repaired.
[0071] Build a 5×3 template matrix M:
[0072] (5×3 template matrix M; its entries are not legible in this copy.)
[0074] Search inward from the left and right edges of each document fragment after the opening operation, to a depth of three pixels, and test whether the Chinese character pixels there match the matrix M. If they do, the structure is taken to be a "horizontal stroke" and is extended directly to the leftmost or rightmost side of the image; otherwise the structure is left unchanged. Search the entire fragment in this way and record the exact position of every edge "horizontal stroke" in the fragment image.
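The edge search and extension can be sketched roughly as below. The entries of the template matrix M are not available here, so the sketch replaces the template test with a stand-in rule: a pixel row qualifies if a run of at least `MIN_RUN` text pixels begins within three pixels of the edge. `MIN_RUN`, the function name, and the encoding (1 for text, 0 for paper) are all hypothetical, and the rule is not the template matching of the method itself.

```python
SEARCH = 3    # inward search range in pixels, as stated in the text
MIN_RUN = 3   # hypothetical: minimum run length treated as a horizontal stroke

def extend_edge_strokes(fragment, side):
    """Extend near-edge horizontal runs to the edge; return (new image, stroke rows).

    fragment: list of rows, 1 = text pixel, 0 = paper. side: "left" or "right".
    Stand-in rule (NOT the template M of the text): a row qualifies if a run of at
    least MIN_RUN text pixels starts within SEARCH pixels of the chosen edge.
    """
    out = [row[:] for row in fragment]
    rows_found = []
    for y, row in enumerate(fragment):
        line = row if side == "left" else row[::-1]   # measure from the chosen edge
        for start in range(min(SEARCH, len(line))):
            run = line[start:start + MIN_RUN]
            if len(run) == MIN_RUN and all(run):
                filled = [1] * start + line[start:]   # fill the gap up to the edge
                out[y] = filled if side == "left" else filled[::-1]
                rows_found.append(y)
                break
    return out, rows_found

fragment = [[0, 0, 0, 0, 0, 0],
            [0, 1, 1, 1, 1, 0],   # run one pixel away from both edges
            [0, 0, 0, 0, 1, 0],   # single pixel, not a stroke
            [0, 0, 0, 1, 1, 1]]   # run already touching the right edge
left_img, left_rows = extend_edge_strokes(fragment, "left")
right_img, right_rows = extend_edge_strokes(fragment, "right")
```

The returned row indices play the role of the recorded stroke positions; a stroke several pixels thick simply contributes several consecutive rows.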
[0075] Step 3.3. Compare the positions of the "horizontal strokes" in the left and right edge columns of two document fragment images. Where the position of a horizontal stroke is exactly the same in both, a Chinese character structure match is recorded, and the total number of stroke matches between the two fragments is counted. Taking the current fragment image as the reference, the comparison is repeated with every other fragment image. The two fragments with the largest total number of stroke matches are then taken to be adjacent. In addition, considering the number of Chinese characters in each fragment image and the structure of paragraphs in the text, a lower limit is placed on the total number of stroke matches: only when at least five horizontal strokes lie at the same positions are two fragment images considered possibly adjacent and eligible for stitching.
[0076] If the number of stroke matches alone were used as the criterion, several fragment images would often match the same fragment image. To avoid this, the principle of maximum correlation is applied: the fragment image with the largest number of matching "horizontal strokes" is taken as the adjacent image, and the two fragment images are then spliced together.
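The match counting and the maximum-correlation choice can be sketched as follows. The sketch assumes each fragment edge is summarized by the list of row positions at which a horizontal stroke reaches that edge, and that two positions match only when they are identical; the dictionary layout and function names are illustrative, and ties are resolved in favor of the first candidate, which the text does not specify.

```python
MIN_MATCHES = 5  # lower limit on matching strokes, from the text

def match_count(right_rows, left_rows):
    """Number of row positions where both edges carry a horizontal stroke."""
    return len(set(right_rows) & set(left_rows))

def best_right_neighbour(name, right_edges, left_edges):
    """Fragment whose left edge shares the most stroke rows with the right edge of `name`.

    Returns (neighbour, score); neighbour is None when the best score is below MIN_MATCHES.
    """
    best, best_score = None, 0
    for other, rows in left_edges.items():
        if other == name:
            continue
        score = match_count(right_edges[name], rows)
        if score > best_score:
            best, best_score = other, score
    if best_score < MIN_MATCHES:
        return None, best_score
    return best, best_score

right_edges = {"A": [3, 4, 10, 11, 20, 21, 35]}
left_edges = {"A": [7, 8],
              "B": [3, 4, 10, 11, 20, 21, 40],   # six rows in common with A
              "C": [3, 4, 10, 11, 20, 50, 51]}   # five rows in common with A
print(best_right_neighbour("A", right_edges, left_edges))  # ('B', 6)
```

Both B and C clear the lower limit of five, but only B is chosen, which is the situation the maximum-correlation rule is meant to resolve.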
[0077] Once two fragment images have been spliced, they are treated as a single fragment image. The above process is repeated under the same matching conditions, and the final image obtained is the restored Chinese document. The final result is saved on the computer in BMP format for viewing or further processing.
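The repeated splicing can be sketched as a greedy loop over strips of fragments. The sketch assumes the pairwise match counts are already available in a dictionary `score[(a, b)]`, meaning the number of matching strokes between the right edge of fragment a and the left edge of fragment b; that layout, the greedy order (always merging the best remaining pair), and the names used are assumptions made for illustration.

```python
def assemble(names, score, min_matches=5):
    """Greedily merge fragments into ordered strips by highest edge match count.

    Each strip is a left-to-right list of fragment names. A merged strip is treated
    as one fragment: its right edge is that of its last member, its left edge that
    of its first member.
    """
    strips = [[n] for n in names]
    while len(strips) > 1:
        best = None
        for i, left in enumerate(strips):
            for j, right in enumerate(strips):
                if i == j:
                    continue
                s = score.get((left[-1], right[0]), 0)
                if s >= min_matches and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:          # no remaining pair reaches the lower limit
            break
        _, i, j = best
        merged = strips[i] + strips[j]
        strips = [s for k, s in enumerate(strips) if k not in (i, j)] + [merged]
    return strips

score = {("C", "A"): 9, ("A", "D"): 7, ("D", "B"): 6, ("A", "B"): 5, ("B", "C"): 2}
order = assemble(["A", "B", "C", "D"], score)
print(order)  # [['C', 'A', 'D', 'B']]
```

If some fragments never reach the lower limit, the loop stops and returns several strips instead of one, leaving those joins undecided rather than forcing a weak match.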
[0078] Through the above steps, the broken Chinese document can be recovered, so that the broken document information can be used again.