Image recognition method, device and computer equipment comprising multiple lines of text
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SF TECH CO LTD
- Filing Date
- 2021-12-27
- Publication Date
- 2026-06-19
Figure CN116363656B_ABST
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to an image recognition method, apparatus, and computer device containing multi-line text.
Background Technology
[0002] With the continuous development of image processing technology, the automatic recognition of text in images using computers has gradually matured. In daily life, it is often necessary to recognize images containing multiple lines of text to obtain the text content within the image. For example, in the scenario of logistics code recognition, it is necessary to recognize logistics code images to obtain the logistics code information contained therein, which facilitates logistics management.
[0003] However, existing text recognition methods are accurate only for single-line text. In multi-line recognition tasks where the text has a two-line structure, the fixed-length numeric text lines are closely arranged with small line spacing, and the characters are often blurred, so the detection accuracy for each individual line tends to be low. As a result, abnormal situations such as detection deviation and missed detection occur easily, which degrades recognition performance.
[0004] Therefore, existing multi-line text recognition methods suffer from low recognition accuracy.
Summary of the Invention
[0005] The purpose of this application is to provide an image recognition method, apparatus, and computer device containing multi-line text, so as to improve the character recognition accuracy of multi-line text contained in an image.
[0006] In a first aspect, this application provides an image recognition method containing multi-line text, including:
[0007] Acquire the image to be recognized;
[0008] In response to the fact that the image to be recognized is a target image containing multiple lines of text, the target image is normalized to obtain a normalized target image;
[0009] The normalized target image is input into a trained text recognition model, which outputs the character matching probability. The trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image.
[0010] Based on the character matching probability, determine the text characters of the multi-line text contained in the image to be recognized.
[0011] In some embodiments of this application, the trained text recognition model includes a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer. The process of inputting a normalized target image into the trained text recognition model and outputting character matching probabilities includes: inputting the normalized target image into the trained text recognition model; extracting features from the normalized target image using the feature extraction layer to obtain an image feature map; performing feature dimension analysis on the image feature map using the data transformation layer to obtain an image matrix; classifying characters in the image matrix using the classification layer to obtain a character classification vector; and performing loss analysis on the character classification vector using the connectionist temporal classification layer to obtain the character matching probability.
[0012] In some embodiments of this application, the trained text recognition model further includes a recurrent network layer; wherein, after performing feature dimension analysis on the image feature map through the data transformation layer to obtain an image matrix, the method further includes: performing sequence analysis on the image matrix through the recurrent network layer to obtain a target matrix vector; wherein, the target matrix vector is used for character classification through the classification layer.
[0013] In some embodiments of this application, the data transformation layer includes a dimension splitting network, a dimension exchange network, and a dimension merging network; wherein, the image matrix is obtained by performing feature dimension analysis on the image feature map through the data transformation layer, including: splitting the image feature map into dimensions through the dimension splitting network to obtain the split image feature map; exchanging the dimensions of the split image feature map through the dimension exchange network to obtain the exchanged image feature map; and merging the dimensions of the exchanged image feature map through the dimension merging network to obtain the image matrix.
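As an illustration of the split, exchange, and merge operations described above, the following minimal Python sketch rearranges a small feature map into a sequence matrix. The axis layout (channels, height, width), the choice of splitting the height axis into `n_lines` groups, and the names `transform_features` and `n_lines` are assumptions made for this example; the source does not specify which dimensions are split, exchanged, or merged.

```python
def transform_features(fmap, n_lines):
    """Rearrange fmap[c][h][w] into matrix[t][f].

    Assumed layout: t = n_lines * W time steps, f = C * (H // n_lines) features.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    if H % n_lines:
        raise ValueError("feature-map height must be divisible by n_lines")
    rows = H // n_lines

    # 1) Dimension splitting: H -> (n_lines, rows), giving split[c][l][r][w].
    split = [[[fmap[c][l * rows + r] for r in range(rows)]
              for l in range(n_lines)]
             for c in range(C)]

    # 2) Dimension exchange: (C, L, R, W) -> (L, W, C, R).
    exchanged = [[[[split[c][l][r][w] for r in range(rows)]
                   for c in range(C)]
                  for w in range(W)]
                 for l in range(n_lines)]

    # 3) Dimension merging: (L, W) -> T and (C, R) -> F.
    return [[v for per_channel in exchanged[l][w] for v in per_channel]
            for l in range(n_lines)
            for w in range(W)]
```

Under these assumptions, a 1×2×3 feature map with `n_lines=2` becomes a single six-step sequence in which the columns of the first line are followed by the columns of the second line, so no per-line cropping of the image is required.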
[0014] In some embodiments of this application, before inputting the normalized target image into the trained text recognition model, the method further includes: constructing an initial text recognition model; the text recognition model consists of a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer; acquiring a multi-line text image set and dividing the multi-line text image set into a training set and a test set; the multi-line text image set includes images of multiple labeled text characters; the text characters are determined by querying a preset character sequence mapping table; training the initial text recognition model using the training set to obtain a pre-trained text recognition model; and testing and adjusting the pre-trained text recognition model using the test set to obtain a trained text recognition model.
[0015] In some embodiments of this application, obtaining a multi-line text image set includes: obtaining multi-line text images and annotating the multi-line text images with text characters to obtain annotated multi-line text images as candidate text images; analyzing the image format, image size, and / or image features of the candidate text images; selecting candidate text images that meet the preset model training conditions based on at least one of the image format, image size, and image features as target text images; and performing data augmentation on the target text images to obtain a multi-line text image set.
[0016] In some embodiments of this application, in response to the image to be recognized being a target image containing multi-line text, normalizing the target image to obtain a normalized target image includes: calling a trained text detection model, the trained text detection model including the EAST model; inputting the image to be recognized into the trained text detection model to obtain the model output result; in response to the model output result being a multi-line text rectangle, determining that the image to be recognized is a target image containing multi-line text; and normalizing the target image based on a preset interpolation method to obtain a normalized target image.
[0017] In a second aspect, this application provides an image recognition device containing multi-line text, comprising:
[0018] The image acquisition module is used to acquire the image to be recognized;
[0019] The image processing module is used to normalize the target image in response to the image to be recognized being a target image containing multiple lines of text, so as to obtain a normalized target image.
[0020] The text recognition module is used to input the normalized target image into the trained text recognition model and output the character matching probability; wherein, the trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image;
[0021] The character determination module is used to determine the text characters of the multi-line text contained in the image to be recognized based on the character matching probability.
[0022] In a third aspect, this application also provides a computer device, comprising:
[0023] One or more processors;
[0024] A memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the image recognition method containing multi-line text described above.
[0025] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, the computer program being loaded by a processor to perform the steps of the image recognition method containing multi-line text.
[0026] In a fifth aspect, embodiments of this application provide a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method provided in the first aspect described above.
[0027] In the aforementioned image recognition method, apparatus, and computer device for multi-line text, a server acquires an image to be recognized. In response to the image being a target image containing multi-line text, the server normalizes the target image to obtain a normalized target image. The normalized target image is then input into a trained text recognition model, which outputs character matching probabilities. Based on these probabilities, the text characters of the multi-line text contained in the image can be determined. The trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image. By detecting and recognizing multi-line text as a whole, hard detection of individual text lines in the image is avoided, thereby improving the accuracy of character recognition for multi-line text in an image.
Description of the Drawings
[0028] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a schematic diagram of a scenario for the image recognition method containing multi-line text provided in the embodiments of this application;
[0030] Figure 2 is a flowchart illustrating the image recognition method containing multi-line text provided in the embodiments of this application;
[0031] Figure 3 is a first schematic diagram of the recognition results of the multi-line text image provided in the embodiments of this application;
[0032] Figure 4 is a schematic diagram of the structure of the text detection model provided in the embodiments of this application;
[0033] Figure 5 is a second schematic diagram of the recognition results of the multi-line text image provided in the embodiments of this application;
[0034] Figure 6 is a schematic diagram of the structure of the image recognition device containing multi-line text provided in the embodiments of this application;
[0035] Figure 7 is a schematic diagram of the structure of the computer device provided in the embodiments of this application.
Detailed Implementation
[0036] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0037] In the description of this application, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the stated features. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0038] In the description of this application, the term "for example" is used to mean "used as an example, illustration, or description." Any embodiment described as "for example" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use the invention. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that the invention can be made without using these specific details. In other instances, well-known structures and processes will not be described in detail to avoid obscuring the description of the invention with unnecessary detail. Therefore, the invention is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.
[0039] The image recognition method containing multi-line text provided in the embodiments of this application mainly relates to computer vision (CV) technology within artificial intelligence (AI). Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine capable of reacting in a manner similar to human intelligence.
[0040] Computer vision is a science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing, tracking, and measuring targets, and then performs image processing to create images more suitable for human observation or transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content / behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), and also common biometric recognition technologies such as facial recognition and fingerprint recognition. In this application, for the image to be recognized, CV mainly implements image detection and image recognition in Image Semantic Understanding (ISU), detecting and recognizing target objects in the image and outputting results. It is understood that the target object can be multi-line text.
[0041] The image recognition method containing multi-line text provided in this embodiment can be applied to the image recognition system containing multi-line text shown in Figure 1. This system comprises a terminal 102 and a server 104. The terminal 102 can be a device that includes both receiving and transmitting hardware, i.e., a device capable of performing bidirectional communication over a bidirectional communication link. Such a device can include cellular or other communication devices, having a single-line display, a multi-line display, or no multi-line display. Specifically, the terminal 102 can be a desktop terminal or a mobile terminal; it can also be a mobile phone, tablet computer, or laptop computer. The server 104 can be a standalone server, or a server network or server cluster, including but not limited to computers, network hosts, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers, where the cloud server consists of a large number of computers or network servers based on cloud computing. Furthermore, the terminal 102 and the server 104 establish a communication connection through a network, which can be any of a wide area network (WAN), local area network (LAN), or metropolitan area network (MAN).
[0042] Those skilled in the art will understand that the application environment shown in Figure 1 is merely one applicable scenario for the solution in this application and does not constitute a limitation on the application scenario of the solution. Other application environments may include more or fewer devices than those shown in Figure 1. For example, Figure 1 shows only one server. It is understood that this image recognition system containing multi-line text may also include one or more other devices, which are not specifically limited here. Additionally, as shown in Figure 1, the image recognition system containing multi-line text may also include a memory for storing data, such as the image to be recognized.
[0043] It should be noted that the schematic diagram of the image recognition system containing multi-line text shown in Figure 1 is merely an example. The image recognition system and scenario containing multi-line text described in the embodiments of the present invention are intended to illustrate the technical solutions of the embodiments more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present invention. As those skilled in the art will know, with the evolution of image recognition systems containing multi-line text and the emergence of new business scenarios, the technical solutions provided by the embodiments of the present invention are also applicable to similar technical problems.
[0044] Referring to Figure 2, an embodiment of this application provides an image recognition method containing multi-line text. This embodiment is described mainly by taking the application of the method to the server 104 in Figure 1 above as an example. The method includes steps S201 to S204, as follows:
[0045] S201, Obtain the image to be recognized.
[0046] The image to be recognized can be an image of the outer packaging of a specified item, or an image of an object collected at a specified business site, including but not limited to pictures and video frames within videos. The videos include but are not limited to short videos and long videos; a short video can be a video less than 10 minutes long, and a long video can be a video longer than 10 minutes. The business sites include but are not limited to logistics sites such as parcel sorting sites and item packing sites.
[0047] In specific implementation, the server 104 can acquire images to be recognized from cameras installed at designated business sites according to business needs. These cameras can be monocular or multi-view cameras. The server 104 can also capture images of designated items according to business needs to obtain images to be recognized for subsequent analysis. Of course, the server 104 can also acquire images to be recognized through other devices according to business needs. For example, it can acquire, through the terminal 102, images captured by cameras installed at designated business sites as images to be recognized, or acquire, through the terminal 102, images of the outer packaging of designated items as images to be recognized. This application therefore does not specifically limit the method of acquiring images to be recognized.
[0048] Furthermore, the image to be recognized that the server 104 currently acquires and uses as the basis for subsequent processing can also be a pre-processed image. Pre-processing includes, but is not limited to, cleaning and adjustment. For example, after acquiring an initial image to be recognized through one of the methods listed above, the server 104 can perform cleaning and/or adjustment on it, including, but not limited to, removing duplicate or unreadable images and adjusting the image size, color, etc., to obtain the image to be recognized. The image to be recognized can be a single frame or multiple frames.
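The cleaning step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: images are represented as raw byte strings, "duplicate" is taken to mean byte-identical content, an empty byte string stands in for an unreadable image, and the function name `clean_images` is hypothetical.

```python
import hashlib


def clean_images(blobs):
    """Drop empty (treated as unreadable) and byte-identical duplicate images."""
    seen = set()
    kept = []
    for blob in blobs:
        if not blob:
            # Assumption: an empty payload represents an unreadable image.
            continue
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            # Same content was already kept once.
            continue
        seen.add(digest)
        kept.append(blob)
    return kept
```

The order of the first occurrences is preserved, so downstream steps still see frames in acquisition order.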
[0049] S202, in response to the image to be recognized being a target image containing multiple lines of text, the target image is normalized to obtain a normalized target image.
[0050] Multi-line text is text of at least two lines, that is, text arranged in parallel on different lines. For example, Figure 3 shows an image of the outer packaging of a specified item, which contains two-line text in more than one place.
[0051] In the specific implementation, after the server 104 obtains the image to be recognized, it can use a preset algorithm to perform multi-line text detection on the image, so as to filter out the target image containing multi-line text as the basis for subsequent analysis. The target image is then normalized, that is, its image size is adjusted to a specified size, to obtain an image that meets the image-size requirements of the subsequent analysis steps.
[0052] It should be noted that the analysis step involved in this embodiment is a text detection step, whose purpose is to perform multi-line text detection on the image to be recognized and determine whether it is a target image containing multi-line text, so that text recognition can finally be performed on the target image to identify the text characters of the multi-line text it contains. However, traditional text recognition technology is limited when dealing with multi-line text. For example, when character recognition is performed on the two-line text shown in Figure 3, the close arrangement of the two fixed-length numeric text lines, the small line spacing, and the blurred characters frequently cause anomalies such as detection deviation and missed detection, making it difficult to detect each line of text accurately. For example, after the multi-line text in Figure 3 is analyzed by traditional text recognition technology, a character string corresponding to the two-line text is obtained, but it exhibits a detection deviation: the string that should be detected, “503981171122077659310081”, is detected as “503981171122077658310081”.
[0053] Therefore, this application proposes that, before text recognition is performed, the multi-line text is treated as a whole and detected using a rotated rectangle. The multi-line text is then recognized as a whole, i.e., double-line recognition without line segmentation. The models used include a text detection model and a text recognition model, whose application steps are described in detail below.
[0054] In one embodiment, this step includes: invoking a trained text detection model, the trained text detection model including the EAST model; inputting the image to be recognized into the trained text detection model to obtain the model output result; in response to the model output result being a multi-line text rectangle, determining that the image to be recognized is a target image containing multi-line text; and normalizing the target image based on a preset interpolation method to obtain a normalized target image.
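The normalization step resizes the target image to a specified size by interpolation. The source only says the interpolation method is "preset", so the sketch below assumes nearest-neighbour interpolation on a grayscale image stored as a list of rows; the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(img, out_h, out_w):
    """Resize a 2-D image (list of rows) with nearest-neighbour interpolation."""
    in_h, in_w = len(img), len(img[0])
    return [
        [
            # Map each output pixel back to its nearest source pixel.
            img[min(in_h - 1, y * in_h // out_h)][min(in_w - 1, x * in_w // out_w)]
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

Other interpolation methods (for example bilinear) differ only in how the source pixels around each mapped position are combined.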
[0055] The EAST (Efficient and Accurate Scene Text detector) model is a fully convolutional network consisting of three main parts: a feature extraction layer, a feature fusion layer, and an output layer. Since the text in an image varies in size, it is necessary to fuse feature maps from different levels: predicting small text requires lower-level semantic information, while predicting large text requires higher-level semantic information.
[0056] In practice, the trained text detection model can be installed on other devices, such as terminal 102. After acquiring the image to be recognized, server 104 can request terminal 102 to call the trained text detection model for text detection. Of course, the trained text detection model can also be installed on server 104, in which case server 104 does not need to send a request to other devices when using it.
[0057] Specifically, when the text detection model uses the EAST model, the EAST model takes the normalized image to be recognized as input and outputs three feature maps, "Fe, Fa, and Fs", whose spatial dimensions are 1/4 of those of the input image. Fs has "1" channel and is activated using the sigmoid function; it predicts the probability that each pixel position lies within the foreground target (i.e., the bounding rectangle of the double-line text shown in Figure 3). Fe has "4" channels and uses no activation function; it predicts the distance of each pixel position from the top, bottom, left, and right sides of the bounding rectangle of the foreground target (values outside the foreground target are zero). Fa has "1" channel and is activated using the tanh function; it predicts the angle of the bounding rectangle of the foreground target at each pixel position (values outside the foreground target are zero). See Figure 4, which is a structural diagram of the EAST model involved in this embodiment.
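The per-pixel output heads described above can be sketched for a single pixel position as follows. The function names `output_size` and `east_heads` and the raw pre-activation inputs are hypothetical; only the channel counts, the activations, and the 1/4 resolution come from the text.

```python
import math


def output_size(in_h, in_w):
    """Spatial size of Fs, Fe and Fa: 1/4 of the input in each dimension."""
    return in_h // 4, in_w // 4


def east_heads(raw_score, raw_distances, raw_angle):
    """Apply the per-head activations at one pixel position."""
    if len(raw_distances) != 4:
        raise ValueError("Fe expects 4 channels: top, bottom, left, right")
    fs = 1.0 / (1.0 + math.exp(-raw_score))  # Fs: 1 channel, sigmoid
    fe = tuple(raw_distances)                # Fe: 4 channels, no activation
    fa = math.tanh(raw_angle)                # Fa: 1 channel, tanh
    return fs, fe, fa
```

For a 512×512 input, each of the three maps would therefore be 128×128, with Fs in (0, 1) and Fa in (-1, 1).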
[0058] Furthermore, this embodiment proposes using a trained text detection model to analyze the image and obtain the target image. Before the trained text detection model can be called, the text detection model needs to be trained, and this training requires data-augmented sample images. For example, in the data preparation stage before model training, the initial sample images can be labeled and augmented to obtain a set of augmented images and a corresponding set of converted label data. During labeling, two lines of text can be treated as a single line and enclosed in a rotated rectangle. Data augmentation can improve the model's generalization ability and, to a certain extent, the accuracy of the model's predictions.
[0059] Therefore, this application proposes using some preset data augmentation strategies, applied through random selection and/or combination, to obtain images and corresponding label data amounting to ten times the initial data volume. The images are then transformed to a size of 512×512 by random cropping, proportional scaling, and padding with "0", and the label data is transformed accordingly. The purpose of the label transformation is to convert the rotated rectangular bounding box label format into a form that can be used to calculate the loss against the model's prediction output layer, that is, into three three-dimensional numerical matrices. Here, the EAST model's own data transformation method is used to obtain the labels "Me, Ma, Ms" corresponding to "Fe, Fa, and Fs".
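The proportional scaling and zero padding to 512×512, together with the matching transformation of the label coordinates, can be sketched as below. The function names `fit_to_square` and `scale_point`, and the choice to place all padding at the bottom and right edges, are assumptions for this example; random cropping is omitted.

```python
def fit_to_square(h, w, size=512):
    """Return (scale, new_h, new_w, pad_bottom, pad_right) for a size x size canvas."""
    scale = min(size / h, size / w)  # keep the aspect ratio
    new_h = min(size, max(1, round(h * scale)))
    new_w = min(size, max(1, round(w * scale)))
    # Remaining area is filled with zeros (assumed: bottom and right only).
    return scale, new_h, new_w, size - new_h, size - new_w


def scale_point(point, scale):
    """Apply the same scaling to one labelled corner (x, y) of the rotated rectangle."""
    x, y = point
    return x * scale, y * scale
```

Because the padding is assumed to lie at the bottom and right, label coordinates only need to be scaled, not shifted.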
[0060] The model's loss function is $L = L_e + L_s + L_a$, where $L_e$, $L_s$, and $L_a$ represent the losses of "Fe", "Fs", and "Fa" respectively, and are calculated according to the following formulas (1)-(3).
[0061]
[0062]
[0063] $L_a = 1 - \cos\left(F_a \cdot \pi/2 - M_a\right)$ (3)
[0064] In formula (1), the two "R" terms represent the areas of the bounding boxes calculated based on "Fe" and "Me" respectively. Training uses an initial learning rate of "1e-4", an exponential decay value of "0.997", a decay step size of "4000", and a batch size of "12", for 10,000 training iterations. The EAST model's own post-processing method is used to obtain the predicted bounding box of the entire double-line text.
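Formula (3) and the training schedule can be illustrated with a short sketch. It assumes that the initial value 1e-4, the decay value 0.997, and the decay step 4000 describe an exponentially decaying learning rate, and that the decay is applied continuously unless `staircase` is set; the function names `angle_loss` and `learning_rate` are hypothetical.

```python
import math


def angle_loss(fa, ma):
    """Formula (3): fa is the tanh output in [-1, 1], ma the labelled angle in radians."""
    return 1.0 - math.cos(fa * math.pi / 2 - ma)


def learning_rate(step, base=1e-4, decay=0.997, decay_steps=4000, staircase=False):
    """Exponential decay: base * decay ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base * decay ** exponent
```

The loss is zero when the scaled prediction `fa * pi / 2` equals the labelled angle and grows as the two diverge. With these values, the rate after 4000 steps is 0.997 times the initial rate.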
[0065] Furthermore, after the text detection model is trained, the server 104 can input the image to be recognized into the trained text detection model to obtain the model output. If the model output is a multi-line text rectangle in the image to be recognized, meaning that the model has detected and marked a multi-line text rectangle, then the image to be recognized can be determined to be a target image containing multi-line text. It is understood that if the model output is a single-line text rectangle, another non-text rectangle, or no multi-line text rectangle at all, then the image to be recognized can be discarded, and a new image to be recognized can be acquired for text detection until a target image containing multi-line text is obtained.
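The accept-or-discard loop described above can be sketched as follows. The detector interface is entirely hypothetical: `detect` is assumed to return either `None` or a dictionary with a `"kind"` field and a `"box"` field, which is not something the source specifies.

```python
def first_target_image(images, detect):
    """Return (image, box) for the first image whose detection is a multi-line text box.

    `detect` is a hypothetical callable standing in for the trained text
    detection model; images with any other result are discarded.
    """
    for image in images:
        result = detect(image)
        if result is not None and result.get("kind") == "multi_line":
            return image, result["box"]
    return None
```

Returning `None` when no image qualifies lets the caller decide whether to acquire further images.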
[0066] S203, the normalized target image is input into the trained text recognition model, and the character matching probability is output; wherein, the trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image.
[0067] The trained text recognition model includes a data transformation layer, which is used to perform feature dimension analysis on the normalized target image. This application proposes using the data transformation layer for analysis in the feature dimension in order to overcome the shortcomings of traditional analysis in the image dimension.
[0068] The character matching probability can be the matching probability of each sequence number in the character sequence number mapping table, where each sequence number corresponds to a preset character. For example, the character matching probabilities for a certain character to be recognized are "0.2, 0.35, 0.57, 0.22...0.9", where each probability value corresponds to one character in the character sequence number mapping table, and "0.9" is determined to be the maximum probability after comparison.
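Selecting the character with the maximum matching probability can be sketched as below. The five-entry table and the function name `best_character` are illustrative assumptions; the probabilities reuse the values quoted in the text, with the elided middle entries left out.

```python
def best_character(probs, table):
    """Return (character, probability) for the highest matching probability."""
    if len(probs) != len(table):
        raise ValueError("one probability is required per table entry")
    index = max(range(len(probs)), key=probs.__getitem__)
    return table[index], probs[index]


# Hypothetical five-character table; the index is the sequence number.
char, prob = best_character([0.2, 0.35, 0.57, 0.22, 0.9], "01234")
```

Here the last table entry is selected because its probability, 0.9, is the largest.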
[0069] In specific implementation, this application proposes to analyze from the feature dimension rather than from the image dimension used in traditional technology. The reason is that traditional technology has defects when recognizing images, such as failing to find the boundaries of each line of text, which leads to incorrect text segmentation and thus reduces the accuracy of multi-line text recognition. Analyzing from the feature dimension mitigates these problems and thus improves accuracy.
[0070] Specifically, after server 104 obtains the normalized target image, it can input the normalized target image into the trained text recognition model to obtain the character matching probability. Then, it can determine the text characters of the multi-line text contained in the image to be recognized based on the character matching probability. Before this, the text recognition model needs to be properly trained. The model training steps for the text recognition model will be described in detail below.
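One common way to turn per-step character matching probabilities from a connectionist temporal classification output into text is greedy decoding: take the most probable index at each step, collapse consecutive repeats, and drop the blank symbol. The source does not state which decoding rule is used, so the sketch below is an assumption, as are the function name `ctc_greedy_decode` and the position of the blank index.

```python
def ctc_greedy_decode(step_probs, table, blank):
    """Greedy CTC decoding: argmax per step, merge repeats, remove blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in step_probs]
    chars = []
    previous = None
    for index in best:
        # Emit only when the index changes and is not the blank symbol.
        if index != previous and index != blank:
            chars.append(table[index])
        previous = index
    return "".join(chars)
```

A blank between two identical indices keeps them as two separate characters, which is how repeated digits in a logistics code would survive the repeat-merging step.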
[0071] In one embodiment, prior to this step, the method further includes: constructing an initial text recognition model; the text recognition model consists of a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer; acquiring a multi-line text image set and dividing the multi-line text image set into a training set and a test set; the multi-line text image set includes images of multiple labeled text characters; the text characters are determined by querying a preset character sequence mapping table; training the initial text recognition model using the training set to obtain a pre-trained text recognition model; and testing and adjusting the pre-trained text recognition model using the test set to obtain a trained text recognition model.
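Dividing the multi-line text image set into a training set and a test set can be sketched as follows. The 80/20 ratio, the fixed shuffle seed, and the function name `split_dataset` are assumptions; the source does not give a split ratio.

```python
import random


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle the labelled samples and split them into (train, test)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]
```

Every sample ends up in exactly one of the two sets, so the test set measures performance on images the model was not trained on.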
[0072] The character sequence number mapping table can be a mapping relationship between characters and numeric sequences. For example, the characters "a", "b", and "c" are mapped to the numeric sequences "0", "1", and "2". It should be noted that the character sequence number mapping table can be a word table Vob of length T, used in the data preparation and model application stages to map each character in the tag string to a sequence number in the word table Vob. The value range is [0, T-1]. Characters not in the word table are called off-table characters or unknown characters, and are uniformly mapped to T.
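The mapping described above can be illustrated with a minimal sketch. The function names `build_vocab` and `encode_label` are hypothetical and are not taken from this application; the sketch assumes that the characters of the word table are unique.

```python
def build_vocab(chars):
    """Map each character of the word table to its sequence number in [0, T-1]."""
    return {ch: idx for idx, ch in enumerate(chars)}


def encode_label(label, vocab):
    """Convert a label string to sequence numbers; off-table characters map to T."""
    unknown_index = len(vocab)  # T, the index reserved for unknown characters
    return [vocab.get(ch, unknown_index) for ch in label]


vocab = build_vocab("abc")            # T = 3: a -> 0, b -> 1, c -> 2
encoded = encode_label("cab?", vocab)  # "?" is not in the word table
```

Here the off-table character "?" receives the sequence number T = 3, which matches the rule that unknown characters are uniformly mapped to T.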
[0073] In its specific implementation, the text recognition model consists of a feature extraction layer, a data transformation layer, a classification layer, and a Connectionist Temporal Classification (CTC) layer. The server 104 can perform model training before executing the text recognition task, or before acquiring the image to be recognized. This embodiment does not specifically limit when the model training operation is performed, but it is certain that the model training task must be completed before calling the trained model. Furthermore, the model training task can be executed by the server 104, or by other devices that have established a communication connection with the server 104, such as the terminal 102.
[0074] Furthermore, to obtain the trained text recognition model for subsequent steps, an initial text recognition model must first be constructed. Then, the server 104 or other device responsible for performing the model training task can acquire image data for training the model, forming a multi-line text image set. At this time, the server 104 or other device can acquire a small number of images with labeled multi-line text characters. Then, the images are augmented to obtain a large number of images, serving as the multi-line text image set required for subsequent model training. The steps for acquiring the multi-line text image set involved in this embodiment will be described in detail below.
[0075] Furthermore, multi-line text image sets can be used to train models, including but not limited to pre-training and preliminary training. Multi-line text image sets can also be used to debug models, including but not limited to test adjustments. Specifically, if there is a need for preliminary training and test adjustments, after obtaining the multi-line text image set, it can be divided into a training set and a test set. The training set can be used to perform preliminary training on the initial text recognition model, and then the test set can be used to test and adjust the pre-trained text recognition model, resulting in a trained text recognition model.
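As an illustration only, the division into a training set and a test set could be sketched as follows. The 80/20 ratio, the fixed random seed, and the name `split_dataset` are assumptions made for this example; this embodiment does not specify them.

```python
import random


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples and return (training_set, test_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for labeled multi-line text images.
image_set = [f"image_{i:02d}.png" for i in range(10)]
training_set, test_set = split_dataset(image_set)
```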
[0076] It should be noted that the model training stopping conditions that can be selected by those skilled in the art include at least one of the following: (1) the error is less than a certain preset small value; (2) the weight change between two iterations is already very small, a threshold can be set, and training stops when it is less than this threshold; (3) a maximum number of iterations is set, and training stops when the number of iterations exceeds the maximum number, for example, "200 cycles"; (4) the recognition accuracy reaches a certain preset large value. The data augmentation steps involved in this embodiment will be described in detail below.
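The four stopping conditions listed above could be combined as in the following sketch. Apart from the maximum of 200 iterations, which is taken from the example above, all threshold values and the name `should_stop` are hypothetical.

```python
def should_stop(loss, weight_delta, iteration, accuracy,
                loss_eps=1e-3, delta_eps=1e-6,
                max_iterations=200, target_accuracy=0.99):
    """Return True when at least one of the four stopping conditions holds."""
    return (
        loss < loss_eps                  # (1) error below a preset small value
        or weight_delta < delta_eps      # (2) weight change below a threshold
        or iteration > max_iterations    # (3) iteration count exceeds the maximum
        or accuracy >= target_accuracy   # (4) accuracy reaches a preset large value
    )
```

A training loop would call such a check once per iteration and stop at the first iteration for which it returns True.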
[0077] In one embodiment, obtaining a multi-line text image set includes: obtaining multi-line text images and annotating the multi-line text images with text characters to obtain annotated multi-line text images as candidate text images; analyzing the image format, image size, and / or image features of the candidate text images; selecting candidate text images that meet the preset model training conditions based on at least one of the image format, image size, and image features as target text images; and performing data augmentation on the target text images to obtain a multi-line text image set.
[0078] Data augmentation can be viewed as a smooth conversion process from one type of image to another. In this embodiment, data augmentation may include at least one of the following: perspective transformation, Gaussian blur, noise addition, and HSV (hue, saturation, value) channel color transformation. In addition, data augmentation may also include: brightness adjustment, contrast adjustment, pixel adjustment, angle adjustment, noise adjustment, Mosaic enhancement, Mixup enhancement, etc.
[0079] In a specific implementation, the annotation tool used to annotate text characters can be labelImg, which is written in Python. It runs across platforms such as Windows and Linux, and a specified target object, such as multi-line text, can be marked by drawing a box in its visual operation interface.
[0080] Furthermore, before acquiring the multi-line text image set, server 104 can first acquire multi-line text images. These multi-line text images can be images from terminal 102 or other devices, or images pre-stored in the database by server 104. After acquiring the multi-line text images, server 104 can use the annotation tools described above or other methods to annotate the multi-line text images with text characters to obtain candidate text images.
[0081] However, the candidate text images obtained at this time cannot be directly used to form a multi-line text image set, because they are very likely to contain images that cannot be used for training. Therefore, in order to avoid affecting the training effect, the server 104 needs to filter out abnormal images that cannot be read normally, are too small, or are duplicates after obtaining the candidate text images. That is, it needs to obtain at least one of the three data of the candidate text images: image format, image size, and image features, in order to filter out candidate text images that meet the preset model training conditions and use them as target text images.
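A minimal sketch of this filtering step is given below. It assumes that each candidate is described by a record with the hypothetical fields `format`, `width`, `height`, and `digest`; the `digest` field stands in for whatever image feature is used to detect duplicates. The allowed formats and the minimum side length are likewise assumptions.

```python
def filter_candidates(candidates, allowed_formats=("jpg", "png"), min_side=32):
    """Keep candidates that are readable, large enough, and not duplicates."""
    seen_digests = set()
    targets = []
    for item in candidates:
        if item["format"].lower() not in allowed_formats:
            continue  # format cannot be read normally
        if min(item["width"], item["height"]) < min_side:
            continue  # image is too small
        if item["digest"] in seen_digests:
            continue  # duplicate of an image that was already kept
        seen_digests.add(item["digest"])
        targets.append(item)
    return targets


candidates = [
    {"name": "a", "format": "JPG", "width": 320, "height": 64, "digest": "d1"},
    {"name": "b", "format": "tmp", "width": 320, "height": 64, "digest": "d2"},
    {"name": "c", "format": "png", "width": 16, "height": 8, "digest": "d3"},
    {"name": "d", "format": "png", "width": 320, "height": 64, "digest": "d1"},
]
targets = filter_candidates(candidates)
```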
[0082] Furthermore, since the imaging modes and quality of cameras from different brands often vary greatly, models trained using specific types of images are usually incompatible with other types of images, resulting in insufficient model recognition accuracy. Therefore, embodiments of this application propose employing one or more of the above data augmentation strategies to augment the target text image, making the distribution of different types of images more continuous, thereby improving the model's generalization ability and ultimately increasing the recognition accuracy of multi-line text characters.
[0083] For example, motion blur can be randomly added to an image in a specific direction: select a direction from "0-359" degrees, add motion blur to image X in that direction, and then input the motion-blurred image X into three different models for training. Alternatively, Gaussian noise can be randomly added: using Python's imgaug library, Gaussian noise can be added to image X, with each pixel sampled once from a normal distribution N(0, 0.05*255). The Gaussian noise-added image X can then be input into three different models for training. By using these data augmentation methods, images can be fitted to different camera imaging modes, allowing the models to become familiar with various imaging modes and thus improving the accuracy of two-line text recognition.
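The Gaussian noise step can be sketched without the imgaug library, using only the Python standard library. The sketch assumes a grayscale image stored as a nested list of pixel values in [0, 255], and it treats 0.05*255 as the standard deviation of the normal distribution; both points are assumptions of this example rather than statements of this embodiment.

```python
import random


def add_gaussian_noise(image, sigma=0.05 * 255, seed=None):
    """Add noise sampled once per pixel from N(0, sigma), clipped to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


image = [[0, 128, 255], [10, 20, 30]]
noisy = add_gaussian_noise(image, seed=0)
```

Clipping keeps the augmented pixel values within the valid range, so the output can be fed to the model in the same way as the original image.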
[0084] In one embodiment, the trained text recognition model includes a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer. The process of inputting a normalized target image into the trained text recognition model and outputting character matching probabilities includes: inputting the normalized target image into the trained text recognition model; extracting features from the normalized target image using the feature extraction layer to obtain an image feature map; performing feature dimension analysis on the image feature map using the data transformation layer to obtain an image matrix; classifying characters in the image matrix using the classification layer to obtain a character classification vector; and performing loss analysis on the character classification vector using the connectionist temporal classification layer to obtain the character matching probability.
[0085] In the specific implementation, after the server 104 obtains the normalized target image, it can input it into the trained text recognition model, so that the image is analyzed and processed sequentially by the feature extraction layer, data transformation layer, classification layer, and CTC layer. The normalized target image has dimensions of "H*W*C", where "H", "W", and "C" represent the height, width, and number of image channels, respectively. "C" is typically "1" for grayscale images and "3" for color images. In addition, the normalized target image is accompanied by an image annotation, "Label", which is a sequence of character sequence numbers from the preset character sequence number mapping table.
[0086] Specifically, the feature extraction layer refers to a convolutional network composed of convolutional layers, pooling layers, and normalization layers, connected through adjacent layers or skip connections. It downsamples the input image by a preset factor (e.g., 32 times) in both the height and width directions. Its input is an image I, and its output is a three-dimensional feature map F with the shape $[H_f, W_f, C_f]$ (where $H_f = H / 32$, $W_f = W / 32$, and $C_f$ is the number of feature channels). As mentioned above, $H_f$ at this time is a multiple of 2.
[0087] Furthermore, the input to the data transformation layer is the feature map F output by the feature extraction layer, and the output is a two-dimensional matrix M. The data transformation does not change the total number of values (which remains $H_f \times W_f \times C_f$); it only changes the shape of the matrix and the order of its internal values. The classification layer is a fully connected layer (FC layer) with an input dimension of $C_r$ (if a recurrent network layer is used; the recurrent network layer is explained below) or $C_f$ (if no recurrent network layer is used). Its output dimension is "T+2", corresponding to the following classes: characters from the character set (T in number, with class indices in the range [0, T-1]), unknown characters (class index T), and blank or non-character (denoted CTC_Blank, class index T+1). Each vector in R (if no RNN layer is used, R is the column vector obtained by expanding matrix M along its first dimension) is passed through the classification layer to obtain a classification vector P with dimension T+2. The CTC layer is specifically used to calculate the CTC loss. Its input is the classification vector P obtained from the classification layer and the image label Label, and its output is a numerical value ctc_loss, representing the CTC loss. The model parameters can be optimized by taking the derivative of ctc_loss with respect to the model variables.
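The shape bookkeeping described above can be checked with a short worked example. The input size of 64 by 320, the 512 feature channels, and the word table size of 10 are hypothetical values chosen for illustration, and the sketch assumes that the image height and width are multiples of the downsampling factor and that no recurrent network layer is used.

```python
def pipeline_shapes(height, width, feature_channels, table_size, downsample=32):
    """Return the shapes of feature map F, matrix M, and classification output P."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    h_f, w_f = height // downsample, width // downsample
    if h_f % 2:
        raise ValueError("H_f must be a multiple of 2")
    c_w = (h_f // 2) * feature_channels  # columns of M
    steps = 2 * w_f                      # rows of M, one per classified vector
    return {
        "F": (h_f, w_f, feature_channels),
        "M": (steps, c_w),
        "P": (steps, table_size + 2),    # T characters + unknown + CTC_Blank
    }


shapes = pipeline_shapes(height=64, width=320, feature_channels=512, table_size=10)
```

For these values, F contains 2 × 10 × 512 values and M contains 20 × 512 values, so the total number of values is unchanged by the data transformation.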
[0088] Furthermore, with the function of each layer explained above, the practical application is as follows: after the target image is normalized using the method described above, it is input into the trained text recognition model. From the classification layer, a series of classification vectors P is obtained, where each vector represents the probabilities of the T+2 categories, i.e., the character matching probability. For each vector, the character corresponding to the position with the highest probability is selected, resulting in a series of characters. Then, merging adjacent identical characters and removing the "CTC_Blank" entries mentioned above yields the final result, which is the text characters of the multi-line text contained in the image to be recognized.
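The decoding step can be sketched as a greedy CTC decoder. The sketch follows the usual CTC convention of merging adjacent repeated classes before dropping CTC_Blank, so that a blank between two identical characters keeps both of them. The function name `ctc_greedy_decode` and the "?" placeholder for unknown characters are assumptions made for this example.

```python
def ctc_greedy_decode(prob_rows, chars, unknown_symbol="?"):
    """Decode classification vectors of length T+2 into a character string."""
    table_size = len(chars)       # T
    blank_index = table_size + 1  # CTC_Blank has class index T+1
    decoded = []
    previous = None
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # position of the maximum
        if best != previous and best != blank_index:
            decoded.append(chars[best] if best < table_size else unknown_symbol)
        previous = best
    return "".join(decoded)


# Word table "ab" (T = 2): classes are a, b, unknown, CTC_Blank.
rows = [
    [0.9, 0.05, 0.0, 0.05],  # a
    [0.8, 0.1, 0.0, 0.1],    # a (repeat, merged)
    [0.1, 0.1, 0.0, 0.8],    # CTC_Blank
    [0.7, 0.1, 0.1, 0.1],    # a (kept, separated by a blank)
    [0.1, 0.7, 0.1, 0.1],    # b
    [0.0, 0.1, 0.0, 0.9],    # CTC_Blank
]
text = ctc_greedy_decode(rows, "ab")
```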
[0089] In one embodiment, the trained text recognition model further includes a recurrent network layer; wherein, after performing feature dimension analysis on the image feature map through the data transformation layer to obtain the image matrix, the model further includes: performing sequence analysis on the image matrix through the recurrent network layer to obtain a target matrix vector; wherein, the target matrix vector is used for character classification through the classification layer.
[0090] In a practical implementation, the trained text recognition model can also be configured with a recurrent neural network (RNN) layer. The input to the RNN layer is M, which is expanded along its first dimension into a sequence of vectors of dimension $C_f$. This vector sequence is treated as the temporal input sequence of the RNN layer, and at each time step the network outputs a vector of dimension $C_r$. These vectors ultimately form a new column vector R, with a column length of $2 \times W_f$.
[0091] In one embodiment, the data transformation layer includes a dimension splitting network, a dimension exchange network, and a dimension merging network; wherein, the data transformation layer performs feature dimension analysis on the image feature map to obtain an image matrix, including: splitting the image feature map into dimensions using the dimension splitting network to obtain split image feature maps; exchanging the dimensions of the split image feature maps using the dimension exchange network to obtain exchanged image feature maps; and merging the dimensions of the exchanged image feature maps using the dimension merging network to obtain an image matrix.
[0092] In practice, dimension swapping changes the order of elements but not the total number of dimensions. For example, swapping the two dimensions of a 2x3 matrix [[1,2,3],[4,5,6]] results in a 3x2 matrix [[1,4],[2,5],[3,6]]. Dimension splitting and merging do not change the order of elements but increase or decrease the total number of dimensions. Dimension merging combines adjacent dimensions into one dimension; for instance, merging the two dimensions of the 3x2 matrix [[1,4],[2,5],[3,6]] results in [1,4,2,5,3,6]. Dimension splitting is the opposite. From a shape perspective, the transformation rules are as follows:
[0093] $[H_f, W_f, C_f] \rightarrow [2 \times W_f, C_w]$
[0094] Specifically, the left side of the transformation rule represents the three dimensions of the feature map F, and the right side represents the two dimensions $[2 \times W_f, C_w]$ of the matrix M, where $C_w = (H_f / 2) \times C_f$.
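The following sketch shows one way in which splitting, exchanging, and merging could turn a feature map of shape [H_f, W_f, C_f] into a matrix with 2*W_f rows and (H_f/2)*C_f columns. The intermediate shapes [2, H_f/2, W_f, C_f] and [2, W_f, H_f/2, C_f] are assumptions of this example and are not stated in the text above; the feature map is represented as a nested list so that only the Python standard library is needed.

```python
def split_exchange_merge(feature_map):
    """Transform a nested list of shape [H_f][W_f][C_f] into matrix M."""
    h_f = len(feature_map)
    if h_f % 2:
        raise ValueError("H_f must be a multiple of 2")
    half = h_f // 2
    w_f = len(feature_map[0])
    # Split: [H_f, W_f, C_f] -> [2, H_f/2, W_f, C_f] (upper and lower halves).
    split = [feature_map[:half], feature_map[half:]]
    # Exchange: [2, H_f/2, W_f, C_f] -> [2, W_f, H_f/2, C_f].
    exchanged = [
        [[block[h][w] for h in range(half)] for w in range(w_f)]
        for block in split
    ]
    # Merge: [2, W_f, H_f/2, C_f] -> [2*W_f, (H_f/2)*C_f].
    return [
        [value for cell in column for value in cell]
        for block in exchanged
        for column in block
    ]


# Example with H_f = 4, W_f = 3, C_f = 2; each value encodes its own position.
feature_map = [[[h * 100 + w * 10 + c for c in range(2)] for w in range(3)]
               for h in range(4)]
matrix = split_exchange_merge(feature_map)
```

Under this assumed ordering, the first W_f rows of M come from the upper half of the feature map and the next W_f rows from the lower half, so the two text lines are laid out one after the other along the sequence dimension.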
[0095] S204, Based on the character matching probability, determine the text characters of the multi-line text contained in the image to be recognized.
[0096] For specific implementation details, refer to Figure 5. By analyzing the character matching probability, such as "0.2, 0.35, 0.57, 0.22...0.9", the maximum probability is selected, and the sequence number corresponding to the maximum probability is used to look up the mapped character, thus determining the text characters of the multi-line text contained in the image to be recognized.
[0097] The image recognition method for multi-line text in the above embodiments involves a server acquiring an image to be recognized. Responding to the fact that the image contains multi-line text, the server normalizes the target image to obtain a normalized target image. This normalized target image is then input into a trained text recognition model, which outputs character matching probabilities. Based on these probabilities, the text characters of the multi-line text in the image can be determined. The trained text recognition model includes a data transformation layer for feature dimension analysis of the normalized target image. By detecting and recognizing multi-line text as a whole, hard detection of individual text lines in the image is avoided, thereby improving the accuracy of character recognition for multi-line text in an image.
[0098] It should be understood that, although the steps in the flowchart of Figure 2 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figure 2 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0099] To better implement the image recognition method containing multi-line text provided in the embodiments of this application, the embodiments of this application also provide, on the basis of that method, an image recognition device containing multi-line text. As shown in Figure 6, the image recognition device 600 containing multi-line text includes:
[0100] Image acquisition module 610 is used to acquire the image to be recognized;
[0101] Image processing module 620 is used to normalize the target image in response to the image to be recognized being a target image containing multiple lines of text, so as to obtain a normalized target image;
[0102] The text recognition module 630 is used to input the normalized target image into the trained text recognition model and output the character matching probability; wherein, the trained text recognition model includes a data transformation layer, which is used to perform feature dimension analysis on the normalized target image.
[0103] The character determination module 640 is used to determine the text characters of the multi-line text contained in the image to be recognized based on the character matching probability.
[0104] In one embodiment, the trained text recognition model includes a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer. The text recognition module 630 is further configured to input the normalized target image into the trained text recognition model, extract features from the normalized target image through the feature extraction layer to obtain an image feature map, perform feature dimension analysis on the image feature map through the data transformation layer to obtain an image matrix, classify characters in the image matrix through the classification layer to obtain a character classification vector, and perform loss analysis on the character classification vector through the connectionist temporal classification layer to obtain the character matching probability.
[0105] In one embodiment, the trained text recognition model further includes a recurrent network layer; the text recognition module 630 is also used to perform sequence analysis on the image matrix through the recurrent network layer to obtain a target matrix vector; wherein the target matrix vector is used for character classification through a classification layer.
[0106] In one embodiment, the data transformation layer includes a dimension splitting network, a dimension exchange network, and a dimension merging network; the text recognition module 630 is further configured to perform dimension splitting on the image feature map through the dimension splitting network to obtain the split image feature map; perform dimension exchange on the split image feature map through the dimension exchange network to obtain the exchanged image feature map; and perform dimension merging on the exchanged image feature map through the dimension merging network to obtain the image matrix.
[0107] In one embodiment, the image recognition device 600 containing multi-line text further includes a model training module for constructing an initial text recognition model. The text recognition model consists of a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer. The module acquires a set of multi-line text images and divides it into a training set and a test set. The multi-line text image set includes images of multiple labeled text characters. The text characters are determined by querying a preset character sequence mapping table. The initial text recognition model is trained using the training set to obtain a pre-trained text recognition model. The pre-trained text recognition model is tested and adjusted using the test set to obtain a trained text recognition model.
[0108] In one embodiment, the model training module is further configured to acquire multi-line text images, and annotate the multi-line text images with text characters to obtain annotated multi-line text images as candidate text images; analyze the image format, image size and / or image features of the candidate text images; select candidate text images that meet the preset model training conditions based on at least one of the image format, image size and image features as target text images; perform data augmentation on the target text images to obtain a set of multi-line text images.
[0109] In one embodiment, the image processing module 620 is further configured to invoke a trained text detection model, including the EAST model; input the image to be recognized into the trained text detection model to obtain the model output result; in response to the model output result being a multi-line text rectangle, determine that the image to be recognized is a target image containing multi-line text; and perform normalization processing on the target image based on a preset interpolation method to obtain a normalized target image.
[0110] In the above embodiments, it is proposed to detect and recognize multiple lines of text as a whole, which can avoid hard detection of text lines in the image and thus improve the character recognition accuracy of multiple lines of text contained in the image.
[0111] It should be noted that specific limitations regarding image recognition devices containing multi-line text can be found in the limitations of image recognition methods containing multi-line text described above, and will not be repeated here. Each module in the aforementioned image recognition device containing multi-line text can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the electronic device, or stored in the memory of the electronic device in software form, so that the processor can call and execute the corresponding operations of each module.
[0112] In one embodiment, the image recognition device 600 containing multi-line text can be implemented as a computer program that runs on the computer device shown in Figure 7. The memory of the computer device can store the various program modules that make up the image recognition device 600, for example, the image acquisition module 610, image processing module 620, text recognition module 630, and character determination module 640 shown in Figure 6; the computer program composed of these program modules causes the processor to execute the steps in the image recognition method containing multi-line text described in the various embodiments of this application. For example, the computer device shown in Figure 7 can execute step S201 via the image acquisition module 610 of the image recognition device 600 shown in Figure 6. The computer device can execute step S202 via the image processing module 620. The computer device can execute step S203 via the text recognition module 630. The computer device can execute step S204 via the character determination module 640. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communication with external computer devices via a network connection. When the computer program is executed by the processor, it implements an image recognition method containing multi-line text.
[0113] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0114] In one embodiment, a computer device is provided, including one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to perform the steps of the image recognition method containing multi-line text described above. The steps of the image recognition method containing multi-line text described herein can be the steps of the image recognition methods containing multi-line text described in the above embodiments.
[0115] In one embodiment, a computer-readable storage medium is provided storing a computer program, which is loaded by a processor to cause the processor to perform the steps of the image recognition method containing multi-line text described above. The steps of the image recognition method containing multi-line text described herein can be steps from the image recognition methods containing multi-line text described in the various embodiments above.
[0116] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.
[0117] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0118] The foregoing has provided a detailed description of an image recognition method, apparatus, and computer device containing multi-line text provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. An image recognition method containing multi-line text, characterized in that it comprises: acquiring an image to be recognized; calling a trained text detection model; inputting the image to be recognized into the trained text detection model to obtain a model output result; in response to the model output result being a multi-line text rectangle, determining that the image to be recognized is a target image containing multi-line text; in response to the image to be recognized being a target image containing multi-line text, normalizing the target image to obtain a normalized target image; inputting the normalized target image into a trained text recognition model and outputting character matching probabilities, wherein the trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image; and determining, based on the character matching probabilities, the text characters of the multi-line text contained in the image to be recognized.
2. The method as described in claim 1, characterized in that, The trained text recognition model includes a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer; among which, The step of inputting the normalized target image into the trained text recognition model and outputting the character matching probability includes: The normalized target image is input into the trained text recognition model, and the feature extraction layer extracts features from the normalized target image to obtain an image feature map. The image feature map is analyzed by the data transformation layer to obtain the image matrix; The image matrix is classified using the classification layer to obtain a character classification vector; The character matching probability is obtained by performing loss analysis on the character classification vector through the connectionist temporal classification layer.
3. The method as described in claim 2, characterized in that, The trained text recognition model also includes a recurrent network layer; wherein, After performing feature dimension analysis on the image feature map through the data transformation layer to obtain the image matrix, the method further includes: The image matrix is subjected to sequence analysis by the recurrent network layer to obtain a target matrix vector; wherein the target matrix vector is used for character classification by the classification layer.
4. The method as described in claim 3, characterized in that the data transformation layer includes a dimension splitting network, a dimension exchange network, and a dimension merging network; wherein the step of performing feature dimension analysis on the image feature map through the data transformation layer to obtain an image matrix includes: performing dimension splitting on the image feature map using the dimension splitting network to obtain a split image feature map; performing dimension exchange on the split image feature map using the dimension exchange network to obtain an exchanged image feature map; and performing dimension merging on the exchanged image feature map using the dimension merging network to obtain the image matrix.
5. The method as described in claim 1, characterized in that, Before inputting the normalized target image into the trained text recognition model, the method further includes: An initial text recognition model is constructed; the text recognition model consists of a feature extraction layer, a data transformation layer, a classification layer, and a connectionist temporal classification layer; A multi-line text image set is obtained, and the multi-line text image set is divided into a training set and a test set; the multi-line text image set includes images of multiple labeled text characters; the text characters are determined by querying a preset character sequence mapping table; The initial text recognition model is trained using the training set to obtain a pre-trained text recognition model. The pre-trained text recognition model is tested and adjusted using the test set to obtain the trained text recognition model.
6. The method as described in claim 5, characterized in that obtaining the multi-line text image set includes: acquiring a multi-line text image, and annotating text characters on the multi-line text image to obtain an annotated multi-line text image, which serves as a candidate text image; analyzing the image format, image size, and/or image features of the candidate text images; selecting, as target text images, the candidate text images that meet preset model training conditions based on at least one of the image format, image size, and image features; and performing data augmentation on the target text images, and obtaining the multi-line text image set by compiling the results.
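The candidate screening in claim 6 can be sketched as a filter over image metadata. The sketch below checks only image format and image size; image features are left out. The allowed formats, the minimum dimensions, and the names `Candidate`, `meets_training_conditions`, and `select_targets` are hypothetical, since the claim does not state the preset model training conditions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    path: str     # location of the annotated image
    fmt: str      # image format, e.g. "jpg"
    width: int    # image width in pixels
    height: int   # image height in pixels

ALLOWED_FORMATS = {"jpg", "png"}   # hypothetical preset condition
MIN_WIDTH, MIN_HEIGHT = 64, 16     # hypothetical preset condition

def meets_training_conditions(candidate):
    """Check format and size against the assumed preset conditions."""
    return (
        candidate.fmt.lower() in ALLOWED_FORMATS
        and candidate.width >= MIN_WIDTH
        and candidate.height >= MIN_HEIGHT
    )

def select_targets(candidates):
    """Keep only the candidate text images that pass the check."""
    return [c for c in candidates if meets_training_conditions(c)]
```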
7. The method as described in claim 1, characterized in that the trained text detection model includes the EAST model; and the step of, in response to the image to be recognized being a target image containing multi-line text, normalizing the target image to obtain a normalized target image includes: normalizing the target image based on a preset interpolation method to obtain the normalized target image.
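One way to read the interpolation-based normalization of claim 7 is as resizing every target image to a fixed size. The sketch below assumes bilinear interpolation with corner alignment on a grayscale image stored as nested lists. The claim names only "a preset interpolation method", so the choice of bilinear interpolation and the function name `resize_bilinear` are assumptions.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a grayscale image (list of rows) to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Map the output row back to a fractional input row.
        sy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for x in range(out_w):
            # Map the output column back to a fractional input column.
            sx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            # Blend horizontally on both rows, then vertically.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

For example, `resize_bilinear(image, 32, 256)` would bring every input to the same 32 x 256 shape, where both numbers are illustrative values.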
8. An image recognition device containing multi-line text, characterized in that it includes: an image acquisition module, used to acquire the image to be recognized; an image processing module, used to call the trained text detection model, input the image to be recognized into the trained text detection model to obtain a model output result, determine, in response to the model output result being a multi-line text rectangle, that the image to be recognized is a target image containing multi-line text, and perform, in response to the image to be recognized being a target image containing multi-line text, normalization processing on the target image to obtain a normalized target image; a text recognition module, used to input the normalized target image into a trained text recognition model and output the character matching probability, wherein the trained text recognition model includes a data transformation layer for performing feature dimension analysis on the normalized target image; and a character determination module, used to determine the text characters of the multi-line text contained in the image to be recognized based on the character matching probability.
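The character determination module maps character matching probabilities to text characters. The sketch below assumes greedy best-path decoding in the style of connectionist temporal classification: take the most probable index at each step, collapse repeats, and drop blanks. The claim does not specify the decoding rule, and the names `ctc_greedy_decode`, `index_to_char`, and `blank` are hypothetical.

```python
def ctc_greedy_decode(probs, index_to_char, blank=0):
    """Decode per-step probability vectors into a character string."""
    decoded = []
    previous = blank
    for step in probs:
        # Index of the most probable class at this step.
        best = max(range(len(step)), key=step.__getitem__)
        # Emit only on a change to a non-blank class.
        if best != blank and best != previous:
            decoded.append(index_to_char[best])
        previous = best
    return "".join(decoded)
```

A blank between two identical indices is what lets a genuinely repeated character, such as the "11" in a logistics code, survive the collapse.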
9. A computer device, characterized in that the computer device includes: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to implement the image recognition method containing multi-line text as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program, which is loaded by a processor to perform the steps of the image recognition method containing multi-line text as described in any one of claims 1 to 7.