Text recognition method and device, electronic equipment and storage medium

By extracting image frame sequences with the same text content from videos and combining them with multimodal features for text recognition, the problem of low text recognition accuracy in video scenarios is solved, achieving higher text recognition accuracy.

CN115565109B · Active · Publication Date: 2026-06-19 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
IFLYTEK CO LTD
Filing Date
2022-10-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The accuracy of text recognition in video scenarios in existing technologies is not high, mainly because the text in dynamic videos may change in size, angle, position, clarity and background, making it difficult to extract high-quality keyframes.

Method used

By extracting text features from image frame sequences with the same text content, and combining positional features, semantic features, and visual features, a pre-trained text recognition model is used to perform text recognition on the image frame sequence. In the recognition process, multimodal information from the previous frame is incorporated to perform text recognition frame by frame.

Benefits of technology

It improves the accuracy of text recognition in video scenarios, reduces interference from image frames with different text content, and increases the amount of information, thereby improving the accuracy of text recognition.

✦ Generated by Eureka AI based on patent content.

Figure CN115565109B_ABST

Abstract

This application proposes a text recognition method, apparatus, electronic device, and storage medium that can recognize the text content of multiple image frame sequences with the same text content, reduce interference from other image frames with different text content, and improve the accuracy of text recognition in video scenes. Moreover, when performing text recognition on the current frame, it can combine modal information such as positional features, semantic features, and visual features of the previous frame to increase the amount of information in the text recognition process, thereby further improving the accuracy of text recognition in video scenes.

Description

Technical Field

[0001] This application relates to the field of text recognition technology, and in particular to a text recognition method, apparatus, electronic device and storage medium.

Background Technology

[0002] With the rapid development of text recognition technology, the accuracy of text recognition for images has significantly improved. However, research on text recognition for videos is still in its early stages. The common approach involves extracting keyframes from the video and then using them for text recognition, which results in low accuracy. Therefore, improving the accuracy of text recognition in video scenarios is a crucial technical problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0003] Based on the above requirements, this application proposes a text recognition method, apparatus, electronic device, and storage medium, which can improve the accuracy of text recognition in video scenarios.

[0004] The technical solution proposed in this application is as follows:

[0005] On the one hand, this application provides a text recognition method, including:

[0006] S1. Extract the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content; where N is a positive integer;

[0007] S2. Based on the text features of the Nth image frame, perform text recognition on the (N+1)th image frame in the image frame sequence to obtain the recognized text corresponding to the (N+1)th image frame;

[0008] S3. Let N = N + 1, and repeat steps S1 and S2 until N equals the sequence number of the image frame sequence. Then, determine the recognized text corresponding to the image frame sequence based on the recognized text corresponding to each image frame.

[0009] Furthermore, in the above-described method, if N=1, the method further includes:

[0010] Text recognition is performed on the first image frame in the image frame sequence to obtain the recognized text corresponding to the first image frame.

[0011] Furthermore, in the method described above, the text features of the Nth image frame include at least one of the positional features, semantic features, and visual features of the text in the Nth image frame.

[0012] Furthermore, in the method described above, extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence with the same text content includes:

[0013] Extract feature information from the Nth image frame in a sequence of image frames with the same text content, wherein the feature information includes at least one of the text's positional features, semantic features, and visual features;

[0014] The extracted feature information is fused to obtain the text features of the Nth image frame.

[0015] Furthermore, in the method described above, extracting the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content, and performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognized text corresponding to the (N+1)th image frame, includes:

[0016] A sequence of image frames with the same text content is input into a pre-trained text recognition model, so that the text recognition model extracts the text features of the Nth image frame from the Nth image frame in the image frame sequence, and performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, so as to obtain the recognized text corresponding to the (N+1)th image frame.

[0017] Furthermore, in the method described above, determining the recognition text corresponding to the image frame sequence based on the recognition text corresponding to each image frame includes:

[0018] Detect the confidence level of the recognized text corresponding to each image frame;

[0019] The text with the highest confidence level is determined as the text corresponding to the image frame sequence.

[0020] Furthermore, the method described above also includes: the image frame sequence with the same text content is extracted from a video.

[0021] Furthermore, in the method described above, the process of extracting the image frame sequence with the same text content includes:

[0022] Video frames with text content overlap exceeding a set overlap threshold are extracted from the video and used to form a video frame sequence.

[0023] Extract a sequence of image frames with the same text content from the video frame sequence.

[0024] Furthermore, in the above-described method, extracting video frames from the video whose text content overlap is higher than a set overlap threshold includes:

[0025] For each text group in each video frame of the video, perform text recognition to obtain the recognized character corresponding to each text group;

[0026] Traverse the recognition characters corresponding to each text group in adjacent video frames to determine the number of target text groups with the same recognition characters in the adjacent video frames;

[0027] If the number of target text groups in the adjacent video frames reaches a set condition, then the adjacent video frames are determined to be video frames whose text content overlap is higher than a set overlap threshold.

[0028] Furthermore, in the method described above, if the number of target text groups in adjacent video frames reaches a set condition, then the adjacent video frames are determined to be video frames whose text content overlap is higher than a set overlap threshold, including:

[0029] Calculate the ratio of the number of target text groups in the adjacent video frames to the maximum number of text groups in the adjacent video frames;

[0030] If the ratio is greater than a set value, it indicates that the number of target text groups in the adjacent video frames has reached the set condition, and the adjacent video frames are determined to be video frames with a text content overlap higher than the set overlap threshold.
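
The ratio test in [0029] and [0030] can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the function name `frames_overlap`, the representation of a frame as a list of recognized strings (one per text group), and the default value 0.5 for the set value are all assumptions. Counting shared text groups by multiset intersection is one possible reading of "text groups with the same recognition characters".

```python
from collections import Counter


def frames_overlap(groups_a, groups_b, min_ratio=0.5):
    """Return True if two adjacent frames share enough text groups.

    groups_a, groups_b: recognized characters of each text group in the
    two frames (hypothetical representation: one string per group).
    min_ratio: the "set value" from the text; 0.5 is an assumed default.
    """
    if not groups_a or not groups_b:
        return False  # a frame without text groups cannot overlap
    # Target text groups: groups whose recognized characters occur in both frames.
    shared = Counter(groups_a) & Counter(groups_b)
    target_count = sum(shared.values())
    # Denominator: the larger of the two frames' text-group counts.
    max_groups = max(len(groups_a), len(groups_b))
    # The text requires the ratio to be strictly greater than the set value.
    return target_count / max_groups > min_ratio
```

For instance, two frames that share two of their three text groups give a ratio of 2/3, which passes an assumed set value of 0.5, while frames that share one of two groups give exactly 0.5 and do not pass.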

[0031] Furthermore, in the method described above, extracting an image frame sequence with the same text content from the video frame sequence includes:

[0032] Identify text in adjacent video frames whose distance from the target text group is within a set distance threshold range;

[0033] If the text in the adjacent video frames is the same as the target text group at a distance within a set distance threshold, then the region where the target text group is located is extracted from the video frame sequence to form the image frame sequence.

[0034] Furthermore, the method described above also includes: combining the recognized text corresponding to all image frame sequences in the video that have the same text content to obtain the recognized text corresponding to the video.

[0035] On the other hand, this application also provides a text recognition device, including:

[0036] The extraction module is used to perform step S1, extracting the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content; where N is a positive integer;

[0037] The recognition module is used to execute step S2, which involves performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, to obtain the recognized text corresponding to the (N+1)th image frame.

[0038] The repeat module is used to execute step S3, let N = N + 1, and control the extraction module to repeatedly execute the above step S1 and the recognition module to repeatedly execute the above step S2 until N is equal to the sequence number of the image frame sequence. Then, based on the recognition text corresponding to each image frame, the recognition text corresponding to the image frame sequence is determined.

[0039] On the other hand, this application also provides an electronic device, including:

[0040] Memory and processor;

[0041] The memory is used to store programs;

[0042] The processor is configured to implement any of the above-described text recognition methods by running a program in the memory.

[0043] On the other hand, this application also provides a storage medium, including: a computer program stored on the storage medium, wherein when the computer program is executed by a processor, it implements the text recognition method described in any one of the above.

[0044] The text recognition method proposed in this application recognizes the text content of multiple image frame sequences with the same text content. This reduces interference from other image frames with different text content, thus improving the accuracy of text recognition in video scenes. Furthermore, when recognizing text in the current frame, it can combine modal information such as positional features, semantic features, and visual features of the previous frame to increase the amount of information in the text recognition process, thereby further improving the accuracy of text recognition in video scenes.

Attached Figure Description

[0045] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0046] Figure 1 is a flowchart illustrating a text recognition method provided in an embodiment of this application;

[0047] Figure 2 is a schematic diagram of the process for extracting text features from the Nth image frame according to an embodiment of this application;

[0048] Figure 3 is a schematic diagram of the processing flow of a text recognition model for a set of training samples provided in an embodiment of this application;

[0049] Figure 4 is a schematic diagram of the process for extracting image frame sequences from a video according to an embodiment of this application;

[0050] Figure 5 is a schematic diagram of the process for extracting video frame sequences provided in an embodiment of this application;

[0051] Figure 6 is a schematic diagram illustrating text recognition of a set character provided in an embodiment of this application;

[0052] Figure 7 is a schematic diagram of the target text group for verification provided in an embodiment of this application;

[0053] Figure 8 is a schematic diagram of the structure of a text recognition device provided in an embodiment of this application;

[0054] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0055] Application Overview

[0056] The technical solutions of this application are applicable to application scenarios of text recognition in video, and the use of the technical solutions of this application can improve the accuracy of text recognition in video scenarios.

[0057] In recent years, text recognition technology has developed rapidly, and the accuracy of text recognition in images has significantly improved. For example, using Optical Character Recognition (OCR) technology, text in images can be recognized with high accuracy.

[0058] However, research on text recognition in videos is still in its early stages. Current technologies for recognizing text content in videos typically involve extracting keyframes, transforming the problem of recognizing text content from the video into recognizing text content from keyframes. This is then done using techniques for text recognition applied to images. Therefore, the accuracy of text recognition in videos currently depends heavily on the extraction of high-quality keyframes. In other words, extracting clear and complete keyframes leads to higher accuracy. However, compared to static images, text in dynamic videos may change in size, angle, position, clarity, and background, making it difficult to extract high-quality keyframes and affecting the accuracy of text recognition.

[0059] Based on this, this application proposes a text recognition method, apparatus, electronic device, and storage medium. This technical solution can perform text recognition on multiple image frame sequences with the same text content, thereby improving the accuracy of text recognition in video scenes. Moreover, when performing text recognition on the current frame, it can combine the multimodal information of the previous frame image to increase the amount of information in the text recognition process, further improving the accuracy of text recognition in video scenes.

[0060] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0061] Exemplary methods

[0062] This application proposes a text recognition method, which can be executed by an electronic device. The electronic device can be any device with data and instruction processing capabilities, such as a computer, smart terminal, or server. As shown in Figure 1, the method includes:

[0063] S101. Extract the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content.

[0064] The aforementioned image frame sequence contains multiple image frames, all of which contain the same text content. An image frame can be a complete video frame or a portion of a video frame. A video frame is the smallest unit that makes up a video; a single video frame is a still image, and consecutive video frames form a dynamic video.

[0065] The image frame sequence is extracted from a video containing text content to be identified. In this embodiment, the language and font of the text content to be identified are not limited. For example, the language of the text content to be identified can be Chinese or English, and the font can be a handwritten font or a printed font such as Song, Kai, Hei, or Lishu.

[0066] The electronic device executing the text recognition method of this embodiment can download and obtain a video containing the text content to be recognized from a server; or it can download and obtain a video containing the text content to be recognized from an intermediate storage device. The intermediate storage device can be any device with storage function, such as a USB flash drive, memory card, etc. If the electronic device executing the text recognition method of this embodiment includes a camera, the video containing the text content to be recognized can be obtained through the camera of the electronic device. This embodiment does not impose any limitations.

[0067] After acquiring a video containing the text to be identified, a sequence of image frames with the same text content is extracted from the video. Specifically, in this embodiment, each video frame is analyzed, and a sequence of image frames with the same text content is extracted from the video frames.

[0068] For example, each video frame can be analyzed in units of text groups to determine whether there are text groups with the same text content in different video frames. A text group can be a set length of text content; for example, a text group can include a line of text, a column of text, or a paragraph of text, etc., and this embodiment is not limited to this. If text groups with the same text content exist in the video frames, the regions containing the text groups are extracted from the video frames to form an image frame sequence.

[0069] In determining whether text groups with the same text content exist in different video frames, image features of each text group in each video frame can be extracted. Image features include at least one of color features, texture features, shape features, and spatial relationship features. Text groups with the same image features in different video frames are identified as text groups with the same text content. Image features of each text group in each video frame can be extracted using network models such as VGGNet and ResNet; this embodiment does not impose limitations on this method.

[0070] When determining whether text groups with the same text content exist in different video frames, it is also possible to identify several texts located at the same predetermined position within each text group. The predetermined position and the number of texts at the predetermined position can be set according to actual conditions; this embodiment does not impose any limitations. It should be noted that the number of texts at the predetermined position should be minimized to reduce the workload during text group recognition and improve recognition efficiency. For example, recognizing the first two characters or the last two characters of each text group. Text groups with several identical texts located at the same position are identified as text groups with the same text content.
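
The positional comparison in [0070] can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `key_chars` and `same_text_group` are hypothetical, and plain strings stand in for the output of a recognizer that, in practice, would read only the characters at the predetermined position rather than the whole text group. The default of two characters follows the "first two characters or the last two characters" example in the text.

```python
def key_chars(text, count=2, from_end=False):
    """Return the characters at the predetermined position of a text group."""
    if count <= 0:
        raise ValueError("count must be positive")
    return text[-count:] if from_end else text[:count]


def same_text_group(text_a, text_b, count=2, from_end=False):
    """Treat two text groups as the same if their key characters match.

    Groups shorter than `count` characters are never matched, since the
    predetermined position is not fully populated.
    """
    key_a = key_chars(text_a, count, from_end)
    key_b = key_chars(text_b, count, from_end)
    return len(key_a) == count and key_a == key_b
```

Keeping `count` small mirrors the point made above: fewer characters to recognize per text group means less work, at the cost of occasionally matching groups that differ elsewhere.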

[0071] Another example is to analyze each frame of the video, first detecting the overlap of text content between video frames. If the overlap of text content in several video frames exceeds a set value, these video frames with overlap exceeding the set value can be grouped into a video frame sequence. Then, an image frame sequence with the same text content can be extracted from the video frame sequence. Compared to extracting text groups with the same text content from all video frames, first extracting video frame sequences from all video frames and then extracting image frame sequences from the video frame sequence reduces the workload of text recognition and thus improves the overall recognition efficiency.
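
The two-stage idea in [0071] starts by grouping video frames into video frame sequences. The sketch below shows one way to do that, assuming that overlap is judged between consecutive frames and that a caller supplies the overlap test as a function; the name `group_into_sequences` and the `overlaps` parameter are hypothetical, and the source does not prescribe this particular grouping strategy.

```python
def group_into_sequences(frames, overlaps):
    """Split frames into runs of consecutive frames whose text overlaps.

    frames: video frames in playback order (any representation).
    overlaps(prev_frame, frame) -> bool: caller-supplied test for whether
        the text content overlap of two adjacent frames exceeds the set value.
    Returns a list of video frame sequences (lists of frames).
    """
    sequences = []
    current = []
    for frame in frames:
        # Start a new sequence when the overlap with the previous frame is too low.
        if current and not overlaps(current[-1], frame):
            sequences.append(current)
            current = []
        current.append(frame)
    if current:
        sequences.append(current)
    return sequences
```

Runs containing a single frame are kept in the result; a caller that only wants multi-frame sequences can filter them out afterwards.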

[0072] In this embodiment, after extracting the image frame sequence with the same text content, text recognition is performed on the image frame sequence. Specifically, in this embodiment, the text features of the Nth image frame are extracted from the Nth image frame in the image frame sequence with the same text content, where N is a positive integer.

[0073] The aforementioned text features include at least one of the positional features, semantic features, and visual features of the text in the Nth image frame. Extracting positional features, semantic features, and visual features from images using neural network models is a well-established and mature technique in this field. Those skilled in the art can refer to existing descriptions to extract the text features of the Nth image frame from a sequence of image frames with the same text content.

[0074] For example, as shown in Figure 2, the text content in the Nth image frame A is handwritten Chinese. Suppose the text features of the Nth image frame A include the position feature, semantic feature, and visual feature of the text in that frame. The position of each character in the Nth image frame A can be detected by a character detection model, and the position feature of the text can then be extracted through position embedding based on those character positions. The output of the backbone of the character detection model is used as the visual feature of the text in the Nth image frame A. The Nth image frame A is fed into a recognition model to obtain the recognition result "I am Xiaoming", and this recognition result is fed into the BERT model, whose output is the semantic feature of the text in the Nth image frame A.

[0075] To train the character detection model, images containing text content can be used as training samples, with the position of each character in a training sample as its label; training is complete when the loss value of the character detection model falls below a set value. Likewise, to train the recognition model, images containing text content can be used as training samples, with the recognized text corresponding to the text content in a training sample as its label; training is complete when the loss value of the recognition model falls below a set value.

[0076] It should be noted that if the text features of the Nth image frame include the position feature, semantic feature, and visual feature of the text in the Nth image frame, these three features can be used directly as the text features. Alternatively, as shown in Figure 2, the position feature, semantic feature, and visual feature can first be fused, and the fused feature used as the text feature. Similarly, if the text features of the Nth image frame include only two of the position feature, semantic feature, and visual feature, these two features can be used directly as the text features, or they can first be fused and the fused feature used as the text feature.
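
The fusion step can be sketched as follows. The source does not specify the fusion operation, so concatenation is used here purely as an assumption; the function name `fuse_features` and the use of plain lists of floats for feature vectors are likewise hypothetical.

```python
def fuse_features(position=None, semantic=None, visual=None):
    """Fuse whichever feature vectors are present into one text feature.

    Each argument is a list of floats or None. Concatenation is an assumed
    fusion operation; the source only says the features are fused.
    """
    parts = [p for p in (position, semantic, visual) if p is not None]
    if not parts:
        raise ValueError("at least one feature vector is required")
    fused = []
    for part in parts:
        fused.extend(part)  # keep a fixed order: position, semantic, visual
    return fused
```

Because absent features are simply skipped, the same function covers the three-feature case and the two-feature case described above.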

[0077] In this embodiment, the initial value of N is 1, that is, in this embodiment, the text features of the first image frame are first extracted from the first image frame in the image frame sequence with the same text content.

[0078] S102. Based on the text features of the Nth image frame, perform text recognition on the (N + 1)th image frame in the image frame sequence to obtain the recognized text corresponding to the (N + 1)th image frame.

[0079] Based on the above steps, after extracting the text features of the Nth image frame in the image frame sequence, text recognition can be performed on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, so as to combine the text features of the Nth image frame to determine the recognition text corresponding to the (N+1)th image frame.

[0080] The recognized text for the (N+1)th image frame can be obtained using a pre-trained text content recognition model. Specifically, the text features of the Nth image frame and the (N+1)th image frame are input into the text content recognition model, so that the model performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame and outputs the recognized text for the (N+1)th image frame.

[0081] The training samples for the text content recognition model are the (n+1)th sample image frame in the sample image frame sequence, and the text features of the nth sample image frame. The label is the recognized text corresponding to the nth sample image frame. When training the text content recognition model, the training samples are input into the model. Based on the recognition results and labels, the cross-entropy loss of the text content recognition model is calculated. The backpropagation (BP) algorithm is used to update the model parameters. This training process is repeated until the cross-entropy loss of the text content recognition model is less than a set value, at which point the text content recognition model training is complete.
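
The loss and stopping criterion named in [0081] can be illustrated with a small calculation. The source names cross-entropy loss and a set value but gives no formula, so the per-character averaging, the function names `cross_entropy` and `training_finished`, and the 1e-12 clamp that avoids taking the logarithm of zero are all assumptions.

```python
import math


def cross_entropy(predicted_probs, target_indices):
    """Mean cross-entropy over the character positions of one sample.

    predicted_probs: one probability distribution (list of floats) per
        character position, as output by the model.
    target_indices: index of the label character at each position.
    """
    if not predicted_probs or len(predicted_probs) != len(target_indices):
        raise ValueError("need one target index per predicted distribution")
    losses = [
        -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)
        for probs, target in zip(predicted_probs, target_indices)
    ]
    return sum(losses) / len(losses)


def training_finished(loss, threshold):
    """Stopping rule from the text: stop once the loss is below the set value."""
    return loss < threshold
```

In an actual training loop, the loss would be backpropagated to update the model parameters after each step, and `training_finished` would be checked before the next iteration.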

[0082] For example, if N is 1, after extracting the text features of the first image frame, the text recognition of the second image frame is performed based on the text features of the first image frame and the second image frame in the image frame sequence to obtain the recognized text corresponding to the second image frame.

[0083] S103. Detect whether N+1 is equal to the sequence number of the image frame sequence; if N+1 is less than the sequence number, let N = N+1 and repeat step S101; if N+1 is equal to the sequence number, execute step S104.

[0084] The sequence number of an image frame sequence is the number of image frames contained in the sequence. For example, if the sequence number of an image frame sequence is 20, it means that the image frame sequence contains 20 image frames.

[0085] In this step, it is checked whether the value of N+1 is equal to the sequence number of the image frame sequence. If the value of N+1 is less than the sequence number, then N = N+1, and step S101 is repeated. If the value of N+1 is equal to the sequence number, then step S104 is executed.

[0086] For example, if the sequence number of the image frame sequence is 20 and N is 1, in this embodiment, the text features of the first image frame are extracted. After the text features of the first image frame are extracted, text recognition is performed on the second image frame based on the text features of the first image frame and the second image frame in the image frame sequence, to obtain the recognized text corresponding to the second image frame.

[0087] At this point, N is 1 and N+1 is 2. Since the value of N+1 is less than the sequence number of the image frame sequence, let N = 2 and repeat step S101 to extract the text features of the second image frame. Based on the text features of the second image frame and the third image frame in the image frame sequence, perform text recognition on the third image frame to obtain the recognized text corresponding to the third image frame.

[0088] At this point, N is 2 and N+1 is 3. Since the value of N+1 is less than the sequence number of the image frame sequence, let N = 3 and repeat step S101 to extract the text features of the third image frame. Based on the text features of the third image frame and the fourth image frame in the image frame sequence, perform text recognition on the fourth image frame to obtain the recognized text corresponding to the fourth image frame.

[0089] At this point, N is 3 and N+1 is 4. Since the value of N+1 is less than the sequence number of the image frame sequence, let N = 4 and repeat step S101. This process is repeated in a loop.

[0090] The above steps are repeated until N is 18. At that point N+1 is 19, which is less than the sequence number of the image frame sequence, so let N = 19 and repeat step S101 to extract the text features of the nineteenth image frame. Based on the text features of the nineteenth image frame and the twentieth image frame in the image frame sequence, text recognition is performed on the twentieth image frame to obtain the recognized text corresponding to the twentieth image frame.

[0091] At this point, N is 19 and N+1 is 20. The value of N+1 is equal to the sequence number of the image frame sequence, which is 20. The image frame recognition in the image frame sequence is completed, and step S104 is executed.
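
The loop formed by steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the model described in the application: `extract_features`, `recognize_with_context`, and `recognize_first` are hypothetical callables standing in for the feature extractor, the text content recognition model, and an optional standalone recognizer for the first frame.

```python
def recognize_sequence(frames, extract_features, recognize_with_context,
                       recognize_first=None):
    """Recognize the frames of a same-text sequence, frame by frame.

    frames: image frames that all show the same text content.
    extract_features(frame) -> text features of that frame (S101).
    recognize_with_context(features, frame) -> (text, confidence) (S102).
    recognize_first(frame) -> (text, confidence); optional, since the
        first frame has no preceding frame to take features from.
    Returns a list of (text, confidence), one entry per recognized frame.
    """
    results = []
    if not frames:
        return results
    if recognize_first is not None:
        results.append(recognize_first(frames[0]))
    total = len(frames)   # the "sequence number" of the image frame sequence
    n = 1                 # N is 1-based, as in the text
    while n + 1 <= total:  # S103: the last pass recognizes frame number `total`
        features = extract_features(frames[n - 1])                    # S101: frame N
        results.append(recognize_with_context(features, frames[n]))  # S102: frame N+1
        n += 1
    return results
```

With a sequence of 20 frames, the loop body runs 19 times: the first pass uses the features of frame 1 to recognize frame 2, and the last pass uses the features of frame 19 to recognize frame 20, matching the walkthrough above.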

[0092] It should be noted that when the first image frame in the image frame sequence is recognized, i.e., N=1, the accuracy may be low because no text features from a preceding frame are available. Therefore, when N=1, text recognition of the first image frame can be omitted, and only its text features extracted to assist in text recognition of the second image frame. Alternatively, text recognition can be performed on the first image frame by itself to obtain the recognized text corresponding to the first image frame; this embodiment does not impose any limitations.

[0093] S104. Determine the recognized text corresponding to the image frame sequence based on the recognized text corresponding to each image frame.

[0094] In the embodiments of this application, the recognized text corresponding to the image frame sequence is determined from the recognized text corresponding to all image frames in the image frame sequence.

[0095] Specifically, after inputting the text features of the Nth image frame and the (N+1)th image frame into the text content recognition model, the model outputs not only the recognized text corresponding to the (N+1)th image frame, but also the confidence level of that recognized text. In this embodiment, the confidence level of the recognized text corresponding to each image frame can be detected, and the recognized text with the highest confidence level is taken as the recognized text corresponding to the image frame sequence.

[0096] In one embodiment, text recognition is performed on the first image frame. For example, the recognition model described above is used to perform text recognition on the first image frame. The recognition model outputs not only the recognized text corresponding to the first image frame but also the confidence level of that recognized text. In another embodiment, text recognition may not be performed on the first image frame, in which case the confidence level of the recognized text for the first image frame is zero.
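
The selection rule in step S104 can be sketched as follows, assuming each frame's result is available as a `(text, confidence)` pair; the function name `select_sequence_text` and that pair representation are hypothetical. A first frame that was not recognized can be represented with a confidence of zero, as described above, so it never wins over a recognized frame with positive confidence.

```python
def select_sequence_text(results):
    """Pick the recognized text with the highest confidence.

    results: (text, confidence) pairs, one per image frame in the sequence.
    On a tie, the earliest frame's text is returned.
    """
    if not results:
        raise ValueError("no recognition results to choose from")
    best_text, _ = max(results, key=lambda item: item[1])
    return best_text
```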

[0097] By concatenating the recognized text corresponding to all the image frame sequences in the video, the recognized text of the video can be obtained.

[0098] In the above embodiments, text content recognition is performed on multiple image frame sequences with the same text content. This reduces interference from other image frames with different text content, improving the accuracy of text recognition in video scenes. Furthermore, when performing text recognition on the current frame, modal information such as positional features, semantic features, and visual features of the previous frame can be combined to increase the amount of information in the text recognition process, thereby further improving the accuracy of text recognition in video scenes.

[0099] As an optional implementation, another embodiment of this application discloses that, if N=1, the text recognition method of the above embodiments may further include the following steps:

[0100] Text recognition is performed on the first image frame in the image frame sequence to obtain the recognized text corresponding to the first image frame.

[0101] In this embodiment, text recognition is performed on the first image frame. For example, the recognition model of the above embodiment can be used to perform text recognition on the first image frame to obtain the recognized text corresponding to the first image frame; alternatively, mature OCR technology in the prior art can be used to perform text recognition on the first image frame to obtain the recognized text corresponding to the first image frame. This embodiment does not limit the scope of the method.

[0102] In the embodiments of this application, text recognition is performed on the first image frame to obtain the recognized text, which can effectively avoid missing information in the first image frame during the process of text recognition of the image frame sequence.

[0103] As an optional implementation, another embodiment of this application discloses that the steps of the above embodiments to extract the text features of the Nth image frame from the Nth image frame in an image frame sequence having the same text content include:

[0104] Extract feature information from the Nth image frame in a sequence of image frames with the same text content. The feature information includes at least one of the text's positional features, semantic features, and visual features. Fuse the extracted feature information to obtain the text features of the Nth image frame.

[0105] In the embodiments of this application, at least one of the positional features, semantic features, and visual features of the text is extracted from the Nth image frame as feature information. Then, the feature information extracted from the Nth image frame is fused, and the fused feature obtained after the fusion process is determined as the text feature of the Nth image frame.

[0106] Feature information can be fused by splicing or by direct addition; this embodiment does not limit the method.
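The two fusion options named above can be sketched as follows. This is a minimal illustration only: the function name `fuse_features`, the mode names, and the toy two-dimensional vectors are assumptions made for the example, not part of the disclosed embodiment.

```python
from typing import Dict, List


def fuse_features(features: Dict[str, List[float]], mode: str = "concat") -> List[float]:
    """Fuse per-modality feature vectors into one text feature (hypothetical helper)."""
    vectors = list(features.values())
    if not vectors:
        raise ValueError("at least one feature vector is required")
    if mode == "concat":
        # Splicing: join the vectors end to end.
        fused: List[float] = []
        for vector in vectors:
            fused.extend(vector)
        return fused
    if mode == "add":
        # Direct addition: element-wise sum, so all vectors must have equal length.
        length = len(vectors[0])
        if any(len(vector) != length for vector in vectors):
            raise ValueError("direct addition needs vectors of equal length")
        return [sum(values) for values in zip(*vectors)]
    raise ValueError(f"unknown fusion mode: {mode}")


# Toy feature values, chosen only for illustration.
features = {
    "position": [0.1, 0.2],
    "semantic": [0.3, 0.4],
    "visual": [0.5, 0.6],
}
concat = fuse_features(features, "concat")  # length 6
added = fuse_features(features, "add")      # length 2
```

Splicing keeps every modality separate at the cost of a longer vector, while direct addition keeps the dimension fixed but requires all modalities to share one dimension.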

[0107] In the embodiments of this application, at least one of the positional features, semantic features, and visual features of the text extracted from the Nth image frame is used to assist in the recognition of the text in the N+1th image frame, which can increase the amount of information in the text recognition process and further improve the accuracy of text recognition in video scenes.

[0108] As an optional implementation, another embodiment of this application discloses that the steps of the above embodiments, namely, extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence having the same text content, and performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognized text corresponding to the (N+1)th image frame, include:

[0109] A sequence of image frames with the same text content is input into a pre-trained text recognition model. The text recognition model extracts the text features of the Nth image frame from the Nth image frame in the image frame sequence, and performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, so as to obtain the recognized text corresponding to the (N+1)th image frame.
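The frame-by-frame flow described above can be sketched as a simple loop. The callables `extract_features` and `recognize` stand in for the feature extraction and text recognition parts of the model; their signatures, and the toy stand-ins that treat a "frame" as a string, are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Features = List[float]
Result = Tuple[str, float]  # (recognized text, confidence)


def recognize_sequence(
    frames: Sequence[str],
    extract_features: Callable[[str], Features],
    recognize: Callable[[str, Optional[Features]], Result],
) -> List[Result]:
    """Recognize each frame, passing in the text features of the previous frame."""
    results: List[Result] = []
    prev_features: Optional[Features] = None  # the first frame has no previous frame
    for frame in frames:
        results.append(recognize(frame, prev_features))
        prev_features = extract_features(frame)  # features of frame N, used for frame N+1
    return results


# Toy stand-ins (hypothetical): a "frame" is just a string here.
def toy_extract(frame: str) -> Features:
    return [float(len(frame))]


def toy_recognize(frame: str, prev: Optional[Features]) -> Result:
    confidence = 0.5 if prev is None else 0.9
    return frame.strip(), confidence


outputs = recognize_sequence([" shake well ", "shake well"], toy_extract, toy_recognize)
```

In this sketch the first frame is recognized without prior features, which corresponds to the optional embodiment in which text recognition is also performed on the first image frame.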

[0110] Specifically, the embodiments of this application include a pre-trained text recognition model. The text recognition model can be trained using an Encoder-Decoder framework as its base. The training samples for the text recognition model are sequences of sample image frames with the same text content, and the labels are the recognized text for each sample image frame in the sequence. During training, the training samples are input into the text recognition model, which uses an autoregressive approach to recognize the text corresponding to each sample image frame.

[0111] Figure 3 illustrates the processing flow of the text recognition model for one set of training samples. As shown in Figure 3, the text recognition model includes a text recognition network and a feature extraction network. The text recognition network recognizes the text in the first sample image frame of the sequence and outputs it, and the feature extraction network extracts the text features of the first sample image frame. Based on the text features of the first sample image frame, the text recognition network recognizes the text in the second sample image frame and outputs it, and the feature extraction network extracts the text features of the second sample image frame. Based on the text features of the second sample image frame, the text recognition network recognizes the text in the third sample image frame and outputs it, and the feature extraction network extracts the text features of the third sample image frame. This process is repeated until the text recognition network, based on the text features of the second-to-last sample image frame, recognizes the text in the last sample image frame and outputs it.

[0112] The cross-entropy loss is calculated based on the label of each sample image frame and the corresponding output result. Then, with the aim of reducing the cross-entropy loss, the parameters of the text recognition network and/or the feature extraction network are updated using the backpropagation (BP) algorithm.
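As a rough illustration of the loss named above, the following sketch computes the mean cross-entropy between predicted probability distributions and integer labels. How the model output is laid out (one distribution per predicted character or per frame) is not specified in this application; the layout used here, and the function name `cross_entropy`, are assumptions.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy of predicted distributions against integer class labels."""
    if len(probabilities) != len(labels):
        raise ValueError("one label is required per predicted distribution")
    if not labels:
        raise ValueError("at least one prediction is required")
    total = 0.0
    for probs, label in zip(probabilities, labels):
        # Clamp to avoid log(0) when the model assigns zero probability to the label.
        total += -math.log(max(probs[label], 1e-12))
    return total / len(labels)


# Two toy predictions over a two-class vocabulary (illustrative values only).
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The loss shrinks toward zero as the probability assigned to each correct label approaches 1, which is what the parameter updates aim for.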

[0113] Repeat the above training steps until the cross-entropy loss of the text recognition model is less than the set value, at which point the text recognition model training is complete.

[0114] In this embodiment, an image frame sequence with the same text content is input into a pre-trained text recognition model. The pre-trained text recognition model can recognize and output the text corresponding to each image frame in the image frame sequence.

[0115] In the above embodiments, the pre-trained text recognition model is used to recognize each image frame in the image frame sequence. This not only results in fast recognition speed, but also in higher accuracy of the text recognition model as the number of training samples set during the training process increases.

[0116] As an optional implementation, another embodiment of this application discloses that the step in the above embodiments of determining the recognition text corresponding to the image frame sequence, based on the recognition text corresponding to each image frame, includes:

[0117] Detect the confidence level of the recognized text corresponding to each image frame; determine the recognized text with the highest confidence level as the recognized text corresponding to the image frame sequence.

[0118] In this embodiment, the confidence level of the recognition text corresponding to each image frame can be detected, and the recognition text with the highest confidence level can be used as the recognition text corresponding to the image frame sequence, which can improve the accuracy of text content recognition in the image frame sequence.
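The selection rule can be sketched in a few lines. The tuple layout `(text, confidence)` and the sample values are assumptions for illustration; the zero confidence of the first entry mirrors the case in which text recognition is not performed on the first image frame.

```python
from typing import List, Tuple


def select_sequence_text(frame_results: List[Tuple[str, float]]) -> str:
    """Return the recognized text with the highest confidence in the sequence."""
    if not frame_results:
        raise ValueError("the image frame sequence has no recognition results")
    best_text, _ = max(frame_results, key=lambda result: result[1])
    return best_text


# Illustrative per-frame results: (recognized text, confidence).
frame_results = [
    ("", 0.0),                       # first frame not recognized: confidence zero
    ("Shake well before use", 0.82),
    ("Shake we11 before use", 0.64),
]
sequence_text = select_sequence_text(frame_results)
```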

[0119] As an optional implementation, another embodiment of this application discloses that the image frame sequence with the same text content in the above embodiments is extracted from video.

[0120] Specifically, the aforementioned video refers to a video containing text content to be identified. This embodiment analyzes each frame of the video, extracting a sequence of image frames with the same text content to facilitate text recognition of these image frame sequences. Recognizing text content from multiple image frame sequences with identical text content reduces interference from other image frames with different text content, thus improving the accuracy of text recognition in video scenarios.

[0121] As an optional implementation, as shown in Figure 4, another embodiment of this application discloses a process for extracting image frame sequences with the same text content, which specifically includes the following steps:

[0122] S401. Extract video frames from the video whose text content overlap is higher than a set overlap threshold, and form a video frame sequence.

[0123] In this embodiment, each frame in the video can be analyzed first to detect the overlap of text content between video frames. If the overlap of text content in several video frames is higher than a set overlap threshold, then these video frames with text content overlap exceeding the set overlap threshold can be combined into a video frame sequence.

[0124] For example, image features of video frames can be extracted, and the number of identical image features in different video frames can be detected. If the number of identical image features in several different video frames is higher than a set overlap threshold, then the aforementioned several different video frames are combined into a video frame sequence. The aforementioned set overlap threshold can be set according to actual conditions, and this embodiment does not limit it.

[0125] Another example is that each video frame can be analyzed on a text-group basis to determine whether there are text groups with the same text content in different video frames. If several video frames contain text groups with the same text content, it indicates that these several video frames have overlapping content, which is the aforementioned text groups with the same text content. The ratio of the overlapping content to all the content in each of the aforementioned several video frames is calculated as the overlap degree. Video frames with an overlap degree higher than a set overlap degree threshold are grouped into a video frame sequence. The aforementioned set overlap degree threshold can be set according to actual conditions, and this embodiment does not limit it.
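The text-group-based overlap degree described above can be sketched as follows, with each video frame represented by the list of its text-group strings. The function name `overlap_degrees`, the sample text groups, and the threshold 0.7 are assumptions for illustration.

```python
from typing import List


def overlap_degrees(frames: List[List[str]]) -> List[float]:
    """For each frame, the share of its text groups that appear in every frame."""
    if not frames:
        return []
    # Overlapping content: text groups present in all of the given frames.
    shared = set(frames[0]).intersection(*(set(frame) for frame in frames[1:]))
    return [len(shared) / len(set(frame)) if frame else 0.0 for frame in frames]


frame_a = ["Ingredients", "Water", "Sugar"]
frame_b = ["Ingredients", "Water", "Sugar", "Salt"]
degrees = overlap_degrees([frame_a, frame_b])

# Hypothetical threshold: both frames exceed it, so they would form one video frame sequence.
same_sequence = all(degree > 0.7 for degree in degrees)
```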

[0126] It should be noted that if there is a video frame whose text content overlaps with other video frames in the video by less than a set overlap threshold, then such a video frame can be used to form a separate video frame sequence, or such a video frame can be discarded.
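One possible way to form video frame sequences is sketched below. It assumes the overlap decision has already been made for each pair of adjacent frames, and it chains adjacent frames into runs; both the chaining strategy and the names `group_frames` and `keep_singletons` are assumptions, not a statement of how the embodiment must be implemented.

```python
from typing import List


def group_frames(overlaps: List[bool], keep_singletons: bool = True) -> List[List[int]]:
    """Group frame indices into sequences.

    overlaps[i] is True when frames i and i+1 overlap above the set threshold,
    so len(overlaps) + 1 frames are grouped.
    """
    sequences: List[List[int]] = []
    current = [0]
    for index, linked in enumerate(overlaps):
        if linked:
            current.append(index + 1)
        else:
            sequences.append(current)
            current = [index + 1]
    sequences.append(current)
    if not keep_singletons:
        # Frames that overlap with no neighbour may be discarded instead.
        sequences = [sequence for sequence in sequences if len(sequence) > 1]
    return sequences


# Five frames: 0-1 overlap, 2 stands alone, 3-4 overlap.
overlaps = [True, False, False, True]
kept = group_frames(overlaps)
dropped = group_frames(overlaps, keep_singletons=False)
```

The `keep_singletons` flag reflects the two options noted above: an isolated frame can either form its own video frame sequence or be discarded.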

[0127] S402. Extract a sequence of image frames with the same text content from a video frame sequence.

[0128] Extract regions containing text groups with the same text content from a video frame sequence to form an image frame sequence with the same text content.

[0129] It should be noted that if there exists an image frame whose text group is different from that of other image frames, then such an image frame can be used to form a separate video frame sequence, or it can be discarded. If such an image frame is chosen to be used to form a separate video frame sequence, then when recognizing that image frame, only text recognition is performed on that image frame to obtain the corresponding recognized text, and this recognized text is used as the recognized text for the video frame sequence.

[0130] In this embodiment, the video frame sequence is first extracted from all video frames, and then the image frame sequence is extracted from the video frame sequence. This reduces the workload of text recognition and improves the overall recognition efficiency.

[0131] As an optional implementation, as shown in Figure 5, another embodiment of this application discloses that the step in the above embodiments of extracting, from the video, video frames whose text content overlap is higher than a set overlap threshold may specifically include the following steps:

[0132] S501. Perform text recognition on the set characters of each text group in each video frame in the video to obtain the recognized characters corresponding to each text group.

[0133] The aforementioned "set characters" refer to characters at set positions and in a set number. The set positions and number can be chosen according to actual conditions, and this embodiment does not impose any limitations. For example, the first two characters or the last two characters of each text group may be recognized. In this embodiment, text recognition is performed on the set characters of each text group in each video frame to obtain the recognized characters corresponding to each text group. The text recognition can use the recognition model of the above embodiment or OCR technology; this embodiment does not impose any limitations.

[0134] For example, Figure 6 shows video frame 61. If each line is treated as a text group and the set characters are defined as the first two characters of each line, then text recognition is performed on the first two characters of each line in video frame 61 to obtain the recognized characters 62 shown in Figure 6.

[0135] S502. Traverse the recognition characters corresponding to each text group in adjacent video frames to determine the number of target text groups with the same recognition characters in adjacent video frames.

[0136] Because videos are continuous, the overlap of text content between adjacent video frames is more likely to exceed a set overlap threshold. Therefore, this embodiment iterates through the recognized characters corresponding to each text group in adjacent video frames to determine the number of target text groups with the same recognized characters in adjacent video frames.

[0137] Take the embodiment shown in Figure 6 as an example. In Figure 6, the first and second frames are adjacent video frames, and the second and third frames are adjacent video frames. By traversing the recognized characters corresponding to the text groups in the first and second frames of Figure 6, it can be determined that the number of target text groups with the same recognized characters in the first and second frames is 3; by traversing the recognized characters corresponding to the text groups in the second and third frames of Figure 6, it can be determined that the number of target text groups with the same recognized characters in the second and third frames is 0.
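Steps S501 and S502 can be sketched as follows. Here each text group is represented by a string that stands in for the recognition output, and only its set characters are compared. The helper names and the sample text are assumptions for illustration; the sample frames are made up and only reproduce the counts of 3 and 0 from the example.

```python
from typing import List


def set_characters(text_group: str, count: int = 2, position: str = "first") -> str:
    """Return the set characters of a text group: its first or last `count` characters."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return text_group[:count] if position == "first" else text_group[-count:]


def count_target_groups(frame_a: List[str], frame_b: List[str], count: int = 2) -> int:
    """Count text groups of frame_a whose set characters also occur in frame_b."""
    keys_b = {set_characters(group, count) for group in frame_b}
    return sum(1 for group in frame_a if set_characters(group, count) in keys_b)


# Made-up frames, one string per text group (line).
frame_1 = ["Store in a cool place", "Keep away from light", "Use within 30 days"]
frame_2 = ["Store in a cool place", "Keep away from light", "Use within 30 days"]
frame_3 = ["Net weight 500 g", "Best before end"]

first_pair = count_target_groups(frame_1, frame_2)
second_pair = count_target_groups(frame_2, frame_3)
```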

[0138] S503. If the number of target text groups in adjacent video frames reaches the set condition, then the adjacent video frames are determined to be video frames whose text content overlap is higher than the set overlap threshold.

[0139] If the number of target text groups in adjacent video frames meets the set condition, then the adjacent video frames can be determined to be video frames with a text content overlap higher than the set overlap threshold. The set condition can be chosen according to the actual situation. For example, the condition may be that the number of target text groups exceeds a set percentage of the total number of text groups in each of the adjacent video frames, such as more than 90% of the total number of text groups in each of the adjacent video frames. Alternatively, the condition may be that the area occupied by the target text groups exceeds a set percentage of the total area of the text in each of the adjacent video frames, such as more than 90% of the total area of the text in each of the adjacent video frames. This embodiment does not limit this.

[0140] For example, if the number of target text groups exceeds 90% of the total number of text groups in each of the adjacent video frames, then the adjacent video frames are determined to be video frames whose text content overlap is higher than the set overlap threshold. Taking the embodiment shown in Figure 6 as an example, the number of target text groups with the same recognized characters in the first and second frames is 3. The ratio of the number of target text groups to the number of text groups in the first frame is 100%, and the ratio of the number of target text groups to the number of text groups in the second frame is 100%. Both are greater than 90%; therefore, the first and second frames in Figure 6 are determined to be video frames whose text content overlap is higher than the set overlap threshold. The number of target text groups with the same recognized characters in the second and third frames is 0, so the ratio of the number of target text groups to the number of text groups in the second frame is 0, and the ratio to the number of text groups in the third frame is also 0. Both are less than 90%; therefore, the second and third frames in Figure 6 are determined not to be video frames whose text content overlap is higher than the set overlap threshold.
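The count-based condition in this example can be written as a short check. The function name `meets_set_condition` and its parameters are assumptions for illustration; the number of text groups in the third frame is not stated in the example, so a placeholder value is used.

```python
def meets_set_condition(
    num_target: int, groups_a: int, groups_b: int, min_ratio: float = 0.9
) -> bool:
    """True when the target text groups exceed min_ratio of the groups in BOTH frames."""
    if groups_a == 0 or groups_b == 0:
        return False  # a frame without text groups cannot overlap with anything
    return num_target / groups_a > min_ratio and num_target / groups_b > min_ratio


# First and second frames: 3 target groups, 3 text groups each (100% and 100%).
first_pair_overlaps = meets_set_condition(3, 3, 3)

# Second and third frames: 0 target groups; the third frame's group count (2) is a placeholder.
second_pair_overlaps = meets_set_condition(0, 3, 2)
```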

[0141] In the above embodiments, it is only necessary to identify a set number of characters in each text group to determine whether adjacent video frames are video frames whose text content overlap is higher than a set overlap threshold. It is not necessary to identify all characters in the video frame, so the recognition speed is fast and the efficiency is high.

[0142] As an optional implementation, another embodiment of this application discloses that if the number of target text groups in adjacent video frames reaches a set condition, the adjacent video frames are determined to be video frames with a text content overlap higher than a set overlap threshold. Specifically, this may include the following steps:

[0143] Calculate the ratio of the number of target text groups in adjacent video frames to the maximum number of text groups in adjacent video frames; if the ratio is greater than a set value, it means that the number of target text groups in adjacent video frames has reached the set condition, and the adjacent video frames are determined to be video frames with text content overlap higher than the set overlap threshold.

[0144] In this embodiment, only the ratio of the number of target text groups in adjacent video frames to the maximum number of text groups in adjacent video frames is calculated. This is sufficient because the ratio of the number of target text groups to the minimum number of text groups in adjacent video frames is always greater than or equal to the ratio of the number of target text groups to the maximum number of text groups. Therefore, if the ratio of the number of target text groups to the maximum number of text groups in adjacent video frames is greater than a set value, it can be determined that the number of target text groups in adjacent video frames has reached the set condition.

[0145] For example, suppose the number of target text groups in adjacent video frames is 5, one video frame contains 5 text groups, and the other contains 6 text groups, so the maximum number of text groups in the adjacent video frames is 6. The ratio of the number of target text groups (5) to the maximum number of text groups (6) is calculated, giving approximately 0.83. If 0.83 is greater than the set value, it can be determined that the number of target text groups in adjacent video frames meets the set condition. For example, if the set value is 0.7, the number of target text groups in adjacent video frames meets the set condition, and the adjacent video frames are determined to be video frames whose text content overlap is higher than the set overlap threshold.
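The arithmetic of this example can be reproduced with a small sketch. The function name `overlaps_by_max_ratio` and its parameter names are assumptions for illustration.

```python
def overlaps_by_max_ratio(
    num_target: int, groups_a: int, groups_b: int, set_value: float = 0.7
) -> bool:
    """Compare the target-group count against the larger of the two frames' group counts."""
    largest = max(groups_a, groups_b)
    if largest == 0:
        return False
    return num_target / largest > set_value


# 5 target text groups; the frames contain 5 and 6 text groups, so the maximum is 6.
ratio = 5 / max(5, 6)                      # approximately 0.83
decision = overlaps_by_max_ratio(5, 5, 6)  # compared against the set value 0.7
```

Only one division is needed per pair of frames, because a ratio that clears the set value against the larger count also clears it against the smaller one.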

[0146] In the above embodiments, only the ratio of the number of target text groups in adjacent video frames to the maximum number of text groups in adjacent video frames is calculated. Based on this ratio, it is determined whether the number of target text groups in adjacent video frames meets the set conditions, which can reduce the amount of calculation and improve the overall recognition speed.

[0147] As an optional implementation, another embodiment of this application discloses that the steps of the above embodiments to extract image frame sequences with the same text content from a video frame sequence may specifically include the following steps:

[0148] The system identifies text in adjacent video frames that is within a set distance threshold range from the target text group. If the text in adjacent video frames that is within the set distance threshold range from the target text group is the same, the system extracts the region containing the target text group from the video frame sequence to form an image frame sequence.

[0149] Specifically, in the above embodiments, only the set characters in the text group are recognized. If the set characters are the same, the recognized text groups are considered to be the same text groups. However, in the actual recognition process, there are special cases where the set characters are the same, but the recognized text groups are not the same. If the image frame sequence generated based on such text groups is used for text recognition, it is difficult to obtain the correct recognition result.

[0150] To avoid the above-mentioned situations affecting the recognition results, this embodiment adds a step to verify the target text group. If the target text group in adjacent video frames is the same, then the text in adjacent video frames that is within the set distance threshold range from the target text group must also be the same.

[0151] Based on this, the verification process detects whether the texts in adjacent video frames that are within the set distance threshold range of the target text group are the same. If they are the same, then the target text groups determined in the above embodiment can be confirmed to be the same text group, and the region where the target text group is located can be extracted from the video frame sequence to form an image frame sequence.

[0152] The distance mentioned above can be Euclidean distance. The distance threshold range and the number of texts within the distance threshold range can be determined according to the actual situation. For example, the distance threshold range can be a specific distance range, or the texts closest to or farthest from the target text group can be detected. The number of texts within the distance threshold range can be one or more, which is not limited in this embodiment.

[0153] In a specific implementation scenario, the three texts closest to the target text group in adjacent video frames are detected. If the closest texts to the target text group in adjacent video frames are the same, the second closest texts are the same, and the third closest texts are also the same, then the target text groups in adjacent video frames are determined to be the same. The region where the target text group is located is extracted from the video frame sequence to form an image frame sequence.

[0154] As shown in Figure 7, in two adjacent video frames, the three texts closest to the target text group in the first frame are "no", "this", and "water", and the three texts closest to the target text group in the second frame are also "no", "this", and "water". Therefore, it can be determined that the target text groups in the first frame and the second frame are the same text group.
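The verification step can be sketched as a nearest-neighbour comparison using Euclidean distance. Representing each text as a `(string, (x, y))` pair with a single centre point, and the coordinates used below, are assumptions for illustration; only the three texts "no", "this", and "water" come from the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
PlacedText = Tuple[str, Point]  # (text, centre position in the frame)


def nearest_texts(target: Point, texts: List[PlacedText], k: int = 3) -> List[str]:
    """Return the k texts closest to the target position, nearest first."""
    ranked = sorted(texts, key=lambda item: math.dist(target, item[1]))
    return [text for text, _ in ranked[:k]]


def same_target_group(
    target_a: Point, texts_a: List[PlacedText],
    target_b: Point, texts_b: List[PlacedText], k: int = 3,
) -> bool:
    """The target groups match when their k nearest texts agree rank by rank."""
    return nearest_texts(target_a, texts_a, k) == nearest_texts(target_b, texts_b, k)


# Made-up positions; the second frame is shifted by (5, 5) relative to the first.
frame_1_texts = [("no", (1.0, 0.0)), ("this", (0.0, 2.0)), ("water", (3.0, 0.0)), ("far", (10.0, 10.0))]
frame_2_texts = [("no", (6.0, 5.0)), ("this", (5.0, 7.0)), ("water", (8.0, 5.0)), ("other", (20.0, 20.0))]

verified = same_target_group((0.0, 0.0), frame_1_texts, (5.0, 5.0), frame_2_texts)
```

Comparing the ranked lists rather than unordered sets follows the description above, in which the closest, second closest, and third closest texts are each required to match.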

[0155] In the above embodiments, the addition of a step to verify the target text group can avoid situations where text groups have the same set characters, but the recognized text groups are not the same, thereby improving the accuracy of video frame sequence recognition.

[0156] As an optional implementation, another embodiment of this application discloses the text recognition method of the above embodiments, which may specifically include the following steps:

[0157] The recognized text corresponding to all image frame sequences with the same text content in the video is combined to obtain the recognized text corresponding to the video.

[0158] Specifically, the recognized text corresponding to the image frame sequences extracted from the same video frame sequence is arranged according to the writing order of the text content in the video. The recognized text corresponding to the image frame sequences extracted from different video frame sequences is arranged according to the shooting time order of the video to obtain the recognized text corresponding to the video.
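The two-level ordering described above can be sketched with a single sort. The record type `SequenceText`, its field names, and the sample text are assumptions for illustration; in particular, the writing order is assumed to be available as an integer index.

```python
from typing import List, NamedTuple


class SequenceText(NamedTuple):
    video_sequence_start: float  # shooting time of the parent video frame sequence, in seconds
    writing_index: int           # position in the writing order within that video frame sequence
    text: str                    # recognized text of the image frame sequence


def combine_video_text(items: List[SequenceText]) -> str:
    """Order by shooting time across video frame sequences, then by writing order within one."""
    ordered = sorted(items, key=lambda item: (item.video_sequence_start, item.writing_index))
    return "\n".join(item.text for item in ordered)


items = [
    SequenceText(4.0, 0, "Ingredients: water, sugar"),
    SequenceText(0.0, 1, "Shake well before use"),
    SequenceText(0.0, 0, "Directions"),
]
combined = combine_video_text(items)
```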

[0159] The order in which the text content is written can be input by the user or obtained through semantic recognition detection; this embodiment does not impose any limitations on this.

[0160] In the above embodiments, the text content of multiple image frame sequences with the same text content is identified, and then the identified text is combined together, which can improve the accuracy of text recognition in video scenes.

[0161] Furthermore, the technical solution of this embodiment can also recognize text content on curved surfaces, such as instructions and ingredient lists on bottles. Users can record videos containing text content on curved surfaces, for example, first recording a video of the text content on the left side, then recording a video of the text content on the right side, thus obtaining a video containing the text content. Then, using the text recognition method of this embodiment, the recognized text corresponding to the video can be obtained.

[0162] Exemplary device

[0163] Corresponding to the above-described text recognition method, this application also discloses a text recognition device. As shown in Figure 8, the device includes:

[0164] The extraction module 100 is used to perform step S1, extracting the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content; where N is a positive integer;

[0165] The recognition module 110 is used to perform step S2, which is to perform text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, and obtain the recognized text corresponding to the (N+1)th image frame.

[0166] The repeat module 120 is used to execute step S3, let N = N + 1, and control the extraction module to repeatedly execute the above step S1 and the recognition module to repeatedly execute the above step S2 until N is equal to the sequence number of the image frame sequence. Then, based on the recognition text corresponding to each image frame, the recognition text corresponding to the image frame sequence is determined.

[0167] As an optional implementation, another embodiment of this application discloses that, if N=1, the text recognition device of the above embodiments further includes:

[0168] The text recognition module is used to perform text recognition on the first image frame in the image frame sequence to obtain the recognized text corresponding to the first image frame.

[0169] As an optional implementation, another embodiment of this application discloses that the text features of the Nth image frame include at least one of the positional features, semantic features, and visual features of the text in the Nth image frame.

[0170] As an optional implementation, another embodiment of this application discloses an extraction module 100, comprising:

[0171] The first extraction unit is used to extract feature information from the Nth image frame in the image frame sequence with the same text content. The feature information includes at least one of the text's positional features, semantic features, and visual features.

[0172] The fusion unit is used to fuse the extracted feature information to obtain the text features of the Nth image frame.

[0173] As an optional implementation, another embodiment of this application discloses that when the extraction module 100 extracts the text features of the Nth image frame from the Nth image frame in an image frame sequence with the same text content, and the recognition module 110 performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognized text corresponding to the (N+1)th image frame, it is specifically used for:

[0174] A sequence of image frames with the same text content is input into a pre-trained text recognition model. The text recognition model extracts the text features of the Nth image frame from the Nth image frame in the image frame sequence, and performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, so as to obtain the recognized text corresponding to the (N+1)th image frame.

[0175] As an optional implementation, another embodiment of this application discloses a repeating module 120, comprising:

[0176] The detection unit is used to detect the confidence level of the recognized text corresponding to each image frame;

[0177] The determining unit is used to determine the recognition text with the highest confidence level as the recognition text corresponding to the image frame sequence.

[0178] As an optional implementation, another embodiment of this application discloses that the text recognition device of the above embodiments further includes:

[0179] The image frame sequence extraction module is used to extract image frame sequences with the same text content from a video.

[0180] As an optional implementation, another embodiment of this application discloses the image frame sequence extraction module of the above embodiments, which includes:

[0181] The second extraction unit is used to extract video frames from the video whose text content overlap is higher than a set overlap threshold, and form a video frame sequence.

[0182] The third extraction unit is used to extract a sequence of image frames with the same text content from a video frame sequence.

[0183] As an optional implementation, another embodiment of this application discloses that when the second extraction unit of the above embodiments extracts video frames from the video whose text content overlap is higher than a set overlap threshold, it is specifically used for:

[0184] Text recognition is performed on the set characters of each text group in each video frame to obtain the recognized characters corresponding to each text group; the recognized characters corresponding to each text group in adjacent video frames are traversed to determine the number of target text groups with the same recognized characters in adjacent video frames; if the number of target text groups in adjacent video frames reaches the set condition, the adjacent video frames are determined to be video frames with text content overlap higher than the set overlap threshold.

[0185] As an optional implementation, another embodiment of this application discloses that when the second extraction unit of the above embodiments determines that the adjacent video frames are video frames with a text content overlap higher than a set overlap threshold if the number of target text groups in adjacent video frames reaches a set condition, it is specifically used for:

[0186] Calculate the ratio of the number of target text groups in adjacent video frames to the maximum number of text groups in adjacent video frames; if the ratio is greater than a set value, it means that the number of target text groups in adjacent video frames has reached the set condition, and the adjacent video frames are determined to be video frames with text content overlap higher than the set overlap threshold.

[0187] As an optional implementation, another embodiment of this application discloses that when the third extraction unit of the above embodiments extracts an image frame sequence with the same text content from a video frame sequence, it is specifically used for:

[0188] The system identifies text in adjacent video frames that is within a set distance threshold range from the target text group. If the text in adjacent video frames that is within the set distance threshold range from the target text group is the same, the system extracts the region containing the target text group from the video frame sequence to form an image frame sequence.

[0189] As an optional implementation, another embodiment of this application discloses that the text recognition device of the above embodiments further includes:

[0190] The combination module is used to combine the recognized text corresponding to all image frame sequences with the same text content in the video to obtain the recognized text corresponding to the video.

[0191] For details on the specific operation of each unit of the aforementioned text recognition device, please refer to the above method embodiments; they will not be repeated here.

[0192] Exemplary electronic devices, computer products, and storage media

[0193] Corresponding to the above text recognition method, this application also discloses an electronic device. As shown in Figure 9, the electronic device includes:

[0194] Memory 200 and processor 210;

[0195] The memory 200 is connected to the processor 210 and is used to store programs;

[0196] The processor 210 is configured to implement the text recognition method disclosed in any of the above embodiments by running a program stored in the memory 200.

[0197] Specifically, the aforementioned electronic device may also include: a bus, a communication interface 220, an input device 230, and an output device 240.

[0198] The processor 210, memory 200, communication interface 220, input device 230, and output device 240 are interconnected via a bus. Among them:

[0199] A bus can include a pathway for transmitting information between various components of a computer system.

[0200] The processor 210 can be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, or it can be an application-specific integrated circuit (ASIC) or one or more integrated circuits used to control the execution of the program of the present application. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.

[0201] Processor 210 may include a main processor, as well as a baseband chip, modem, etc.

[0202] The memory 200 stores a program for executing the technical solution of this application, and may also store an operating system and other critical business functions. Specifically, the program may include program code, which includes computer operation instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, etc.

[0203] Input device 230 may include a device for receiving user input data and information, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor.

[0204] Output device 240 may include devices that allow information to be output to a user, such as a display screen, printer, speaker, etc.

[0205] The communication interface 220 may include a device that uses any transceiver to communicate with other devices or communication networks, such as Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.

[0206] The processor 210 executes the program stored in the memory 200 and calls other devices, which can be used to implement the various steps of the text recognition method provided in the above embodiments of this application.

[0207] In addition to the methods and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by processor 210, cause processor 210 to perform the various steps of the text recognition method provided in the above embodiments.

[0208] Computer program products can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0209] Furthermore, embodiments of this application may also be computer-readable storage media storing computer program instructions which, when executed by a processor, cause the processor to perform the various steps of the text recognition method provided in the above embodiments.

[0210] Computer-readable storage media may take the form of any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may, for example, include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0211] Specifically, the specific working content of each part of the aforementioned electronic device, computer program product, and storage medium, as well as the specific processing content of the computer program product or the computer program on the aforementioned storage medium when run by the processor, can all be found in the various embodiments of the aforementioned text recognition method, and will not be repeated here.

[0212] For the foregoing method embodiments, in order to simplify the description, they are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0213] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are the same or similar across embodiments, the embodiments may be referred to one another. Because the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the descriptions in the method embodiments.

[0214] The steps in the methods of the various embodiments of this application can be reordered, merged, or deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

[0215] The modules and sub-modules in the various embodiments of the present application's devices and terminals can be merged, divided, and deleted according to actual needs.

[0216] In the several embodiments provided in this application, it should be understood that the disclosed terminals, devices, and methods can be implemented in other ways. For example, the terminal embodiments described above are merely illustrative: the division of modules or sub-modules is only a logical functional division, and other divisions are possible in actual implementation. For instance, multiple sub-modules or modules may be combined or integrated into another module, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.

[0217] The modules or submodules described as separate components may or may not be physically separate. The components that constitute a module or submodule may or may not be physical modules or submodules; that is, they may be located in one place or distributed across multiple network modules or submodules. Some or all of the modules or submodules can be selected to achieve the purpose of this embodiment's solution, depending on actual needs.

[0218] Furthermore, the functional modules or sub-modules in the various embodiments of this application can be integrated into one processing module, or each module or sub-module can exist physically separately, or two or more modules or sub-modules can be integrated into one module. The integrated modules or sub-modules described above can be implemented in hardware or in the form of software functional modules or sub-modules.

[0219] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0220] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software unit executed by a processor, or a combination of both. The software unit can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0221] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0222] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text recognition method, characterized in that it comprises: S1. extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence with the same text content, where N is a positive integer; the extraction process of the image frame sequence with the same text content includes: extracting, from a video, video frames whose text content overlap is higher than a set overlap threshold to form a video frame sequence; identifying text in adjacent video frames whose distance from a target text group is within a set distance threshold range, where the target text group is a text group in the adjacent video frames that has the same recognized characters, the recognized characters of each video frame in the adjacent video frames are obtained by performing text recognition on the set characters of each text group in the video frame, a text group is text of a preset length in the text content of each video frame, and the set characters are characters at a preset position and of a preset number; and, if the text in the adjacent video frames whose distance from the target text group is within the set distance threshold range is the same, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence; S2. based on the text features of the Nth image frame, performing text recognition on the (N+1)th image frame in the image frame sequence to obtain the recognized text corresponding to the (N+1)th image frame; S3. letting N = N + 1 and repeating steps S1 and S2 until N equals the sequence number of the image frame sequence, and then determining the recognized text corresponding to the image frame sequence based on the recognized text corresponding to each image frame.

2. The method according to claim 1, characterized in that, if N = 1, the method further includes: performing text recognition on the first image frame in the image frame sequence to obtain the recognized text corresponding to the first image frame.

3. The method according to claim 1, characterized in that the text features of the Nth image frame include at least one of the positional features, semantic features, and visual features of the text in the Nth image frame.

4. The method according to claim 3, characterized in that extracting the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content includes: extracting feature information from the Nth image frame in the image frame sequence with the same text content, wherein the feature information includes at least one of the positional features, semantic features, and visual features of the text; and fusing the extracted feature information to obtain the text features of the Nth image frame.

5. The method according to claim 1, characterized in that extracting the text features of the Nth image frame from the Nth image frame in the image frame sequence with the same text content, and performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognized text corresponding to the (N+1)th image frame, includes: inputting the image frame sequence with the same text content into a pre-trained text recognition model, so that the text recognition model extracts the text features of the Nth image frame from the Nth image frame in the image frame sequence and performs text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame, so as to obtain the recognized text corresponding to the (N+1)th image frame.

6. The method according to claim 1, characterized in that determining the recognized text corresponding to the image frame sequence based on the recognized text corresponding to each image frame includes: detecting the confidence level of the recognized text corresponding to each image frame; and determining the recognized text with the highest confidence level as the recognized text corresponding to the image frame sequence.

7. The method according to claim 1, characterized in that the method further includes: extracting the image frame sequence with the same text content from the video.

8. The method according to claim 1, characterized in that extracting, from the video, video frames whose text content overlap is higher than the set overlap threshold includes: performing text recognition for each text group in each video frame of the video to obtain the recognized characters corresponding to each text group; traversing the recognized characters corresponding to each text group in adjacent video frames to determine the number of target text groups with the same recognized characters in the adjacent video frames; and, if the number of target text groups in the adjacent video frames reaches a set condition, determining the adjacent video frames to be video frames whose text content overlap is higher than the set overlap threshold.

9. The method according to claim 8, characterized in that determining the adjacent video frames to be video frames whose text content overlap is higher than the set overlap threshold if the number of target text groups in the adjacent video frames reaches the set condition includes: calculating the ratio of the number of target text groups in the adjacent video frames to the maximum number of text groups in the adjacent video frames; and, if the ratio is greater than a set value, which indicates that the number of target text groups in the adjacent video frames has reached the set condition, determining the adjacent video frames to be video frames whose text content overlap is higher than the set overlap threshold.

10. The method according to claim 1, characterized in that the method further includes: combining the recognized text corresponding to all image frame sequences with the same text content in the video to obtain the recognized text corresponding to the video.

11. A text recognition device, characterized in that it comprises: an extraction module, used to perform step S1, extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence with the same text content, where N is a positive integer; the extraction process of the image frame sequence with the same text content includes: extracting, from a video, video frames whose text content overlap is higher than a set overlap threshold to form a video frame sequence; identifying text in adjacent video frames whose distance from a target text group is within a set distance threshold range, where the target text group is a text group in the adjacent video frames that has the same recognized characters, the recognized characters of each video frame in the adjacent video frames are obtained by performing text recognition on the set characters of each text group in the video frame, a text group is text of a preset length in the text content of each video frame, and the set characters are characters at a preset position and of a preset number; and, if the text in the adjacent video frames whose distance from the target text group is within the set distance threshold range is the same, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence; a recognition module, used to perform step S2, performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognized text corresponding to the (N+1)th image frame; and a repeating module, used to perform step S3, letting N = N + 1 and controlling the extraction module to repeat step S1 and the recognition module to repeat step S2 until N equals the sequence number of the image frame sequence, and then determining the recognized text corresponding to the image frame sequence based on the recognized text corresponding to each image frame.

12. An electronic device, characterized in that it comprises: a memory and a processor; the memory is used to store programs; the processor is configured to implement the text recognition method as described in any one of claims 1 to 10 by running a program in the memory.

13. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the text recognition method as described in any one of claims 1 to 10.
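
As an illustration of the iteration in steps S1 to S3 of claim 1, the first-frame case of claim 2, and the confidence-based selection of claim 6, the following is a minimal Python sketch. The callables `extract_features` and `recognize` are hypothetical stand-ins for the pre-trained text recognition model; their signatures, and the `(text, confidence)` return shape, are assumptions made for illustration and are not specified in this application.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Result = Tuple[str, float]  # (recognized text, confidence), an assumed shape


def recognize_sequence(
    frames: Sequence[Any],
    extract_features: Callable[[Any], Any],
    recognize: Callable[[Any, Optional[Any]], Result],
) -> List[Result]:
    """Recognize every frame, feeding each frame the features of the one before it."""
    if not frames:
        return []
    # Claim 2: the first frame has no predecessor, so it is recognized on its own.
    results = [recognize(frames[0], None)]
    for n in range(len(frames) - 1):
        features = extract_features(frames[n])               # S1: features of frame N
        results.append(recognize(frames[n + 1], features))   # S2: recognize frame N+1
        # S3: the loop variable advancing plays the role of N = N + 1.
    return results


def best_text(results: Sequence[Result]) -> str:
    """Claim 6: keep the recognized text with the highest confidence."""
    return max(results, key=lambda r: r[1])[0]
```

A caller would pass its own model wrappers for the two callables, for example `best_text(recognize_sequence(frames, model_features, model_recognize))`, where both wrapper names are placeholders.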

Citation Information

Patent Citations

  • CN111694978A: Image similarity detection method and device, storage medium and electronic equipment

  • CN113657399A: Character recognition model training method and device, and character recognition method and device

  • CN114332902A: Video character recognition method and device, equipment and storage medium