Text content recognition method and device, electronic equipment and storage medium

By determining the fingertip position in video frame images and performing text detection when it is stable, combined with recognition models for different text types, the problem of low accuracy and efficiency in traditional recognition methods is solved, achieving more efficient and accurate text content recognition.

CN116778374B · Active · Publication Date: 2026-06-16 · GUANGZHOU XIBEISI INTELLIGENT TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGZHOU XIBEISI INTELLIGENT TECHNOLOGY CO LTD
Filing Date
2023-05-09
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Traditional methods are inaccurate and time-consuming in identifying target objects, resulting in low efficiency.

Method used

By acquiring the fingertip position in the video frame image, candidate regions for text recognition are determined, and text detection is performed when the fingertip position is stable. Based on the stability judgment of the fingertip position, corresponding text recognition models are adopted for recognition according to different text types.

Benefits of technology

By responding to changes in user behavior within shorter timeframes, the accuracy and efficiency of text content recognition are improved.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide a text content recognition method and device, electronic equipment and storage medium. After obtaining the position of a fingertip by detecting the fingertip in a collected video frame image, the stable state of the fingertip position is confirmed, the candidate region for character recognition is determined based on the position of the fingertip, and the stable state of the fingertip position is confirmed again. While ensuring the efficiency of the candidate region for character recognition, the change in user behavior can be responded to and feedback can be given in a smaller time period, and a more accurate candidate region for character recognition is obtained. Text detection is performed on the candidate region for character recognition, a text region and a text type are obtained, and a corresponding text recognition model is used for text recognition according to different text types. The time consumption of text recognition is reduced, and the efficiency and accuracy of text recognition are improved.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a text content recognition method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the development of computer vision technology, the application scenarios for object recognition using image processing technology are increasing. In some scenarios, image processing technology can be used to identify the target object pointed to by a pointer (such as a finger, dictionary pen, etc.). For example, in the scenario of fingertip word lookup, a user can point to the word to be recognized on a reading material with their finger. Using image processing technology, the word pointed to by the user's finger can be recognized, thereby assisting the user in reading or learning new words. However, in traditional methods, the accuracy of object recognition is not high, and the overall time consumption is long, resulting in low recognition efficiency.

Summary of the Invention

[0003] Based on this, this application provides a text content recognition method, apparatus, electronic device, and storage medium that can respond to changes in user behavior and provide feedback within a shorter time period, thereby improving the accuracy and efficiency of the text content recognition method.

[0004] As a first aspect of the embodiments of this application, a text content recognition method is provided, including the following steps:

[0005] Acquire the currently captured video frame image and determine the fingertip position in the video frame image;

[0006] If the fingertip position is stable within a preset first detection time period, the text recognition candidate region of the video frame image is determined based on the fingertip position, and text detection is performed on the text recognition candidate region.

[0007] If the fingertip position remains stable during the second detection period following the first detection period, a text detection result is obtained, wherein the text detection result includes a text region and the text type corresponding to the text region;

[0008] The text region is identified according to the text recognition method corresponding to the text type, and the text recognition result of the text region is obtained.

[0009] As a second aspect of the embodiments of this application, a text content recognition device is provided, including:

[0010] The fingertip positioning module is used to acquire the currently captured video frame image and determine the fingertip position in the video frame image;

[0011] The text recognition candidate region acquisition module is used to determine the text recognition candidate region of the video frame image based on the fingertip position if the fingertip position is stable within a preset first detection time period, and to perform text detection on the text recognition candidate region.

[0012] The text detection module is used to obtain a text detection result if the fingertip position remains stable during a second detection time period after the first detection time period. The text detection result includes a text region and the text type corresponding to the text region.

[0013] The text recognition module is used to recognize the text region according to the text recognition method corresponding to the text type, and obtain the text recognition result of the text region.

[0014] As a third aspect of the present application, an electronic device is provided, including a processor, a memory, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, it implements the steps of the text content recognition method as described in the first aspect.

[0015] As a fourth aspect of the present application, a storage medium is provided, the storage medium storing a computer program, which, when executed by a processor, implements the steps of the text content recognition method as described in the first aspect.

[0016] This application embodiment acquires a currently captured video frame image and determines the fingertip position in the video frame image. If the fingertip position is stable within a preset first detection time period, a candidate region for text recognition in the video frame image is determined based on the fingertip position. While performing text detection on the candidate region, the stability of the fingertip position is reassessed within a second detection time period. This allows for a smaller time frame to respond to changes in user behavior and provide feedback, resulting in a more accurate candidate region for text recognition and thus a more accurate text region. Furthermore, by performing text detection on the candidate region to obtain the text region and text type, and by employing appropriate text recognition models for different text types, the time consumption of text recognition is reduced, and the efficiency and accuracy of text recognition are improved.

[0017] To better understand and implement this application, the following detailed description is provided in conjunction with the accompanying drawings.

Description of the Drawings

[0018] Figure 1 shows the application environment for the text content recognition method provided in the first embodiment of this application.

[0019] Figure 2 is a flowchart of the text content recognition method provided in the first embodiment of this application.

[0020] Figure 3 is a flowchart of step S2 of the text content recognition method provided in the second embodiment of this application.

[0021] Figure 4 is a flowchart of step S2 of the text content recognition method provided in the third embodiment of this application.

[0022] Figure 5 is a flowchart of step S2 of the text content recognition method provided in the fourth embodiment of this application.

[0023] Figure 6 is a flowchart of step S3 of the text content recognition method provided in the fifth embodiment of this application.

[0024] Figure 7 is a flowchart of step S3 of the text content recognition method provided in the sixth embodiment of this application.

[0025] Figure 8 is a flowchart of step S4 of the text content recognition method provided in the first embodiment of this application.

[0026] Figure 9 is a flowchart of the text content recognition method provided in the seventh embodiment of this application.

[0027] Figure 10 is a flowchart of step S4 of the text content recognition method provided in the first embodiment of this application.

[0028] Figure 11 is a schematic diagram of the structure of the text content recognition device provided in the eighth embodiment of this application.

[0029] Figure 12 is a schematic diagram of the structure of an electronic device provided in the ninth embodiment of this application.

Detailed Description

[0030] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings. Wherein, when the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements.

[0031] It should be understood that the embodiments described below do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without inventive effort are within the scope of protection of this application.

[0032] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a," "an," and "the" used in this application are also intended to include the plural forms unless the context clearly indicates otherwise. Furthermore, in the description of this application, unless otherwise stated, "a plurality" means two or more. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more associated listed items; for example, "A and/or B" can represent A alone, A and B together, or B alone. The character "/" generally indicates that the preceding and following objects are in an "or" relationship.

[0033] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish similar objects; they do not necessarily describe a specific order or sequence, nor should they be construed as indicating or implying relative importance. Those skilled in the art can understand the specific meaning of the above terms in this application according to the specific circumstances. Depending on the context, the word "if" as used in this application can be interpreted as "when," "upon," or "in response to determining."

[0034] The text content recognition method of this application can be executed by a text content recognition device (hereinafter referred to as the recognition device). The recognition device can be composed of two or more physical entities, or of a single physical entity. In terms of hardware, the recognition device is essentially a computer device. For example, the recognition device can be a computer, mobile phone, tablet, or smart interactive whiteboard.

[0035] Please refer to Figure 1, which shows the application environment for the text content recognition method provided in the first embodiment of this application. The environment includes a recognition device 20 placed on a desktop 10. The recognition device 20 has a built-in or external camera device 30. In use, the user places the text material 40 to be read on the desktop 10 and points with a finger to the text content to be recognized or translated. The camera device 30 captures the scene to obtain a video frame image. The recognition device 20 obtains the video frame image and processes it to recognize the text content pointed to by the user's finger. In some embodiments, the recognition device 20 with its built-in or external camera device 30 can also be fixed or placed in other locations. In that case, the user holds the text material 40, by hand or with another device, within the shooting range of the camera device 30 and points to the text content to be recognized or translated; capture and recognition then proceed as described above.

[0036] Please refer to Figure 2, which is a flowchart of the text content recognition method provided in the first embodiment of this application. The method includes the following steps:

[0037] S1: Obtain the currently captured video frame image and determine the fingertip position in the video frame image.

[0038] In this embodiment, the recognition device can capture a desktop video stream using a camera device built into the recognition device itself or an external camera device. The desktop video stream contains several video frame images.

[0039] The recognition device can determine the fingertip position by inputting the video frame image into a pre-trained fingertip recognition model, which can be a convolutional neural network (CNN) model, to locate the fingertip in the video frame image. The fingertip position is the position at which the fingertip is detected in the video frame image.

[0040] Specifically, the recognition device inputs the video frame image into a preset fingertip recognition model. If the fingertip recognition model detects a fingertip, it obtains the fingertip coordinates in the detected video frame image as the fingertip position. If the fingertip recognition model does not detect a fingertip, it uses preset regression coordinates as the fingertip position, where the regression coordinates are coordinate data not belonging to the video frame image. This avoids the logical error problem where the fingertip recognition model cannot output a result when it does not detect a fingertip, ensuring the accuracy and usability of the fingertip recognition model's recognition results. Furthermore, it can calculate the offset based on the fingertip coordinates, aiding in the determination of fingertip stability.
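The fallback described in [0040] can be illustrated with a minimal sketch. It is not the patent's implementation: the names `locate_fingertip`, `has_fingertip`, and `NO_FINGERTIP`, the sentinel value (-1, -1), and the `detector` callable standing in for the fingertip recognition model are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical "regression coordinates": a point that cannot lie inside any
# video frame image, used when no fingertip is detected (assumed value).
NO_FINGERTIP: Point = (-1.0, -1.0)


def locate_fingertip(frame: object,
                     detector: Callable[[object], Optional[Point]]) -> Point:
    """Return the detected fingertip coordinates, or the sentinel if none is found.

    `detector` is a stand-in for the fingertip recognition model; it is assumed
    to return None when no fingertip is present.
    """
    result = detector(frame)
    return result if result is not None else NO_FINGERTIP


def has_fingertip(position: Point) -> bool:
    """True when the position is a real detection rather than the sentinel."""
    return position != NO_FINGERTIP
```

Because a position is always returned, downstream offset calculations never have to handle a missing value; they only need to check `has_fingertip` first.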

[0041] S2: If the fingertip position is stable within a preset first detection time period, determine the text recognition candidate region of the video frame image based on the fingertip position, and perform text detection on the text recognition candidate region.

[0042] In this embodiment, if the fingertip position is stable within a preset first detection time period, the recognition device executes a process of determining the text recognition candidate region of the video frame image based on the fingertip position and performing text detection on the text recognition candidate region.

[0043] Specifically, the recognition device extends a first number of pixels to both sides of a first coordinate axis and a second number of pixels to both sides of a second coordinate axis, with the fingertip position as the center point. The first coordinate axis can be the axis containing one edge of the video frame image, and the second coordinate axis can be the axis containing the other edge of the video frame image. The first and second number of pixels can be the same or different.

[0044] The specific values of the first and second numbers of pixels can be determined in combination with, for example, the placement of the recognition device 20 shown in Figure 1, the resolution of the camera device 30, and the height of the camera device 30. Typically, when a user uses the recognition device 20, there are several common usage scenarios, such as placing it on a study desk or fixing it in a specific location. The text content to be recognized usually comes from printed materials, such as books, textbooks, or other printed texts of standard printing size. Therefore, the first number of pixels and the second number of pixels can be determined by combining the resolution of the built-in or external camera device 30 with the vertical distance of the camera device 30 from the text content pointed to by the finger in these common usage scenarios. In other embodiments, the first and second numbers of pixels can also be determined from user input.

[0045] The recognition device takes the area enclosed by the expanded positions as the text recognition candidate region of the video frame image. Specifically, when an expanded position falls outside the boundary of the video frame image, the recognition device substitutes the corresponding image boundary for that position and takes the area enclosed by this boundary and the remaining expanded positions as the text recognition candidate region. Using the image boundary in this way improves the accuracy of the determined text recognition candidate region.
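The expansion and boundary handling in [0043] to [0045] can be sketched as follows. This is an illustrative reading, assuming integer pixel coordinates with the origin at the top-left corner of the frame; the function name `candidate_region` and its parameter names are hypothetical.

```python
from typing import Tuple


def candidate_region(fingertip: Tuple[int, int],
                     frame_size: Tuple[int, int],
                     first_pixels: int,
                     second_pixels: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the text recognition candidate region.

    The region extends `first_pixels` to both sides of the fingertip along the
    first (horizontal) axis and `second_pixels` along the second (vertical)
    axis. Any side that would fall outside the frame is replaced by the
    corresponding frame boundary.
    """
    x, y = fingertip
    width, height = frame_size
    left = max(0, x - first_pixels)
    top = max(0, y - second_pixels)
    right = min(width, x + first_pixels)
    bottom = min(height, y + second_pixels)
    return left, top, right, bottom
```

For a fingertip near the top-left corner, for example `candidate_region((50, 20), (640, 480), 100, 40)`, the left and top sides are replaced by the frame boundary and the result is `(0, 0, 150, 60)`.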

[0046] S3: If the fingertip position remains stable during the second detection period following the first detection period, a text detection result is obtained.

[0047] Because a user's finger is rarely still, the stability of the fingertip position is checked again in order to ensure accurate fingertip positioning and to avoid wasting resources on subsequent recognition triggered by a false fingertip detection. In this embodiment, while the recognition device determines the text recognition candidate region of the video frame image based on the fingertip position and performs text detection on that region, it continues to judge the stability of the fingertip position in a separate thread. If the fingertip position remains stable within a second detection time period after the first detection time period, a text detection result is obtained. The text detection result includes the text region and the text type corresponding to the text region. This allows the recognition device to respond to changes in user behavior and provide feedback within a shorter time period and to obtain a more accurate text recognition candidate region, and thus a more accurate text region and text type, improving text detection efficiency.
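One way to run text detection and the second stability judgment side by side is sketched below, using Python's standard `concurrent.futures` thread pool. The patent only states that stability is judged in a separate thread; the function name `detect_if_still_stable` and the two callables passed to it are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def detect_if_still_stable(run_text_detection: Callable[[], T],
                           stable_in_second_period: Callable[[], bool]) -> Optional[T]:
    """Run text detection while the second stability judgment runs in parallel.

    Returns the text detection result if the fingertip stayed stable during the
    second detection time period; otherwise returns None so that the caller can
    discard the result and go back to acquiring a new frame.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        detection = pool.submit(run_text_detection)
        stability = pool.submit(stable_in_second_period)
        if not stability.result():
            # cancel() only takes effect if detection has not started yet;
            # a running detection finishes and its result is simply ignored.
            detection.cancel()
            return None
        return detection.result()
```

A caller that receives `None` would return to step S1; a caller that receives a result would pass it on to step S4.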

[0048] S4: According to the text recognition method corresponding to the text type, the text region is recognized to obtain the text recognition result of the text region.

[0049] In this embodiment, the recognition device identifies the text region according to the text recognition method corresponding to the text type, and obtains the text recognition result of the text region. The text recognition method can employ an OCR (Optical Character Recognition) algorithm, which acquires text image information on paper through optical input methods such as scanning and photography, analyzes the morphological features of the characters using various pattern recognition algorithms, determines the standard encoding of the Chinese characters, and stores them in a text file in a common format. Alternatively, it can employ deep learning, training preset regression-based and detection-based neural network models to obtain a target neural network model for recognizing the text region.
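Selecting a recognition method by text type can be sketched as a simple lookup. The type labels and the `recognizers` mapping below are hypothetical; the callables stand in for whichever OCR algorithm or neural network model is used for each text type.

```python
from typing import Callable, Dict, Optional

# A recognizer takes a cropped text region image and returns the recognized text.
Recognizer = Callable[[object], str]


def recognize_region(region_image: object,
                     text_type: str,
                     recognizers: Dict[str, Recognizer]) -> Optional[str]:
    """Recognize a text region with the recognizer registered for its text type.

    Returns None when no recognizer is registered for the type, for example
    when the region was classified as containing no text.
    """
    recognizer = recognizers.get(text_type)
    if recognizer is None:
        return None
    return recognizer(region_image)
```

Keeping one recognizer per type means each region is processed only by the model suited to it, which is consistent with the reduced recognition time the embodiment describes.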

[0050] This application embodiment acquires a currently captured video frame image and determines the fingertip position in the video frame image. If the fingertip position is stable within a preset first detection time period, a candidate region for text recognition in the video frame image is determined based on the fingertip position. While performing text detection on the candidate region, the stability of the fingertip position is reassessed within a second detection time period. This allows for a smaller time frame to respond to changes in user behavior and provide feedback, resulting in a more accurate candidate region for text recognition and thus a more accurate text region. Furthermore, by performing text detection on the candidate region to obtain the text region and text type, and by employing appropriate text recognition models for different text types, the time consumption of text recognition is reduced, and the efficiency and accuracy of text recognition are improved.

[0051] Please refer to Figure 3, which is a flowchart of step S2 of the text content recognition method provided in the second embodiment of this application. Step S2 further includes a step of determining that the fingertip position is stable within the preset first detection time period. This step includes S21 to S22, as follows:

[0052] S21: Obtain the first adjacent video frame image of the next frame of the video frame image, and the fingertip position in the first adjacent video frame image.

[0053] In this embodiment, the recognition device acquires the first adjacent video frame image of the next frame of the video frame image, and the fingertip position in the first adjacent video frame image. Specifically, the first detection time period can be set to 150ms. Based on the time corresponding to the video frame image, the recognition device obtains the video frame image of the next frame of the video frame image within the first detection time period, which is used as the first adjacent video frame image. The recognition device inputs the first adjacent video frame image into the fingertip recognition model for detection and recognition to obtain the fingertip position in the first adjacent video frame image.

[0054] S22: If the offset between the fingertip position of the first adjacent video frame image and the fingertip position of the video frame image is greater than or equal to a first offset threshold, it is determined that the fingertip position is not stable within a preset first detection time period. If the offset between the fingertip position of the first adjacent video frame image and the fingertip position of the video frame image is less than the first offset threshold, the first adjacent video frame image of the next frame after the first adjacent video frame image is obtained for offset determination. If the offset between the fingertip position of each first adjacent video frame image and the fingertip position of the video frame image within the first detection time period is less than the first offset threshold, it is determined that the fingertip position is stable within the preset first detection time period.

[0055] The first offset threshold can be set based on the resolution adjustment of the camera device or other factors. If the stability requirement is relatively low, the first offset threshold can be set relatively large, allowing the user to move within a relatively large range, which helps improve the user experience. If the stability requirement is relatively high, the first offset threshold can be set relatively small, restricting the user's movement to a relatively small range.

[0056] In this embodiment, the recognition device can calculate the offset between the fingertip position in the video frame image and the corresponding fingertip position in the first adjacent video frame image based on the coordinates corresponding to the fingertip position. The offset is compared with a preset first offset threshold. If the offset between the fingertip position in the first adjacent video frame image and the fingertip position in the video frame image is greater than or equal to the first offset threshold, it is determined that the fingertip position is not stable within a preset first detection time period. If the offset between the fingertip position in the first adjacent video frame image and the fingertip position in the video frame image is less than the first offset threshold, the next frame of the first adjacent video frame image is obtained for offset judgment. If the offset between the fingertip position in each of the first adjacent video frame images and the fingertip position in the video frame image within the first detection time period is less than the first offset threshold, it is determined that the fingertip position is stable within the preset first detection time period.
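The frame-by-frame stability judgment of S21 and S22 can be sketched as follows. The patent does not specify how the offset is measured; Euclidean distance in pixels is assumed here, and the names `offset` and `is_stable` are hypothetical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def offset(a: Point, b: Point) -> float:
    """Distance between two fingertip positions (Euclidean distance is an assumption)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def is_stable(reference: Point,
              adjacent_positions: Iterable[Point],
              offset_threshold: float) -> bool:
    """Judge stability over one detection time period.

    `reference` is the fingertip position in the initial video frame image and
    `adjacent_positions` are the fingertip positions in the adjacent frames
    that fall within the period, in time order. The check stops at the first
    frame whose offset reaches the threshold.
    """
    for position in adjacent_positions:
        if offset(position, reference) >= offset_threshold:
            return False
    return True
```

The same function can serve the second detection time period by passing the second offset threshold and the second adjacent video frame images.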

[0057] Please refer to Figure 4, which is a flowchart of step S2 of the text content recognition method provided in the third embodiment of this application. Step S2 further includes a step of determining that the fingertip position is stable within the preset first detection time period. This step includes S23, as follows:

[0058] S23: If the fingertip position is not stable within the preset first detection time period, return to the step of obtaining the currently acquired video frame image and determining the fingertip position in the video frame image.

[0059] In this embodiment, if the fingertip position is not stable within a preset first detection time period, the recognition device returns to the step of obtaining the currently acquired video frame image and determining the fingertip position in the video frame image, which improves the response speed of fingertip position positioning.

[0060] Please refer to Figure 5, which is a flowchart of step S2 of the text content recognition method provided in the fourth embodiment of this application. Step S2 further includes steps S24 to S25, as follows:

[0061] S24: Obtain the fingertip speed within a preset first detection time period.

[0062] In this embodiment, the recognition device obtains the fingertip speed within a preset first detection time period. Specifically, the recognition device can extract a target adjacent video frame from several first adjacent video frames corresponding to the first detection time period. The target adjacent video frame can be the last first adjacent video frame ordered by time.

[0063] The recognition device calculates the fingertip speed within a preset first detection time period based on the fingertip position in the video frame image, the fingertip position in the last first adjacent video frame, and the time difference. This reflects the speed at which the user is currently moving their fingertip.

[0064] S25: Based on the fingertip speed and a preset fingertip speed threshold, when the fingertip speed is greater than the fingertip speed threshold, the first offset threshold is set to be greater than the second offset threshold; when the fingertip speed is less than or equal to the fingertip speed threshold, the first offset threshold is set to be less than or equal to the second offset threshold.

[0065] Due to user habits, fingertip movement speed may be relatively fast. In order to better match user habits and achieve fast and accurate text detection, in this embodiment, when the fingertip speed is greater than the fingertip speed threshold, the recognition device sets the first offset threshold to be greater than the second offset threshold, so that the user can quickly pass the first stability judgment, improving efficiency. Furthermore, based on the second offset threshold which is less than the first offset threshold, a second stability judgment is performed to improve the accuracy of text detection.

[0066] When the fingertip speed is less than or equal to the fingertip speed threshold, the recognition device sets the first offset threshold to be less than or equal to the second offset threshold. The first stability judgment thus adopts the stricter standard, improving the accuracy of text detection, while the second stability judgment, based on a second offset threshold that is no smaller than the first, improves efficiency.
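Steps S24 and S25 can be sketched as below. The speed is computed from the first and last fingertip positions in the first detection time period, and the ordering of the two offset thresholds follows step S25. The concrete `loose` and `strict` values, the function names, and the use of Euclidean distance are assumptions; the patent fixes only the relative ordering of the thresholds.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def fingertip_speed(start: Point, end: Point, elapsed_seconds: float) -> float:
    """Average fingertip speed in pixels per second over the detection period."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return math.hypot(end[0] - start[0], end[1] - start[1]) / elapsed_seconds


def choose_offset_thresholds(speed: float,
                             speed_threshold: float,
                             loose: float,
                             strict: float) -> Tuple[float, float]:
    """Return (first_offset_threshold, second_offset_threshold).

    Assumes loose >= strict. A fast fingertip gets the looser threshold for the
    first judgment and the stricter one for the second; a slow fingertip gets
    the reverse order.
    """
    if speed > speed_threshold:
        return loose, strict
    return strict, loose
```

For example, with `loose=12` and `strict=6`, a speed above the speed threshold yields `(12, 6)` and a speed at or below it yields `(6, 12)`.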

[0067] Please refer to Figure 6, which is a flowchart of step S3 of the text content recognition method provided in the fifth embodiment of this application. Step S3 further includes a step of determining that the fingertip position is stable during the second detection time period after the first detection time period. This step includes S31 to S32, as follows:

[0068] S31: Obtain the second adjacent video frame image of the next frame of the video frame image, and the fingertip position in the second adjacent video frame image.

[0069] In this embodiment, the recognition device acquires the second adjacent video frame image of the next frame of the video frame image, and the fingertip position in the second adjacent video frame image. For the specific implementation, please refer to step S21; it is not repeated here.

[0070] S32: If the offset between the fingertip position of the second adjacent video frame image and the fingertip position of the video frame image is greater than or equal to the second offset threshold, it is determined that the fingertip position is not stable during the second detection time period after the first detection time period. If the offset between the fingertip position of the second adjacent video frame image and the fingertip position of the video frame image is less than the second offset threshold, the second adjacent video frame image of the next frame after the second adjacent video frame image is obtained for offset determination. If the offset between the fingertip position of each second adjacent video frame image and the fingertip position of the video frame image during the second detection time period after the first detection time period is less than the second offset threshold, it is determined that the fingertip position is stable during the second detection time period after the first detection time period.

[0071] In this embodiment, the recognition device calculates the offset between the fingertip position in the video frame image and the corresponding fingertip position in the second adjacent video frame image, and compares this offset with a preset second offset threshold. If the offset between the fingertip position in the second adjacent video frame image and the fingertip position in the video frame image is greater than or equal to the second offset threshold, it is determined that the fingertip position is not stable within the second detection time period following the first detection time period. If the offset is less than the second offset threshold, the recognition device obtains the next frame after the second adjacent video frame image for offset determination. If the offset between the fingertip position in each of the second adjacent video frame images and the fingertip position in the video frame image within the second detection time period following the first detection time period is less than the second offset threshold, it is determined that the fingertip position is stable within the second detection time period following the first detection time period.

[0072] Please refer to Figure 7, which shows the flowchart of S3 in the text content recognition method provided in the sixth embodiment of this application. S3 also includes step S33, as follows:

[0073] S33: If the fingertip position is not stable in the second detection time period after the first detection time period, discard the text detection result, return to the step of obtaining the currently acquired video frame image, and determining the fingertip position in the video frame image.

[0074] In this embodiment, if the fingertip position is not stable within the second detection time period after the first detection time period, the text detection result is discarded, and the process returns to the step of obtaining the currently acquired video frame image and determining the fingertip position in the video frame image. Taking the moment at which the fingertip position was determined to be not stable within the second detection time period as the starting point, the text recognition candidate region of the video frame image at that moment is determined again. In this way, changes in user behavior can be responded to, and feedback given, within a shorter time period, which yields a more accurate text recognition candidate region and improves the accuracy and efficiency of text detection.
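
The two-stage stability confirmation with restart described above can be sketched as follows. This is an illustrative sketch only: measuring both detection time periods as frame counts, using the Euclidean offset, and the function name `confirm_pointing` are assumptions that are not stated in this application.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def confirm_pointing(
    positions: Sequence[Point],
    first_frames: int,
    second_frames: int,
    first_threshold: float,
    second_threshold: float,
) -> Optional[int]:
    """Return the index of the first reference frame that passes both periods.

    `positions[i]` is the fingertip position detected in frame i. Returns None
    if no reference frame is followed by two stable detection periods.
    """
    total = first_frames + second_frames
    start = 0
    while start + total < len(positions):
        reference = positions[start]
        unstable_at = None
        for i in range(total):
            index = start + 1 + i
            # The first `first_frames` frames form the first period, the rest the second.
            # In a real pipeline, text detection would be launched once the first
            # period passes and its result discarded if the second period fails.
            limit = first_threshold if i < first_frames else second_threshold
            x, y = positions[index]
            if math.hypot(x - reference[0], y - reference[1]) >= limit:
                unstable_at = index
                break
        if unstable_at is None:
            return start
        # Restart from the frame at which instability was detected.
        start = unstable_at
    return None
```

Restarting from the frame at which instability is detected, rather than waiting for the full window to elapse, is what allows a change in user behavior to be picked up within a shorter time period.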

[0075] Please refer to Figure 8, which shows the flowchart of S4 in the text content recognition method provided in the first embodiment of this application, including steps S41 to S42, as follows:

[0076] S41: Input the candidate region for character recognition into a preset text detection model, perform text detection on the candidate region for character recognition, and obtain the initial text region in the candidate region for character recognition, as well as the text type corresponding to the initial text region.

[0077] The text detection model includes a segmentation module and a classification module. The segmentation module uses the DBNet text detection network, and the classification module includes detection channels for the handwritten, printed, and non-text types.

[0078] In this embodiment, the recognition device inputs the candidate region for character recognition into the text detection model, uses the DBNet segmentation algorithm to segment the text region in the candidate region for character recognition, obtains the initial text region in the candidate region for character recognition, and obtains the text type corresponding to the initial text region based on the detection channel corresponding to each preset text type. The text types include the handwritten type, the printed type, and the non-text type, and the initial text region includes several characters.
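
One possible way of deriving a text type from the per-type detection channels is sketched below. The application does not specify how the channel outputs are combined with the segmented region, so averaging each channel over the region's pixels and taking the highest mean is an assumption, as are the names `TEXT_TYPES` and `region_text_type`.

```python
from typing import Dict, List

# The three detection channels named in the text.
TEXT_TYPES = ("handwritten", "printed", "non-text")


def region_text_type(
    channel_scores: Dict[str, List[List[float]]],
    mask: List[List[int]],
) -> str:
    """Pick the type whose channel has the highest mean score inside the region.

    `channel_scores[t][r][c]` is the (hypothetical) score of type t at pixel
    (r, c); `mask[r][c]` is 1 where the segmentation module marked text.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        return "non-text"  # nothing was segmented, so treat the region as non-text
    means = {
        t: sum(channel_scores[t][r][c] for r, c in pixels) / len(pixels)
        for t in TEXT_TYPES
    }
    return max(TEXT_TYPES, key=lambda t: means[t])
```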

[0079] S42: By finding the minimum bounding polygon, each character in the initial text region is expanded to obtain the expanded region of each character in the initial text region, which is used as the text region.

[0080] In this embodiment, the recognition device expands each character of the initial text region by finding the minimum bounding polygon, obtaining the expanded region of each character in the initial text region, which serves as the text region. Specifically, the recognition device can call the minimum bounding polygon function of OpenCV to obtain the expanded region of the initial text region.
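
As a rough illustration of step S42, the sketch below computes an enclosing polygon for a character's pixel coordinates and enlarges it. It is written in plain Python so that it is self-contained; the convex hull is used here as a stand-in for the "minimum bounding polygon", and the scaling about the mean of the vertices with a `scale` factor is an assumption. OpenCV offers routines such as `cv2.convexHull` and `cv2.minAreaRect` for this kind of task, but the text does not say which function is called.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull by Andrew's monotone chain; the starting vertex is the lexicographically smallest point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def expand_polygon(polygon: List[Point], scale: float) -> List[Point]:
    """Scale a polygon about the mean of its vertices (scale > 1 enlarges it)."""
    cx = sum(x for x, _ in polygon) / len(polygon)
    cy = sum(y for _, y in polygon) / len(polygon)
    return [(cx + (x - cx) * scale, cy + (y - cy) * scale) for x, y in polygon]
```

For example, `expand_polygon(convex_hull(character_pixels), 1.2)` would give each character a margin of roughly 20 percent, where `character_pixels` and the factor 1.2 are hypothetical.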

[0081] Please refer to Figure 9, which is a flowchart of the text content recognition method provided in the seventh embodiment of this application. The method also includes steps S5 to S7 for training the text detection model, which are performed before step S4, as follows:

[0082] S5: Obtain several training sample images and the corresponding label data for the training sample images.

[0083] Training sample images can be images containing fingertips and text content pointed to by those fingertips. In this embodiment, the recognition device obtains several training sample images and corresponding label data for each training sample image. The label data includes the initial target text region of each training sample image and the text type corresponding to that initial target text region. The initial target text region can be an initial text region identified manually, and the text type corresponding to that initial target text region can be a text type annotated manually.

[0084] S6: Input several training sample images into the text detection model to obtain the initial sample text region of each training sample image and the text type corresponding to the initial sample text region.

[0085] In this embodiment, the recognition device inputs several training sample images into the text detection model to obtain the initial sample text region of each training sample image and the text type corresponding to the initial sample text region.

[0086] S7: Calculate the text region loss based on the initial sample text region and the initial target text region corresponding to each of the training sample images; calculate the text type loss based on the text types corresponding to the initial sample text region and to the initial target text region of each of the training sample images; and train the text detection model by combining the text region loss and the text type loss to obtain the target text detection model.

[0087] Text region loss is a parameter used to characterize the difference between the initial sample text region and the initial target text region of a training sample image. It reflects the amount of loss in the text region of the training sample image during the detection process. Text region loss can be calculated and determined in various ways. For example, in some embodiments, the difference between the initial target text region and the initial sample text region can be used as the text region loss. In other embodiments, based on the initial target text region and the initial sample text region, other methods can be used to calculate the text region loss, such as mean squared error.

[0088] Text type loss is a parameter used to characterize the difference between the text type corresponding to the initial sample text region and the text type corresponding to the initial target text region of a training sample image. It reflects the amount of text type loss in the training sample images during the detection process. Text type loss can be calculated and determined in various ways. For example, in some embodiments, the difference between the text type corresponding to the initial target text region and the text type corresponding to the initial sample text region can be used as the text type loss. In other embodiments, based on the text types corresponding to the initial target text region and the initial sample text region, other methods can be used to calculate the text type loss, such as mean squared error.

[0089] In this embodiment, the recognition device calculates the text region loss based on the initial sample text region and the initial target text region corresponding to each of the training sample images, and calculates the text type loss based on the text types corresponding to the initial sample text region and to the initial target text region of each of the training sample images. The text detection model is trained by combining the text region loss and the text type loss to obtain the target text detection model.
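
The combination of the two losses in step S7 can be illustrated with the following sketch, which uses mean squared error for both terms as one of the options mentioned above. The weighted sum, the `type_weight` parameter, the flattened region representation, and the one-hot type encoding are assumptions; the application only states that the two losses are combined.

```python
from typing import Sequence


def mean_squared_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean of squared element-wise differences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def detection_loss(
    sample_region: Sequence[float],
    target_region: Sequence[float],
    sample_type: Sequence[float],
    target_type: Sequence[float],
    type_weight: float = 1.0,
) -> float:
    """Combine the text region loss and the text type loss into one training loss.

    Regions are flattened per-pixel maps; types are score vectors over the
    handwritten, printed, and non-text classes (one-hot for the label).
    """
    region_loss = mean_squared_error(sample_region, target_region)
    type_loss = mean_squared_error(sample_type, target_type)
    return region_loss + type_weight * type_loss
```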

[0090] Please refer to Figure 10, which shows the flowchart of S4 in the text content recognition method provided in the first embodiment of this application, including steps S43 to S44, as follows:

[0091] S43: Based on the text type and the preset correspondence between text type and text recognition model, obtain the text recognition model corresponding to the text type.

[0092] To improve the efficiency and accuracy of text recognition, in this embodiment, the recognition device pre-sets a correspondence between text types and text recognition models. Based on the text type and the pre-set correspondence between text types and text recognition models, the text recognition model corresponding to the text type is obtained. The correspondence includes several different text types and text recognition models corresponding to each different text type.

[0093] Specifically, if the text type is handwritten, due to the varying sizes and overlapping characteristics of characters within the handwritten text region, the recognition device employs a classic CNN+Bi-LSTM model for text recognition. The Bi-LSTM model learns the text features of the handwritten text region to minimize the negative impact of handwriting on text recognition, resulting in more accurate text recognition results for the handwritten text region. If the text type is printed, compared to handwritten text, the size structure of characters in the printed text region is more stable and easily distinguishable. Furthermore, there is no overlapping between characters like in handwritten text. Therefore, compared to handwritten text, the recognition device can use only a CNN model without needing to combine it with a Bi-LSTM module to learn the text features of the printed text region. This ensures accurate recognition of the printed text region while obtaining the text recognition result more quickly. If the text type is non-text, the recognition device discards the text region and returns to the step of locating the fingertip in the video frame image to obtain the fingertip position in the video frame image.

[0094] Compared to general text recognition models, such as CRNN (Convolutional Recurrent Neural Network), this embodiment can use corresponding text recognition models for different text types, making full use of the characteristics of different text types and improving the efficiency and accuracy of text recognition.

[0095] S44: Based on the text region and the text recognition model corresponding to the text type, the text region is recognized to obtain the text recognition result.

[0096] In this embodiment, the recognition device identifies the text region based on the text region and the text recognition model corresponding to the text type, thereby obtaining the text recognition result. By employing appropriate text recognition models for different text types, the time consumed in text recognition is reduced, and the efficiency and accuracy of text recognition are improved.
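
The preset correspondence between text types and text recognition models in steps S43 and S44 can be pictured as a lookup table. In the sketch below, the two recognizer functions are hypothetical placeholders that only return a tagged string; a real system would call the CNN+Bi-LSTM and CNN-only models described above.

```python
from typing import Callable, Dict, Optional

Recognizer = Callable[[str], str]


def recognize_handwritten(region: str) -> str:
    """Placeholder for the CNN+Bi-LSTM model used for handwritten text."""
    return f"cnn_bilstm({region})"


def recognize_printed(region: str) -> str:
    """Placeholder for the lighter CNN-only model used for printed text."""
    return f"cnn({region})"


# Preset correspondence between text type and text recognition model.
RECOGNIZERS: Dict[str, Recognizer] = {
    "handwritten": recognize_handwritten,
    "printed": recognize_printed,
}


def recognize(region: str, text_type: str) -> Optional[str]:
    """Dispatch the text region to the model registered for its text type.

    Returns None for the non-text type (or any unregistered type), in which
    case the caller discards the region and returns to fingertip positioning.
    """
    recognizer = RECOGNIZERS.get(text_type)
    if recognizer is None:
        return None
    return recognizer(region)
```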

[0097] Please refer to Figure 11, which is a schematic diagram of the structure of a text content recognition device provided in the eighth embodiment of this application. The device can implement all or part of the text content recognition method through software, hardware, or a combination of both. The text content recognition device 11 includes:

[0098] The fingertip positioning module 111 is used to acquire the currently captured video frame image and determine the fingertip position in the video frame image;

[0099] The text recognition candidate region acquisition module 112 is used to determine the text recognition candidate region of the video frame image based on the fingertip position if the fingertip position is stable within a preset first detection time period, and to perform text detection on the text recognition candidate region.

[0100] The text detection module 113 is used to obtain a text detection result if the fingertip position remains stable during a second detection time period after the first detection time period. The text detection result includes a text region and the text type corresponding to the text region.

[0101] The text recognition module 114 is used to recognize the text region according to the text recognition method corresponding to the text type, and obtain the text recognition result of the text region.

[0102] In this embodiment, a fingertip positioning module acquires a currently captured video frame image and determines the fingertip position within the video frame image. A text recognition candidate region acquisition module, if the fingertip position is stable within a preset first detection time period, determines a text recognition candidate region for the video frame image based on the fingertip position and performs text detection on the candidate region. A text detection module, if the fingertip position remains stable within a second detection time period after the first detection time period, obtains a text detection result, which includes a text region and the text type corresponding to the text region. A text recognition module, according to the text recognition method corresponding to the text type, recognizes the text region and obtains the text recognition result for the text region. After obtaining the fingertip position by detecting the fingertip in the captured video frame image, the stable state of the fingertip position is confirmed. The candidate region for text recognition is determined based on the fingertip position. At the same time, the stable state of the fingertip position is confirmed again. While ensuring the efficiency of text recognition candidate region extraction, it can respond to changes in user behavior and make feedback in a shorter time period, obtaining more accurate text recognition candidate regions. Text detection is performed on the text recognition candidate regions to obtain the text region and text type. For different text types, corresponding text recognition models are used to perform text recognition, reducing the time consumption of text recognition and improving the efficiency and accuracy of text recognition.

[0103] Please refer to Figure 12, which is a schematic diagram of the structure of an electronic device provided in the ninth embodiment of this application. This application also provides an electronic device, including: a processor 121, a memory 122, and a computer program 123 stored in the memory 122 and executable on the processor; the electronic device can store multiple instructions, which are applicable to being loaded by the processor and executing the method steps of embodiments one to five above. For the specific execution process, please refer to the detailed descriptions of embodiments one to five, which will not be repeated here.

[0104] The processor may include one or more processing cores. The processor 121 connects to various parts within the electronic device via various interfaces and lines. It executes various functions of the text content recognition device 11 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 122, and by calling data stored in the memory 122. Optionally, the processor 121 may be implemented in at least one of the following hardware forms: a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA). The processor 121 may integrate one or more of the following: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content to be displayed on the touch screen; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 121 and may be implemented as a separate chip.

[0105] The memory 122 may include random access memory (RAM) or read-only memory. Optionally, the memory 122 may include a non-transitory computer-readable storage medium. The memory 122 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 122 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions), instructions for implementing the various method embodiments described above, etc.; the data storage area may store data involved in the various method embodiments described above, etc. Optionally, the memory 122 may also be at least one storage device located remotely from the aforementioned processor 121.

[0106] This application also provides a storage medium that can store multiple instructions. These instructions are applicable to being loaded and executed by a processor using the method steps described in Embodiments 1 to 5 above. For details of the execution process, please refer to the specific descriptions of Embodiments 1 to 5, which will not be repeated here.

[0107] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0108] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0109] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0110] In the embodiments provided in this application, it should be understood that the disclosed devices / terminal equipment and methods can be implemented in other ways. For example, the device / terminal equipment embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling or direct coupling or communication connection may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0111] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0112] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0113] If the integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms.

[0114] This application is not limited to the above-described embodiments. If any modifications or variations to this application do not depart from the spirit and scope of this application, and if such modifications and variations fall within the scope of the claims and equivalent technologies of this application, then this application also intends to include such modifications and variations.

Claims

1. A text content recognition method, characterized in that it includes the following steps: acquiring the currently captured video frame image and determining the fingertip position in the video frame image; obtaining the fingertip speed within a preset first detection time period; when the fingertip speed is greater than the fingertip speed threshold, setting the first offset threshold to be greater than the second offset threshold, and when the fingertip speed is less than or equal to the fingertip speed threshold, setting the first offset threshold to be less than or equal to the second offset threshold; if the fingertip position is determined to be stable within the preset first detection time period based on the first offset threshold, determining the text recognition candidate region of the video frame image according to the fingertip position, and performing text detection on the text recognition candidate region; if, based on the second offset threshold, it is determined that the fingertip position remains stable during the second detection time period following the first detection time period, obtaining a text detection result, wherein the text detection result includes a text region and the text type corresponding to the text region; and recognizing the text region according to the text recognition method corresponding to the text type to obtain the text recognition result.

2. The text content recognition method according to claim 1, characterized in that obtaining the text detection result includes the following steps: inputting the candidate region for character recognition into a preset text detection model to perform text detection on the candidate region for character recognition, thereby obtaining the initial text region in the candidate region for character recognition and the text type corresponding to the initial text region, wherein the initial text region includes several characters; and, by finding the minimum bounding polygon, expanding each character in the initial text region to obtain the expanded region of each character in the initial text region, which is then used as the text region.

3. The text content recognition method according to claim 2, characterized in that it also includes a step of training the text detection model, which includes: obtaining several training sample images and corresponding label data for the training sample images, wherein the label data includes the initial target text region of each training sample image and the text type corresponding to the initial target text region; inputting the several training sample images into the text detection model to obtain the initial sample text region of each training sample image and the text type corresponding to the initial sample text region; calculating the text region loss based on the initial sample text region and the initial target text region corresponding to each of the training sample images; calculating the text type loss based on the text types corresponding to the initial sample text region and to the initial target text region of each of the training sample images; and combining the text region loss and the text type loss to train the text detection model and obtain the target text detection model.

4. The text content recognition method according to claim 1, characterized in that the step of recognizing the text region according to the text recognition method corresponding to the text type to obtain the text recognition result includes the following steps: obtaining the text recognition model corresponding to the text type based on the text type and the preset correspondence between text types and text recognition models; and recognizing the text region based on the text region and the text recognition model corresponding to the text type to obtain the text recognition result.

5. The text content recognition method according to claim 1, characterized in that it also includes the following step: if the fingertip position is not stable during the second detection time period after the first detection time period, discarding the text detection result and returning to the step of obtaining the currently acquired video frame image and determining the fingertip position in the video frame image.

6. The text content recognition method according to claim 1, characterized in that it also includes the following step: if the fingertip position is not stable within the preset first detection time period, returning to the step of acquiring the currently collected video frame image and determining the fingertip position in the video frame image.

7. The text content recognition method according to any one of claims 1 to 6, characterized in that it also includes a step of determining that the fingertip position is in a stable state within a preset first detection time period, which includes: obtaining the first adjacent video frame image of the next frame of the video frame image, and the fingertip position in the first adjacent video frame image; if the offset between the fingertip position of the first adjacent video frame image and the fingertip position of the video frame image is greater than or equal to a first offset threshold, determining that the fingertip position is not in a stable state within the preset first detection time period; if the offset between the fingertip position of the first adjacent video frame image and the fingertip position of the video frame image is less than the first offset threshold, obtaining the first adjacent video frame image of the next frame after the first adjacent video frame image for offset determination; and if the offset between the fingertip position of each first adjacent video frame image and the fingertip position of the video frame image within the first detection time period is less than the first offset threshold, determining that the fingertip position is in a stable state within the preset first detection time period.

8. The text content recognition method according to claim 7, characterized in that it also includes a step of determining that the fingertip position is in a stable state during a second detection time period after the first detection time period, which includes: obtaining the second adjacent video frame image of the next frame of the video frame image, and the fingertip position in the second adjacent video frame image; if the offset between the fingertip position of the second adjacent video frame image and the fingertip position of the video frame image is greater than or equal to a second offset threshold, determining that the fingertip position is not in a stable state during the second detection time period after the first detection time period; if the offset between the fingertip position of the second adjacent video frame image and the fingertip position of the video frame image is less than the second offset threshold, obtaining the second adjacent video frame image of the next frame after the second adjacent video frame image for offset determination; and if the offset between the fingertip position of each second adjacent video frame image and the fingertip position of the video frame image during the second detection time period after the first detection time period is less than the first offset threshold, determining that the fingertip position is in a stable state during the second detection time period after the first detection time period.

9. A text content recognition device, characterized in that it includes: a fingertip positioning module, used to acquire the currently captured video frame image and determine the fingertip position in the video frame image, wherein the fingertip speed within a preset first detection time period is obtained, and when the fingertip speed is greater than the fingertip speed threshold, the first offset threshold is set to be greater than the second offset threshold, and when the fingertip speed is less than or equal to the fingertip speed threshold, the first offset threshold is set to be less than or equal to the second offset threshold; a text recognition candidate region acquisition module, used to determine the text recognition candidate region of the video frame image based on the fingertip position when the fingertip position is determined to be stable within the preset first detection time period based on the first offset threshold, and to perform text detection on the text recognition candidate region; a text detection module, used to determine, based on the second offset threshold, that the fingertip position remains stable during a second detection time period after the first detection time period, and to obtain a text detection result, wherein the text detection result includes a text region and the text type corresponding to the text region; and a text recognition module, used to recognize the text region according to the text recognition method corresponding to the text type, and obtain the text recognition result of the text region.

10. An electronic device, characterized in that it includes: a processor, a memory, and a computer program stored in the memory and executable on the processor; the computer program, when executed by the processor, implements the steps of the text content recognition method as described in any one of claims 1 to 8.

11. A storage medium, characterized in that the storage medium stores a computer program, which, when executed by a processor, implements the steps of the text content recognition method as described in any one of claims 1 to 8.
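
For illustration only, the speed-dependent relationship between the two offset thresholds recited in claims 1 and 9 can be sketched as follows. The claims fix only the ordering of the thresholds; the `base_threshold` and `ratio` parameters and the function name `select_offset_thresholds` are assumptions introduced here.

```python
from typing import Tuple


def select_offset_thresholds(
    fingertip_speed: float,
    speed_threshold: float,
    base_threshold: float,
    ratio: float = 1.5,
) -> Tuple[float, float]:
    """Return (first_offset_threshold, second_offset_threshold).

    A fast fingertip gets a first threshold greater than the second; otherwise
    the first threshold is less than or equal to the second.
    """
    if ratio <= 1.0:
        raise ValueError("ratio must be greater than 1 to keep the required ordering")
    if fingertip_speed > speed_threshold:
        return base_threshold * ratio, base_threshold
    return base_threshold, base_threshold * ratio
```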