Method, device and equipment for performance evaluation of an OCR system and readable storage medium
By matching the text recognition results and annotation results of the OCR system in video scenarios and calculating the recognition accuracy parameters, the lack of performance evaluation of OCR systems in video scenarios is solved, and high-precision text detection and tracking are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GRG BANKING EQUIPMENT CO LTD
- Filing Date
- 2023-07-28
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies lack performance evaluation methods for OCR systems in video scenarios, making it impossible to effectively assess missed detections, false detections, and multiple detections in text recognition within videos.
The invention proposes a performance evaluation method for OCR systems. The text recognition results of video frames are matched with the annotation results, and a text recognition accuracy parameter is calculated for each matching pair to evaluate the performance of the OCR system.
A high-precision OCR system performance evaluation was achieved in video scenarios, which can detect missed detections, false detections, and multiple detections, thereby improving the accuracy of text recognition and tracking.
Figure CN116978032B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of optical character recognition technology, and in particular to a performance evaluation method, apparatus, device, and readable storage medium for an OCR system.
Background Technology
[0002] Optical Character Recognition (OCR) is a crucial research area in computer vision, possessing immense value in real-world production environments. With the advancement and widespread adoption of artificial intelligence, OCR technology is being applied in an increasing number of scenarios. The use of OCR for text recognition is no longer limited to high-definition images; many scenarios involving video data are also incorporating OCR technology for intelligent transformation, such as video text retrieval and video text information verification. However, the development of any technology relies heavily on a comprehensive evaluation system. It is precisely the tireless pursuit of evaluation metrics by experts that has driven the continuous iteration and development of OCR technology.
[0003] Currently, evaluation metrics for OCR technology are limited to images and target only single subtasks such as OCR detection or OCR recognition. In practical applications, however, what is deployed is usually a complete, business-oriented OCR system, and an evaluation method for OCR systems in video scenarios is still lacking.
Summary of the Invention
[0004] This invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the first objective of this invention is to propose a performance evaluation method for an OCR system. The method aggregates the text recognition results and the text annotation results of the same captured scene in a video, and then matches the aggregated text recognition results with the text annotation results, thus achieving text recognition and text tracking. By calculating a first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, the method can identify missed detections, false detections, and multiple detections. Therefore, by decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is achieved.
[0005] The second objective of this invention is to provide a performance evaluation device for an OCR system.
[0006] The third objective of this invention is to provide an electronic device.
[0007] The fourth objective of this invention is to provide a computer-readable storage medium.
[0008] To achieve the above objectives, a first aspect of the present invention proposes a performance evaluation method for an OCR system, the method comprising: inputting multiple video frames into an OCR system for text recognition to obtain $N_D$ text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer; obtaining $N_G$ text annotation targets based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer; matching the $N_D$ text recognition targets with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$; calculating a first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair; and evaluating the performance of the OCR system based on each first recognition accuracy parameter.
[0009] According to the performance evaluation method of the OCR system of the present invention, multiple video frames are input into the OCR system for text recognition to obtain $N_D$ text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer; $N_G$ text annotation targets are obtained based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer; the $N_D$ text recognition targets are matched with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$; a first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair is calculated; and the performance of the OCR system is evaluated based on each first recognition accuracy parameter. In this way, the text recognition results and the text annotation results of the same captured scene in a video are each aggregated, and the aggregated text recognition results are then matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections can be identified. By decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is realized.
[0010] According to one embodiment of the present invention, matching the $N_D$ text recognition targets with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs includes: calculating the distance between each text recognition target and each of the $N_G$ text annotation targets, and performing matching based on the distances to obtain $N_{mapped}$ matching pairs.
[0011] According to one embodiment of the present invention, calculating the distance between each text recognition target and each of the $N_G$ text annotation targets includes: for each video frame, calculating the overlap between the first text region corresponding to text recognition target j and the second text region corresponding to text annotation target i, where j = 1, 2, …, $N_D$ and i = 1, 2, …, $N_G$; and accumulating these overlaps over the video frames, the accumulated overlap being taken as the distance between text recognition target j and text annotation target i.
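The accumulation described in this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the source only says "overlap", so intersection-over-union (IoU) of axis-aligned boxes is an assumption, and the names `Box`, `iou`, and `accumulated_overlap` are hypothetical.

```python
from typing import Dict, Tuple

# Assumption: a text region is an axis-aligned box (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union, used here as the assumed overlap measure."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accumulated_overlap(rec: Dict[int, Box], ann: Dict[int, Box]) -> float:
    """Sum the per-frame overlap over the frames in which both targets appear.

    `rec` and `ann` map a frame index to the region of one text recognition
    target and one text annotation target, respectively.
    """
    return sum(iou(rec[t], ann[t]) for t in rec.keys() & ann.keys())


# Recognition target j appears in frames 5, 7, 9; annotation target i in 5, 7, 10.
rec_target = {5: (0, 0, 10, 10), 7: (0, 0, 10, 10), 9: (0, 0, 10, 10)}
ann_target = {5: (0, 0, 10, 10), 7: (5, 0, 15, 10), 10: (0, 0, 10, 10)}
distance = accumulated_overlap(rec_target, ann_target)
```

Note that, under this reading, a larger accumulated value means a closer match, so the "distance" behaves like a similarity score.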
[0012] According to one embodiment of the present invention, the Hungarian algorithm is used to perform distance matching calculations.
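As an illustration of this step, the sketch below applies a standard pure-Python Hungarian (Kuhn-Munkres) routine to a matrix of accumulated overlaps. Several points are assumptions rather than statements from the source: the overlaps are negated because the routine minimizes cost, pairs whose accumulated overlap does not exceed `min_overlap` are discarded (one possible reason $N_{mapped}$ can be smaller than $\min(N_D, N_G)$), and the names `hungarian_min`, `match_targets`, and `min_overlap` are hypothetical.

```python
from typing import List, Tuple


def hungarian_min(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns sorted (row, column) pairs; every row receives exactly one column.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: 1-based row assigned to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


def match_targets(overlap: List[List[float]],
                  min_overlap: float = 0.0) -> List[Tuple[int, int]]:
    """Pair recognition targets (rows) with annotation targets (columns)."""
    transposed = len(overlap) > len(overlap[0])
    mat = [list(col) for col in zip(*overlap)] if transposed else overlap
    pairs = hungarian_min([[-x for x in row] for row in mat])  # maximize overlap
    if transposed:
        pairs = [(c, r) for r, c in pairs]
    # Assumption: pairs without real overlap are not counted as matches.
    return sorted((j, i) for j, i in pairs if overlap[j][i] > min_overlap)


pairs = match_targets([[2.5, 0.1, 0.0], [0.2, 0.0, 1.8]])
```

Under these assumptions, recognition targets left without a pair would be candidates for false or multiple detections, and unpaired annotation targets candidates for missed detections.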
[0013] According to one embodiment of the present invention, calculating a first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair includes: determining at least one target video frame in which the text recognition target in matching pair k matches the text annotation target in matching pair k, where k = 1, 2, …, $N_{mapped}$; for each target video frame, calculating the product of the overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k, and the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k; and, based on each product and the number of target video frames, calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k.
[0014] According to one embodiment of the present invention, calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k based on each product and the number of target video frames includes: calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k according to a first formula, wherein the first formula is:
[0015]
[0016] Here, $SingleAcc_k$ is the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k. The remaining symbols of the first formula denote, in order: the number of target video frames in which the text recognition target in matching pair k is matched; the first text region corresponding to the text recognition target in matching pair k within target video frame t; the second text region corresponding to the text annotation target in matching pair k within target video frame t; and the matching degree between the first text content corresponding to the text recognition target and the second text content corresponding to the text annotation target within target video frame t.
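The first formula itself is not reproduced in this text. One reading consistent with the description ("each product and the number of target video frames") is the mean, over the target video frames, of region overlap multiplied by content matching degree. The sketch below implements that reading only as an assumption; `single_acc` and its input layout are hypothetical, and the overlap and matching-degree values are taken as already computed.

```python
from typing import Sequence, Tuple


def single_acc(per_frame: Sequence[Tuple[float, float]]) -> float:
    """Assumed form of the per-pair accuracy for one matching pair k.

    `per_frame` holds one (region_overlap, content_matching_degree) tuple per
    target video frame; the result is the average of their products.
    """
    if not per_frame:
        return 0.0  # assumption: a pair with no target frames scores zero
    return sum(overlap * match for overlap, match in per_frame) / len(per_frame)


# Three target frames: perfect, half-overlapping, and partly misread text.
frames = [(1.0, 1.0), (0.5, 1.0), (0.8, 0.5)]
acc = single_acc(frames)
```

With this form, a pair scores 1 only when the region and the content agree perfectly in every target frame, so localization and recognition errors both lower the score.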
[0017] According to one embodiment of the present invention, evaluating the performance of an OCR system based on each first recognition accuracy parameter includes: calculating a second recognition accuracy parameter of the OCR system based on each first recognition accuracy parameter; and evaluating the performance of the OCR system based on the second recognition accuracy parameter.
[0018] According to one embodiment of the present invention, calculating a second recognition accuracy parameter of the OCR system based on each first recognition accuracy parameter includes: calculating the second recognition accuracy parameter of the OCR system according to a second formula, wherein the second formula is:
[0019]
[0020] TotalAcc is the second recognition accuracy parameter.
[0021] To achieve the above objectives, a second aspect of the present invention provides a performance evaluation device for an OCR system. The device includes: a recognition module, used to input multiple video frames into the OCR system for text recognition to obtain $N_D$ text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer; an annotation module, used to obtain $N_G$ text annotation targets based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer; a matching module, used to match the $N_D$ text recognition targets with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$; a calculation module, used to calculate the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair; and an evaluation module, used to evaluate the performance of the OCR system based on each first recognition accuracy parameter.
[0022] According to an embodiment of the present invention, the performance evaluation device of the OCR system inputs, through the recognition module, multiple video frames into the OCR system for text recognition to obtain $N_D$ text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer; obtains, through the annotation module, $N_G$ text annotation targets based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer; matches, through the matching module, the $N_D$ text recognition targets with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$; calculates, through the calculation module, the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair; and evaluates, through the evaluation module, the performance of the OCR system based on each first recognition accuracy parameter. In this way, the text recognition results and the text annotation results of the same captured scene in a video are each aggregated, and the aggregated text recognition results are then matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections can be identified. Thus, by decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is realized.
[0023] To achieve the above objectives, a third aspect of the present invention provides an electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor, wherein when the processor executes the program, it implements the performance evaluation method of the OCR system of the first aspect embodiment.
[0024] With the aforementioned electronic device, the text recognition results and the text annotation results of the same captured scene in a video can each be aggregated, and the aggregated text recognition results are then matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections can be identified. Thus, by decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is realized.
[0025] To achieve the above objectives, a fourth aspect of the present invention provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the performance evaluation method of the OCR system of the first aspect embodiment.
[0026] With the aforementioned computer-readable storage medium, the text recognition results and the text annotation results of the same captured scene in a video can each be aggregated, and the aggregated text recognition results are then matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections can be identified. Thus, by decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is realized.
[0027] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Description of the Drawings
[0028] Figure 1 is a flowchart illustrating an embodiment of a performance evaluation method for an OCR system provided by the present invention;
[0029] Figure 2 is a flowchart illustrating a method for matching text recognition targets with text annotation targets according to an embodiment of the present invention;
[0030] Figure 3 is a flowchart illustrating a method for calculating the text content matching degree according to an embodiment of the present invention;
[0031] Figure 4 is a flowchart illustrating a second embodiment of a performance evaluation method for an OCR system provided by the present invention;
[0032] Figure 5 is a schematic diagram of the structure of a performance evaluation device for an OCR system provided in an embodiment of the present invention;
[0033] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0034] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0035] The following description, with reference to the accompanying drawings, outlines the performance evaluation method, apparatus, device, and readable storage medium for the OCR system proposed in this invention.
[0036] Figure 1 is a flowchart illustrating a performance evaluation method for an OCR system according to an embodiment of the present invention. The executing entity of this embodiment can be any electronic device with processing capabilities. As shown in Figure 1, the performance evaluation method for the OCR system provided in this embodiment may include the following steps:
[0037] S101, input multiple video frames into the OCR system for text recognition to obtain $N_D$ text recognition targets.
[0038] Each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer.
[0039] Specifically, video can be captured by an image acquisition device, and multiple video frames can be determined from the video. The obtained multiple video frames are input into an OCR system for text recognition to obtain the text recognition result corresponding to each video frame. Each text recognition result may include a first text region and the first text content within the first text region. From these text recognition results, texts with the same first text content and whose first text regions are within a preset area error range are identified as the same text recognition target.
[0040] Specifically, many video frames in the same video may show the same captured scene, and differences among the text recognition results of the video frames belonging to the same captured scene reflect the performance of the OCR system. Therefore, by identifying results with the same first text content in the text recognition results of different video frames as the same text recognition target, the text recognition results of the same captured scene can be aggregated, so that the aggregated text recognition results can subsequently be compared with the text annotation results.
[0041] For example, suppose that after the OCR system performs text recognition on video frame 5, it determines that the text recognition result for video frame 5 includes "beautiful flowers" (first text content) located at position 1 (first text region) and "blue sky and white clouds" located at position 2; after the OCR system performs text recognition on video frame 7, it determines that the text recognition result for video frame 7 includes "beautiful flowers" located at a similar position to position 1 (i.e., within the preset area error range of position 1); after the OCR system performs text recognition on video frame 9, it determines that the text recognition result for video frame 9 includes "beautiful flowers" located at a similar position to position 1 (i.e., within the preset area error range of position 1) and "blue sky and white clouds" located at a similar position to position 2 (i.e., within the preset area error range of position 2); after the OCR system performs text recognition on video frame 10, it determines that the text recognition result for video frame 10 includes "blue sky and white clouds" located at a similar position to position 2 (i.e., within the preset area error range of position 2).
[0042] Therefore, the text recognition results for "beautiful flowers" in video frame 5, video frame 7, and video frame 9 can be identified as the same text recognition target "beautiful flowers" (text recognition target 1); similarly, the text recognition results for "blue sky and white clouds" in video frame 5, video frame 9, and video frame 10 can also be identified as the same text recognition target "blue sky and white clouds" (text recognition target 2).
[0043] Therefore, text recognition target 1 corresponds to: "beautiful flowers" located at position 1 in video frame 5, "beautiful flowers" located at a similar position to position 1 in video frame 7, and "beautiful flowers" located at a similar position to position 1 in video frame 9; text recognition target 2 corresponds to: "blue sky and white clouds" located at position 2 in video frame 5, "blue sky and white clouds" located at a similar position to position 2 in video frame 9, and "blue sky and white clouds" located at a similar position to position 2 in video frame 10.
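The grouping in this example can be sketched as below. The sketch is illustrative only: the source does not define how the "preset region error range" is tested, so the per-coordinate tolerance in `within_error`, the box coordinates, and the names `Detection` and `aggregate` are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # assumed (x1, y1, x2, y2)
Detection = Tuple[int, str, Box]          # (frame index, text content, region)


def within_error(a: Box, b: Box, tol: float) -> bool:
    """Hypothetical region test: every coordinate differs by at most `tol`."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def aggregate(detections: List[Detection], tol: float = 5.0) -> List[Dict]:
    """Group per-frame results with identical text and nearby regions."""
    targets: List[Dict] = []
    for frame, text, box in sorted(detections, key=lambda d: d[0]):
        for target in targets:
            if target["text"] == text and within_error(target["ref_box"], box, tol):
                target["regions"][frame] = box
                break
        else:  # no existing target fits, so start a new one
            targets.append({"text": text, "ref_box": box, "regions": {frame: box}})
    return targets


# Per-frame OCR results mirroring the example (coordinates are made up).
results = [
    (5, "beautiful flowers", (10, 10, 110, 30)),
    (5, "blue sky and white clouds", (10, 200, 150, 220)),
    (7, "beautiful flowers", (12, 11, 111, 31)),
    (9, "beautiful flowers", (9, 10, 109, 29)),
    (9, "blue sky and white clouds", (11, 201, 152, 221)),
    (10, "blue sky and white clouds", (10, 199, 149, 219)),
]
targets = aggregate(results)
```

The same routine could be applied to the per-frame annotation results to obtain the text annotation targets, since the grouping rule stated for both is the same.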
[0044] S102, obtain $N_G$ text annotation targets based on the text annotation results for each video frame.
[0045] Each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer.
[0046] Specifically, after determining multiple video frames, each video frame can be annotated using annotation tools or manual annotation to obtain text annotation results corresponding to each video frame. Each text annotation result can include a second text region and the second text content within the second text region. From these text annotation results, texts with the same second text content and whose second text regions are within the preset area error range are identified as the same text annotation target.
[0047] For example, suppose that after text annotation of video frame 5, the text annotation results corresponding to video frame 5 include "beautiful flowers" (second text content) at position 1 (second text area), "blue sky and white clouds" at position 2, and "distant houses" at position 3; after text annotation of video frame 7, the text annotation results corresponding to video frame 7 include "beautiful flowers" at a similar position to position 1 (i.e., within the preset area error range of position 1) and "distant houses" at a similar position to position 3 (i.e., within the preset area error range of position 3); after text annotation of video frame 9, the text annotation results corresponding to video frame 9 include "beautiful flowers" at a similar position to position 1 (i.e., within the preset area error range of position 1) and "blue sky and white clouds" at a similar position to position 2 (i.e., within the preset area error range of position 2); after text annotation of video frame 10, the text annotation results corresponding to video frame 10 include "blue sky and white clouds" at a similar position to position 2 (i.e., within the preset area error range of position 2).
[0048] Therefore, the annotations "beautiful flowers" in the text annotation results corresponding to video frame 5, video frame 7, and video frame 9 can be identified as the same text annotation target "beautiful flowers" (text annotation target 1); the annotations "blue sky and white clouds" in the text annotation results corresponding to video frame 5, video frame 9, and video frame 10 can be identified as the same text annotation target "blue sky and white clouds" (text annotation target 2); and the annotations "distant houses" in the text annotation results corresponding to video frame 5 and video frame 7 can be identified as the same text annotation target "distant houses" (text annotation target 3).
[0049] Therefore, text annotation target 1 corresponds to: "beautiful flowers" located at position 1 in video frame 5, "beautiful flowers" located at a similar position to position 1 in video frame 7, and "beautiful flowers" located at a similar position to position 1 in video frame 9; text annotation target 2 corresponds to: "blue sky and white clouds" located at position 2 in video frame 5, "blue sky and white clouds" located at a similar position to position 2 in video frame 9, and "blue sky and white clouds" located at a similar position to position 2 in video frame 10; text annotation target 3 corresponds to: "distant houses" located at position 3 in video frame 5 and "distant houses" located at a similar position to position 3 in video frame 7.
[0050] S103, match the $N_D$ text recognition targets with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs.
[0051] Each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$.
[0052] Specifically, the $N_D$ text recognition targets obtained in S101 and the $N_G$ text annotation targets obtained in S102 can be matched, so that each text recognition target is matched with a text annotation target on both text content and text position. However, since some text recognition targets may not be matched with any text annotation target, $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$.
[0053] S104, calculate the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair.
[0054] Specifically, since each text recognition target corresponds to the text recognition result of at least one video frame, and each text annotation target corresponds to the text annotation result of at least one video frame, by calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, the recognition accuracy of each text recognition target in the corresponding different video frames can be calculated, that is, the cross-frame detection and recognition accuracy of each text recognition target can be calculated.
[0055] S105, evaluate the performance of the OCR system based on each first recognition accuracy parameter.
[0056] Specifically, the text recognition performance of an OCR system can be evaluated by the accuracy of cross-frame detection and recognition of each text target obtained through text recognition by the OCR system.
[0057] In the performance evaluation method for the OCR system provided in this invention, multiple video frames are input into the OCR system for text recognition to obtain $N_D$ text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and $N_D$ is a positive integer; $N_G$ text annotation targets are obtained based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and $N_G$ is a positive integer; the $N_D$ text recognition targets are matched with the $N_G$ text annotation targets to obtain $N_{mapped}$ matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and $N_{mapped}$ is less than or equal to $\min(N_D, N_G)$; a first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair is calculated; and the performance of the OCR system is evaluated based on each first recognition accuracy parameter. In this way, the text recognition results and the text annotation results of the same captured scene in a video are each aggregated, and the aggregated text recognition results are then matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections can be identified. By decomposing the text detection, text tracking, and text recognition processes, a high-precision evaluation of the OCR system in video scenarios is realized.
[0058] In some embodiments, matching the N_D text recognition targets with the N_G text annotation targets to obtain N_mapped matching pairs may include: calculating the distance between each text recognition target and each of the N_G text annotation targets, and performing matching based on the distances to obtain the N_mapped matching pairs.
[0059] Specifically, the distance between a text recognition target and a text annotation target can characterize the correlation between them. Text recognition targets and text annotation targets with a high correlation can be matched to obtain the N_mapped matching pairs.
[0060] Specifically, assuming we have 20 text recognition targets (text recognition target 1, text recognition target 2... text recognition target 20) and 25 text annotation targets (text annotation target 1, text annotation target 2... text annotation target 25), we calculate the distance between each text recognition target and the 25 text annotation targets as follows: calculate the distance between text recognition target 1 and text annotation target 1, the distance between text recognition target 1 and text annotation target 2, ... the distance between text recognition target 1 and text annotation target 25, calculate the distance between text recognition target 2 and text annotation target 1, the distance between text recognition target 2 and text annotation target 2, ... the distance between text recognition target 20 and text annotation target 1, the distance between text recognition target 20 and text annotation target 2, ... the distance between text recognition target 20 and text annotation target 25.
[0061] In this embodiment of the invention, the distance between each text recognition target and each of the N_G text annotation targets is calculated, and matching is performed based on the distances to obtain the N_mapped matching pairs. The correlation between a text recognition target and a text annotation target can therefore be determined from the distance between them, and the text recognition targets and text annotation targets with higher correlation are matched to obtain matching pairs, thereby improving the accuracy of text tracking.
[0062] In some embodiments, calculating the distance between each text recognition target and each of the N_G text annotation targets may include: for each video frame, calculating the overlap between the first text region corresponding to text recognition target j and the second text region corresponding to text annotation target i, where j = 1, 2, ..., N_D and i = 1, 2, ..., N_G; and accumulating these overlaps over all video frames and taking the accumulated overlap as the distance between text recognition target j and text annotation target i.
[0063] Specifically, the overlap of text regions between the text recognition target and the text annotation target can characterize the distance between them. Therefore, the overlap of text regions between the text recognition target and the text annotation target can also characterize the correlation between them. If the correlation is large, the text recognition target and the text annotation target can be matched to obtain a matching pair.
[0064] Specifically, the overlap between the text regions corresponding to a text recognition target and a text annotation target in each video frame can be determined by calculating the overlap between the first text region corresponding to the text recognition target and the second text region corresponding to the text annotation target in the same video frame. The overlaps calculated between the same text recognition target and the same text annotation target in all video frames are then summed to obtain the accumulated overlap between each text recognition target and each text annotation target.
[0065] Specifically, in order to reduce the amount of computation, it can be determined in advance whether the first text region corresponding to the text recognition target and the second text region corresponding to the text annotation target overlap in each video frame. If they do, the intersection-over-union (IoU, i.e. the overlap ratio) between the overlapping first text region and second text region is calculated, and the IoU values between the same text recognition target and the same text annotation target in all video frames are accumulated to obtain the accumulated overlap between each text recognition target and each text annotation target.
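The overlap check followed by the intersection-over-union calculation can be sketched as follows. This is a minimal illustration, assuming that text regions are axis-aligned rectangles given as (x_min, y_min, x_max, y_max); the names `Box`, `boxes_overlap` and `iou` are hypothetical, and the source does not fix a region representation.

```python
from typing import Tuple

# Assumed layout of a text region: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Cheap pre-check: True only if the two rectangles share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    if not boxes_overlap(a, b):
        return 0.0  # no overlap, so the area arithmetic is skipped
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Two 10x10 boxes that share half of their area.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Rotated or polygonal text regions would need a different intersection routine; only the accumulation over frames described above would stay the same.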
[0066] In this embodiment of the invention, for each video frame, the overlap between the first text region corresponding to text recognition target j and the second text region corresponding to text annotation target i is calculated, where j = 1, 2, ..., N_D and i = 1, 2, ..., N_G, and the overlap accumulated over all video frames is used as the distance between text recognition target j and text annotation target i. The correlation between text recognition targets and text annotation targets can therefore be determined from the overlap, and text recognition targets and text annotation targets with greater correlation are matched to obtain matching pairs, thereby improving the accuracy of text tracking.
[0067] Specifically, the Hungarian algorithm can be used to perform the matching based on the distances, thereby obtaining matching pairs with a higher degree of matching. For example, a matrix M of size (N_G, N_D) can be defined, whose element M_ij represents the distance between the i-th text annotation target and the j-th text recognition target. The matching method shown in Figure 2 can be used to calculate the value of each element of matrix M, which then serves as the basis for matching the text recognition targets with the text annotation targets. As shown in Figure 2, the method provided by this invention for matching text recognition targets with text annotation targets may include the following steps:
[0068] S201, initialize all elements in matrix M to 0.
[0069] S202, traverse the video starting from the first video frame.
[0070] S203, determine whether there are overlapping first and second text regions.
[0071] If yes, then execute S204; otherwise, execute S206.
[0072] S204, calculate the intersection-over-union between the overlapping first and second text regions.
[0073] S205, add the intersection-over-union to the corresponding element M_ij.
[0074] S206, determine if this is the last video frame.
[0075] If yes, then execute S207; otherwise, execute S208.
[0076] S207, use the Hungarian algorithm to perform matching on the distances and obtain the matching pairs.
[0077] S208, fetch the next video frame.
[0078] Specifically, after executing S208, the process returns to execute S203.
[0079] In this embodiment of the invention, the distances between the text recognition targets and the text annotation targets are calculated through the matching method described above and are matched using the Hungarian algorithm, thereby obtaining matching pairs with a high degree of matching.
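Steps S201 to S208 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: each target is represented as a hypothetical mapping from frame index to an axis-aligned box, the accumulated overlap is treated as a score to be maximized, and pairs whose accumulated overlap is 0 are discarded. To keep the sketch free of external dependencies, the assignment step uses an exhaustive search in place of the Hungarian algorithm; it finds the same optimum but is only practical for a handful of targets.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)
Track = Dict[int, Box]                   # hypothetical: frame index -> region of one target


def iou(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:                 # S203: the regions do not overlap
        return 0.0
    inter = w * h                        # S204: intersection-over-union
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def build_matrix(annotated: List[Track], recognized: List[Track]) -> List[List[float]]:
    """M[i][j] = IoU between annotation target i and recognition target j, summed over frames."""
    m = [[0.0] * len(recognized) for _ in annotated]      # S201
    for i, g in enumerate(annotated):
        for j, d in enumerate(recognized):
            # Loops per pair rather than per frame; the accumulated sums are identical.
            for frame in g.keys() & d.keys():             # frames containing both targets
                m[i][j] += iou(g[frame], d[frame])        # S205
    return m


def match(m: List[List[float]]) -> List[Tuple[int, int]]:
    """S207 stand-in: exhaustive search for the assignment with the largest total overlap."""
    n_g = len(m)
    n_d = len(m[0]) if m else 0
    if n_g == 0 or n_d == 0:
        return []
    if n_g <= n_d:
        candidates = (list(zip(range(n_g), cols)) for cols in permutations(range(n_d), n_g))
    else:
        candidates = (list(zip(rows, range(n_d))) for rows in permutations(range(n_g), n_d))
    best = max(candidates, key=lambda pairs: sum(m[i][j] for i, j in pairs))
    # Assumption: a pair that never overlaps is not a real match, so N_mapped may be < min(N_D, N_G).
    return [(i, j) for i, j in best if m[i][j] > 0]


annotated = [
    {0: (0, 0, 10, 10), 1: (1, 0, 11, 10)},
    {0: (50, 50, 60, 60), 1: (50, 50, 60, 60)},
]
recognized = [
    {0: (50, 50, 60, 60), 1: (51, 50, 61, 60)},
    {0: (0, 0, 10, 10), 1: (1, 0, 11, 10)},
    {1: (200, 200, 210, 210)},           # a false detection with no annotated counterpart
]
pairs = match(build_matrix(annotated, recognized))
print(pairs)  # (annotation index, recognition index) pairs
```

For realistic numbers of targets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment(m, maximize=True)` would be the natural replacement for `match`.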
[0080] In some embodiments, calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair may include: determining at least one target video frame in which the text recognition target in matching pair k matches the text annotation target in matching pair k, where k = 1, 2, ..., N_mapped; for each target video frame, calculating the product of the overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k, and the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k; and, based on each product and the number of target video frames, calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k.
[0081] Specifically, the at least one target video frame in which the text recognition target in each matching pair matches the text annotation target in that matching pair can be determined from the above text recognition and text annotation results; alternatively, it can be determined from the above matching process. For example, assuming that matching pair 2 includes text recognition target 2 and text annotation target 4, it can be determined that text recognition target 2 matches text annotation target 4 in video frame 1, video frame 10, and video frame 20.
[0082] Specifically, for each target video frame, the degree of overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k can characterize the positional correlation between the text recognition target and the text annotation target in matching pair k in each target video frame.
[0083] Furthermore, for each target video frame, the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k can characterize the recognition accuracy of the text content of the text recognition target and the text annotation target in each target video frame.
[0084] Therefore, by multiplying the overlap between the first text region corresponding to the text recognition target in the matching pair k and the second text region corresponding to the text annotation target in the matching pair k in each target video frame, and the matching degree between the first text content corresponding to the text recognition target in the matching pair k and the second text content corresponding to the text annotation target in the matching pair k in each target video frame, we can simultaneously obtain the positional correlation of the text recognition target and the text annotation target in the matching pair k in each target video frame and the recognition accuracy of the text content in each target video frame.
[0085] Specifically, based on the aforementioned product corresponding to each target video frame and the number of target video frames, a first recognition accuracy parameter between the text recognition target and the text annotation target in the matching pair k can be calculated. This first recognition accuracy parameter can characterize the ability of the OCR system to recognize each text recognition target.
[0086] Specifically, for each target video frame, the second text content corresponding to the text annotation target in matching pair k can be used as a benchmark. The first text content corresponding to the text recognition target in matching pair k is then compared with it, and the matching degree between the text content is determined based on the comparison result. In practice, through comparison, it can be determined whether there are missing characters, extra recognized characters, or misrecognized characters in the first text content. The matching degree between the text content can be determined based on at least one of the missing characters, extra recognized characters, and misrecognized characters.
[0087] For example, the method for calculating the text content matching degree shown in Figure 3 can be used to obtain, for each target video frame, the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k. The method for calculating the text content matching degree provided in this embodiment may include the following steps:
[0088] S301, for each target video frame, the second text content corresponding to the text annotation target in the matching pair k is used as the benchmark, and the first text content corresponding to the text recognition target in the matching pair k is compared with it.
[0089] S302, Determine whether there are any missing characters in the first text content.
[0090] If yes, then execute S303; otherwise, execute S304.
[0091] S303, determine the number of missing characters as A.
[0092] Where A is an integer greater than 0.
[0093] S304, the number of missing characters is determined to be 0.
[0094] Specifically, after executing S303 or S304, S305 is executed.
[0095] S305, determine whether there are extra-recognized characters in the first text content.
[0096] If yes, then execute S306; otherwise, execute S307.
[0097] S306, determine the number of extra-recognized characters as D.
[0098] Where D is an integer greater than 0.
[0099] S307, Determine the number of extra-recognized characters to be 0.
[0100] Specifically, after executing S306 or S307, S308 is executed.
[0101] S308, Determine whether there are misidentified characters in the first text content.
[0102] If yes, then execute S309; otherwise, execute S310.
[0103] S309, determine the number of misidentified characters as F.
[0104] Where F is an integer greater than 0.
[0105] S310, determine that the number of misidentified characters is 0.
[0106] Specifically, after executing S309 or S310, S311 is executed.
[0107] S311, calculate the number of characters in the second text content as S.
[0108] Where S is an integer greater than 0.
[0109] S312, determine whether the sum of the number of missing characters, extra recognized characters, and misrecognized characters is less than the number of characters in the second text content.
[0110] If yes, then execute S313; otherwise, execute S314.
[0111] S313, calculate the matching degree as follows:
[0112] S314, the matching degree is calculated to be 0.
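Steps S301 to S314 can be sketched as follows. The formula of S313 is not reproduced in this text, so the sketch assumes the form (S - A - D - F) / S, which is consistent with the condition in S312 and with the value 0 in S314 but is not confirmed by the source. The character alignment is obtained with Python's `difflib.SequenceMatcher`, which is only one possible way to count missing, extra-recognized, and misrecognized characters; the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def count_errors(reference: str, recognized: str) -> Tuple[int, int, int]:
    """Return (A, D, F): missing, extra-recognized and misrecognized character counts."""
    missing = extra = wrong = 0
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, rec_len = i2 - i1, j2 - j1
        if tag == "delete":      # present in the annotation, absent from the OCR output
            missing += ref_len
        elif tag == "insert":    # present only in the OCR output
            extra += rec_len
        elif tag == "replace":   # substituted span, possibly of unequal length
            wrong += min(ref_len, rec_len)
            missing += max(ref_len - rec_len, 0)
            extra += max(rec_len - ref_len, 0)
    return missing, extra, wrong


def matching_degree(reference: str, recognized: str) -> float:
    """Matching degree of the recognized text against the annotated text (the benchmark)."""
    s = len(reference)                              # S311
    if s == 0:
        raise ValueError("the annotated text must contain at least one character")
    a, d, f = count_errors(reference, recognized)   # S302 to S310
    if a + d + f < s:                               # S312
        return (s - a - d - f) / s                  # S313, ASSUMED form of the missing formula
    return 0.0                                      # S314


print(matching_degree("HELLO", "HELO"))   # one missing character out of five
print(matching_degree("ABC", "XYZQ"))     # more errors than reference characters
```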
[0113] In this embodiment of the invention, at least one target video frame is determined in which the text recognition target in matching pair k matches the text annotation target in matching pair k, where k = 1, 2, ..., N_mapped. For each target video frame, the product of the overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k, and the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k, is calculated. Based on each product and the number of target video frames, the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k is calculated. Thus, the recognition accuracy of each text recognition target in its corresponding target video frames can be calculated, i.e., the cross-frame detection and recognition accuracy of each text recognition target can be calculated.
[0114] In some embodiments, calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in the matching pair k based on each product and the number of target video frames may include: calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in the matching pair k according to a first formula, wherein the first formula is:
[0115]
[0116] where SingleAcc_k is the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k, and the remaining symbols of the first formula denote, respectively: the number of target video frames in which the text recognition target in matching pair k is matched; the first text region corresponding to the text recognition target in matching pair k within target video frame t; the second text region corresponding to the text annotation target in matching pair k within target video frame t; and the matching degree between the first text content corresponding to the text recognition target and the second text content corresponding to the text annotation target in target video frame t.
[0117] In this embodiment of the invention, the first recognition accuracy parameter between the text recognition target and the text annotation target in the matching pair k is calculated by the first formula described above. Thus, the recognition accuracy of each text recognition target in the corresponding target video frame can be calculated by the first formula described above, that is, the cross-frame detection and recognition accuracy of each text recognition target can be calculated.
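The first formula itself is not reproduced in this text. One reading that agrees with the description above (a per-frame product of overlap and matching degree, combined with the number of target video frames) is the mean of those products over the target video frames. The sketch below implements that assumed reading; `single_acc` and the sample values are hypothetical.

```python
from typing import List, Tuple


def single_acc(per_frame: List[Tuple[float, float]]) -> float:
    """per_frame holds one (overlap, matching_degree) tuple per target video frame."""
    if not per_frame:
        raise ValueError("a matching pair needs at least one target video frame")
    products = [overlap * degree for overlap, degree in per_frame]
    # ASSUMPTION: the products are averaged over the target video frames.
    return sum(products) / len(products)


# A pair matched in three target video frames (made-up values):
# well localized and fully read, partly read, and localized but misread.
frames = [(0.9, 1.0), (0.8, 0.5), (0.6, 0.0)]
print(single_acc(frames))
```

Under this reading, a frame in which the box is well placed but the text is entirely wrong contributes nothing, so detection quality and recognition quality both have to be good for a target to score highly.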
[0118] In some embodiments, the above-described evaluation of the performance of the OCR system based on each first recognition accuracy parameter may include: calculating a second recognition accuracy parameter of the OCR system based on each first recognition accuracy parameter; and evaluating the performance of the OCR system based on the second recognition accuracy parameter.
[0119] Specifically, a second recognition accuracy parameter characterizing the text recognition performance of the OCR system can be calculated by statistically analyzing the cross-frame recognition accuracy parameters of each text recognition target obtained by the OCR system.
[0120] In this embodiment of the invention, a second recognition accuracy parameter of the OCR system is calculated using each first recognition accuracy parameter; the performance of the OCR system is then evaluated based on the second recognition accuracy parameter. Thus, by statistically analyzing the cross-frame recognition accuracy parameters of each text recognition target obtained by the OCR system, a second recognition accuracy parameter characterizing the text recognition performance of the OCR system can be calculated, thereby achieving high-precision evaluation of the OCR system in video scenarios.
[0121] In some embodiments, calculating the second recognition accuracy parameter of the OCR system based on each first recognition accuracy parameter may include: calculating the second recognition accuracy parameter of the OCR system according to a second formula, wherein the second formula is:
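[0122] $$\mathrm{TotalAcc} = \frac{\sum_{k=1}^{N_{mapped}} \mathrm{SingleAcc}_k}{(N_D + N_G)/2}$$
[0123] where TotalAcc is the second recognition accuracy parameter, SingleAcc_k is the first recognition accuracy parameter of matching pair k, N_D is the number of text recognition targets, and N_G is the number of text annotation targets.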
[0124] Specifically, the second recognition accuracy parameter of the OCR system can be obtained by summing the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and then dividing by the average of the number of text recognition targets and the number of text annotation targets. In the second formula above, the numerator reflects the detection and recognition accuracy of the matching targets. The higher the detection and recognition accuracy value, the higher the overall score of the OCR system, indicating better recognition performance. However, when multiple detections and missed detections occur, the numerator remains unchanged, but the denominator increases significantly, leading to a decrease in the overall score of the OCR system.
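The calculation described above (the sum of the first recognition accuracy parameters divided by the average of the number of text recognition targets and the number of text annotation targets) can be sketched as follows. The function and parameter names are hypothetical, and the scores are made-up values chosen only to show how unmatched targets lower the result.

```python
from typing import Sequence


def total_acc(single_accs: Sequence[float], n_recognized: int, n_annotated: int) -> float:
    """Sum of per-pair scores divided by the mean of N_D and N_G."""
    if n_recognized <= 0 or n_annotated <= 0:
        raise ValueError("N_D and N_G must be positive integers")
    if len(single_accs) > min(n_recognized, n_annotated):
        raise ValueError("N_mapped cannot exceed min(N_D, N_G)")
    return sum(single_accs) / ((n_recognized + n_annotated) / 2)


scores = [0.9, 0.8]  # first recognition accuracy parameters of two matching pairs

# Every target is matched: N_D = N_G = 2.
clean = total_acc(scores, n_recognized=2, n_annotated=2)

# Same pairs, but the OCR system also reports three spurious targets: N_D = 5.
with_extras = total_acc(scores, n_recognized=5, n_annotated=2)

print(clean, with_extras)
```

The numerator is identical in both calls; only the denominator grows, which is how multiple detections and missed detections reduce the overall score.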
[0125] Specifically, the performance level of the OCR system and the correspondence between the performance level and the second recognition accuracy parameter of the OCR system can be preset. Then, after obtaining the second recognition accuracy parameter of the OCR system, the current performance level of the OCR system can be determined according to the above correspondence.
[0126] Specifically, the second recognition accuracy parameter can be compared with a preset threshold. If the second recognition accuracy parameter is greater than or equal to the preset threshold, the performance of the OCR system is determined to be "pass"; if the second recognition accuracy parameter is less than the preset threshold, the performance of the OCR system is determined to be "fail".
[0127] Specifically, by calculating the second recognition accuracy parameter of the OCR system before and after an improvement and comparing the two values, it can also be determined whether the improvement has raised the text recognition performance of the OCR system.
[0128] Specifically, by calculating the second recognition accuracy parameter of different OCR systems and comparing them, it can be determined which OCR system has better text recognition performance.
[0129] In this embodiment of the invention, the second recognition accuracy parameter of the OCR system is calculated using the second formula. Therefore, by decomposing the text detection, text tracking, and text recognition processes, high-precision evaluation of the OCR system in video scenarios can be achieved.
[0130] As a concrete example, Figure 4 is a flowchart of a second embodiment of the performance evaluation method for an OCR system provided by the present invention. As shown in Figure 4, the performance evaluation method for the OCR system provided in this embodiment may include the following steps:
[0131] S401, input multiple video frames into the OCR system for text recognition to obtain N_D text recognition targets, together with the text detection box in at least one video frame corresponding to each text recognition target and the first text content within that text detection box.
[0132] The text detection box mentioned above is the first text region mentioned earlier.
[0133] S402, based on the text annotation results for each video frame, obtain N_G text annotation targets, together with the text annotation box in at least one video frame corresponding to each text annotation target and the second text content within that text annotation box.
[0134] The text annotation box mentioned above is the second text region mentioned earlier.
[0135] S403, calculate the distance between each text recognition target and each of the N_G text annotation targets, and use the Hungarian algorithm to perform matching on the distances, thus obtaining N_mapped matching pairs of text recognition targets and text annotation targets.
[0136] Specifically, the matching method shown in Figure 2 above can be used to calculate the value of each element of matrix M, which serves as the basis for matching the text recognition targets with the text annotation targets.
[0137] S404, determine the at least one target video frame in which the text recognition target in each matching pair matches the text annotation target in that matching pair.
[0138] S405, using the first formula, calculate the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair.
[0139] Specifically, the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair is calculated using the first formula, based on the overlap between the text detection box corresponding to the text recognition target in each matching pair and the text annotation box corresponding to the text annotation target in the matching pair, and the matching degree between the first text content corresponding to the text recognition target in each matching pair and the second text content corresponding to the text annotation target in each matching pair.
[0140] S406, using the second formula, calculate the second recognition accuracy parameter of the OCR system based on each first recognition accuracy parameter.
[0141] S407, evaluate the performance of the OCR system based on the second recognition accuracy parameter.
[0142] In this embodiment of the invention, the text recognition results and text annotation results of the same shooting scene in the video can be collected separately, and then the collected text recognition results and text annotation results are matched to realize text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, text detection for missed detection, false detection, and multiple detection is realized. Thus, by decomposing the text detection, text tracking, and text recognition processes, high-precision evaluation based on the OCR system in video scenarios is achieved.
[0143] Figure 5 is a schematic diagram of the structure of a performance evaluation device for an OCR system provided in an embodiment of the present invention. As shown in Figure 5, the performance evaluation device 50 of the OCR system may include: a recognition module 510, an annotation module 520, a matching module 530, a calculation module 540, and an evaluation module 550.
[0144] The recognition module 510 can be used to input multiple video frames into the OCR system for text recognition to obtain N_D text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and N_D is a positive integer; the annotation module 520 can be used to obtain N_G text annotation targets based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and N_G is a positive integer; the matching module 530 can be used to match the N_D text recognition targets with the N_G text annotation targets to obtain N_mapped matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and N_mapped is less than or equal to min(N_D, N_G); the calculation module 540 can be used to calculate the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair; and the evaluation module 550 can be used to evaluate the performance of the OCR system based on each first recognition accuracy parameter.
[0145] In the performance evaluation device for the OCR system provided in this embodiment of the invention, the recognition module inputs multiple video frames into the OCR system for text recognition to obtain N_D text recognition targets, wherein each text recognition target corresponds to at least one first text region and the first text content within that first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and N_D is a positive integer; the annotation module obtains N_G text annotation targets based on the text annotation results for each video frame, wherein each text annotation target corresponds to at least one second text region and the second text content within that second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and N_G is a positive integer; the matching module matches the N_D text recognition targets with the N_G text annotation targets to obtain N_mapped matching pairs, wherein each matching pair includes one text recognition target and one text annotation target, and N_mapped is less than or equal to min(N_D, N_G); the calculation module calculates the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair; and the evaluation module evaluates the performance of the OCR system based on each first recognition accuracy parameter. In this way, the text recognition results and the text annotation results for the same captured scene in a video can each be aggregated, and the aggregated text recognition results can then be matched with the aggregated text annotation results, thereby achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair and evaluating the performance of the OCR system based on each first recognition accuracy parameter, missed detections, false detections, and multiple detections of text are reflected in the evaluation. Thus, by decomposing the text detection, text tracking, and text recognition processes, high-precision evaluation of the OCR system in video scenarios is realized.
[0146] In addition, corresponding to the performance evaluation method of the OCR system provided in the above embodiments, this invention also provides an electronic device. As shown in Figure 6, the electronic device 60 may include: a memory 610, a processor 620, and a program stored in the memory 610 and executable on the processor 620. When the processor 620 executes the program, it implements all the steps of the performance evaluation method of the OCR system provided in this embodiment of the invention.
[0147] In the aforementioned electronic device, the text recognition results and text annotation results of the same captured scene in the video can be collected separately, and then the collected text recognition results are matched with the text annotation results to achieve text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, text detection of missed detections, false detections, and multiple detections is achieved. Thus, by decomposing the text detection, text tracking, and text recognition processes, high-precision evaluation based on the OCR system in video scenarios is realized.
[0148] In addition, corresponding to the performance evaluation method of the OCR system provided in the above embodiments, the present invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements all the steps of the performance evaluation method of the OCR system of the present invention.
[0149] The aforementioned computer-readable storage medium allows for the collection of text recognition results and text annotation results from the same captured scene in a video. These are then combined, and the collected text recognition results are matched with the text annotation results, thus achieving text recognition and text tracking. By calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each matching pair, and evaluating the performance of the OCR system based on each first recognition accuracy parameter, text detection of missed detections, false detections, and multiple detections is achieved. Thus, by decomposing the text detection, text tracking, and text recognition processes, high-precision evaluation based on the OCR system in video scenarios is realized.
[0150] It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein can, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.
[0151] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0152] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0153] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0154] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0155] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A performance evaluation method for an OCR system, characterized in that the method includes: inputting multiple video frames into the OCR system for text recognition to obtain a number of text recognition targets, wherein each text recognition target corresponds to at least one first text region and a first text content within the first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and the number of text recognition targets is a positive integer; obtaining, based on the text annotation results for each video frame, a number of text annotation targets, wherein each text annotation target corresponds to at least one second text region and a second text content within the second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and the number of text annotation targets is a positive integer; matching the text recognition targets with the text annotation targets to obtain a number of matching pairs, wherein each matching pair includes a text recognition target and a text annotation target, and the number of matching pairs is less than or equal to the smaller of the number of text recognition targets and the number of text annotation targets; calculating a first recognition accuracy parameter between the text recognition target and the text annotation target in each of the matching pairs; and evaluating the performance of the OCR system based on each of the first recognition accuracy parameters; wherein calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in each of the matching pairs includes: determining at least one target video frame in which the text recognition target in matching pair k matches the text annotation target in matching pair k, where k = 1, 2, … up to the number of matching pairs; for each target video frame, calculating the product of the overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k, and the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k; and calculating, based on the products obtained for the target video frames and the number of target video frames, the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k.
2. The method according to claim 1, characterized in that matching the text recognition targets with the text annotation targets to obtain the matching pairs includes: calculating the distance between each text recognition target and each text annotation target, and performing the matching based on the distances to obtain the matching pairs.
3. The method according to claim 2, characterized in that calculating the distance between each text recognition target and each text annotation target includes: for each video frame, calculating the overlap between the first text region corresponding to text recognition target j and the second text region corresponding to text annotation target i, where j = 1, 2, … up to the number of text recognition targets and i = 1, 2, … up to the number of text annotation targets; and taking the overlap between the first text region corresponding to text recognition target j and the second text region corresponding to text annotation target i, accumulated over the video frames, as the distance between text recognition target j and text annotation target i.
4. The method according to claim 3, characterized in that the matching based on the distances is performed using the Hungarian algorithm.
5. The method according to claim 1, characterized in that calculating, based on the products and the number of target video frames, the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k includes: calculating the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k according to a first formula defined over the following quantities: the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k; the number of target video frames in which the text recognition target in matching pair k is matched; the first text region corresponding to the text recognition target in matching pair k within target video frame t; the second text region corresponding to the text annotation target in matching pair k within target video frame t; and the matching degree between the first text content corresponding to the text recognition target and the second text content corresponding to the text annotation target in target video frame t.
6. The method according to claim 5, characterized in that evaluating the performance of the OCR system based on each of the first recognition accuracy parameters includes: calculating a second recognition accuracy parameter of the OCR system based on each of the first recognition accuracy parameters; and evaluating the performance of the OCR system based on the second recognition accuracy parameter.
7. The method according to claim 6, characterized in that calculating the second recognition accuracy parameter of the OCR system based on each of the first recognition accuracy parameters includes: calculating the second recognition accuracy parameter of the OCR system according to a second formula, the quantity given by the second formula being the second recognition accuracy parameter.
8. A performance evaluation device for an OCR system, characterized in that the device includes: a recognition module, used to input multiple video frames into the OCR system for text recognition and obtain a number of text recognition targets, wherein each text recognition target corresponds to at least one first text region and a first text content within the first text region, the first text content corresponding to each text recognition target is the same, the first text regions corresponding to each text recognition target are within a preset region error range, and the number of text recognition targets is a positive integer; an annotation module, used to obtain, based on the text annotation results for each video frame, a number of text annotation targets, wherein each text annotation target corresponds to at least one second text region and a second text content within the second text region, the second text content corresponding to each text annotation target is the same, the second text regions corresponding to each text annotation target are within a preset region error range, and the number of text annotation targets is a positive integer; a matching module, used to match the text recognition targets with the text annotation targets to obtain a number of matching pairs, wherein each matching pair includes a text recognition target and a text annotation target, and the number of matching pairs is less than or equal to the smaller of the number of text recognition targets and the number of text annotation targets; a calculation module, used to calculate a first recognition accuracy parameter between the text recognition target and the text annotation target in each of the matching pairs; and an evaluation module, used to evaluate the performance of the OCR system based on each of the first recognition accuracy parameters; wherein the calculation module is further configured to: determine at least one target video frame in which the text recognition target in matching pair k matches the text annotation target in matching pair k, where k = 1, 2, … up to the number of matching pairs; for each target video frame, calculate the product of the overlap between the first text region corresponding to the text recognition target in matching pair k and the second text region corresponding to the text annotation target in matching pair k, and the matching degree between the first text content corresponding to the text recognition target in matching pair k and the second text content corresponding to the text annotation target in matching pair k; and calculate, based on the products obtained for the target video frames and the number of target video frames, the first recognition accuracy parameter between the text recognition target and the text annotation target in matching pair k.
9. An electronic device, characterized in that it includes: a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that it stores a program that, when executed by a processor, implements the method according to any one of claims 1-7.
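The per-pair calculation in claims 1 and 5 can be illustrated with a short sketch. The claims state only that, for each target video frame, the region overlap is multiplied by the content matching degree, and that the result depends on those products and on the number of target video frames. Everything beyond that is an assumption made here for illustration: the products are averaged over the target frames, overlap is measured as intersection over union, the matching degree is the similarity ratio from Python's `difflib.SequenceMatcher`, and a target video frame is taken to be any frame in which both targets have a region. The names `first_accuracy`, `content_match`, and `Observation` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); hypothetical layout
Observation = Tuple[Box, str]            # region and text content in one frame


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (assumed overlap measure)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def content_match(rec_text: str, ann_text: str) -> float:
    """Similarity in [0, 1] between recognised and annotated text.
    Assumption: the source does not say how the matching degree is computed."""
    return SequenceMatcher(None, rec_text, ann_text).ratio()


def first_accuracy(rec: Dict[int, Observation], ann: Dict[int, Observation]) -> float:
    """Per-pair accuracy: mean over target frames of overlap * matching degree.
    Assumption: averaging is one plausible way to combine the products with
    the number of target video frames."""
    target_frames = sorted(rec.keys() & ann.keys())
    if not target_frames:
        return 0.0
    total = 0.0
    for t in target_frames:
        (rec_box, rec_text), (ann_box, ann_text) = rec[t], ann[t]
        total += iou(rec_box, ann_box) * content_match(rec_text, ann_text)
    return total / len(target_frames)


if __name__ == "__main__":
    recognised = {0: ((0, 0, 10, 10), "ABC123"), 1: ((0, 0, 10, 10), "ABC123")}
    annotated = {0: ((0, 0, 10, 10), "ABC123"), 1: ((0, 0, 10, 5), "ABC123"),
                 2: ((0, 0, 10, 5), "ABC123")}
    print(first_accuracy(recognised, annotated))  # 0.75
```

In the usage example, frame 0 contributes 1.0 (identical region and text) and frame 1 contributes 0.5 (half the region overlaps, text identical), so the pair scores 0.75. Frame 2 is ignored because the recognition target has no region there.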