Proofreading task detection method and device, electronic equipment and storage medium

By utilizing reference attribute information for error pre-setting and automated detection during the digitization of ancient books, the problem of poor OCR recognition of handwritten characters was solved, achieving efficient and accurate quality detection and optimization of proofreading tasks, and improving the overall performance of the proofreading system.

CN122244885A · Pending · Publication Date: 2026-06-19 · BEIJING ZITIAO NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date
2024-12-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, the OCR recognition effect of handwritten text in the process of digitizing ancient books is not good, resulting in a heavy workload for manual proofreading and difficulty in ensuring accuracy. The number of professional personnel is limited, and it is impossible to effectively guarantee the proofreading quality.

Method used

By determining the reference attribute information of the target proofreading task, including text correction rate, correction omissions, and correction doubts, and using the reference text in the optical character recognition results to preset errors, the error types in the proofreading task are simulated, thereby achieving automated quality detection and optimizing the proofreading process.

Benefits of technology

It improved the accuracy and efficiency of proofreading tasks, reduced omissions and uncertainties, enhanced the performance of the text recognition and proofreading system, and reduced the error rate.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a proofreading task detection method, apparatus, electronic device, and storage medium. The method includes: determining reference attribute information for a target proofreading task, the reference attribute information including at least one of first attribute information, second attribute information, and third attribute information; the first attribute information indicating the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task; the second attribute information indicating the text correction omissions when performing text correction on multiple second reference texts; and the third attribute information indicating the text correction doubts when performing text correction on multiple second reference texts; and performing proofreading task quality detection on the target proofreading task based on the reference attribute information. This disclosure can detect low-quality proofreading tasks in which the proofreader works carelessly or inattentively, or blindly retains the machine-recognized characters.

Description

Technical Field

[0001] This disclosure relates to the field of data processing technology, and in particular to a proofreading task detection method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the development of information technology, the digitization of ancient books has emerged, transforming traditional paper-based ancient books into digital content. Text recognition in ancient books involves converting the text in images into character codes. It typically relies on OCR (Optical Character Recognition) technology, which can be applied to various scenarios, including web page text, printed text, handwritten text, and scene text. However, recognition of handwritten text is generally poor, and ancient books often contain a large amount of handwritten, illegible, or incomplete text. OCR on this type of text often fails to meet digitization requirements, so manual proofreading is needed. This task is highly specialized. The sheer scale of ancient book digitization makes the manual proofreading workload very heavy, while the number of professionals in the field is limited. Many of those involved in manual proofreading are not professionals, making it difficult to guarantee the accuracy of the results.

Summary of the Invention

[0003] This disclosure provides a proofreading task detection method, apparatus, electronic device, and storage medium to detect low-quality proofreading tasks in which the proofreader works carelessly or inattentively, or blindly retains the machine-recognized characters.

[0004] In a first aspect, embodiments of this disclosure provide a proofreading task detection method, the method comprising:

[0005] Determining reference attribute information for a target proofreading task. The reference attribute information includes at least one of first attribute information, second attribute information, and third attribute information. The first attribute information is used to indicate the text correction rate when the target proofreading task is performed on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is formed by performing error presetting on a first reference text. The second attribute information is used to indicate the text correction omissions when the target proofreading task is performed on multiple second reference texts. The third attribute information is used to indicate the text correction doubts when the target proofreading task is performed on multiple second reference texts. The first reference text and the second reference text are texts in the target text recognition result obtained by the optical character recognition corresponding to the target proofreading task.

[0006] The proofreading task quality is checked based on the reference attribute information.

[0007] In a second aspect, embodiments of this disclosure also provide a proofreading task detection apparatus, the apparatus comprising:

[0008] A determination module, used to determine reference attribute information for a target proofreading task. The reference attribute information includes at least one of first attribute information, second attribute information, and third attribute information. The first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is formed by performing error presetting on a first reference text. The second attribute information is used to indicate the text correction omissions when performing text correction on multiple second reference texts. The third attribute information is used to indicate the text correction doubts when performing text correction on multiple second reference texts. The first reference text and the second reference text are texts in the target text recognition result obtained by the optical character recognition corresponding to the target proofreading task.

[0009] The detection module is used to perform proofreading task quality detection on the target proofreading task based on the reference attribute information.

[0010] In a third aspect, embodiments of this disclosure also provide an electronic device, the electronic device comprising:

[0011] At least one processor; and

[0012] A memory communicatively connected to the at least one processor; wherein,

[0013] The memory stores a computer program executable by the at least one processor, and the computer program, when executed, enables the at least one processor to perform the proofreading task detection method described in any of the above embodiments.

[0014] In a fourth aspect, embodiments of this disclosure also provide a computer-readable medium storing computer instructions that, when executed by a processor, implement the proofreading task detection method described in any of the above embodiments.

[0015] In the technical solution of this disclosure, the first attribute information is calculated from the preset erroneous texts, which contain known errors, and the proofread output texts produced after the target proofreading task is performed on them. This helps to accurately quantify the ability to detect and correct errors during the proofreading process, allows targeted improvements to the proofreading task, increases the accuracy of text correction, and avoids the omission of erroneous text that should have been corrected. The second attribute information indicates the text correction omissions when the target proofreading task is performed on multiple second reference texts. It helps to clarify whether undiscovered and uncorrected errors remain in the proofreading task as a whole, reducing missed errors during actual proofreading and improving the completeness and accuracy of the proofreading results. The third attribute information indicates the text correction doubts when text correction is performed on multiple second reference texts. It improves the accuracy of the proofreading task by preventing questionable but actually erroneous text from being retained as correct, and by preventing correct text from being unnecessarily altered as if it were incorrect. It also gives a clear picture of how uncertain text was handled during proofreading, so that one can assess whether it was handled appropriately and avoid harm to correction quality and reliability. By using at least one of the above kinds of attribute information, quality detection for proofreading tasks can be completed automatically and systematically. This allows for targeted optimization and improvement of the proofreading task, continuously enhancing the performance and accuracy of the entire text recognition and proofreading system and reducing the error rate.

[0016] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0017] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and that components and elements are not necessarily drawn to scale.

[0018] Figure 1 is a schematic flowchart of a proofreading task detection method provided in an embodiment of this disclosure;

[0019] Figure 2 is a schematic diagram of a process for performing error presetting on optical character recognition text during a proofreading task, provided in an embodiment of this disclosure;

[0020] Figure 3 is a schematic flowchart of another proofreading task detection method provided in an embodiment of this disclosure;

[0021] Figure 4 is a schematic flowchart of another proofreading task detection method provided in an embodiment of this disclosure;

[0022] Figure 5 is a schematic diagram of a process for calculating the text correction omission rate during proofreading task detection, provided in an embodiment of this disclosure;

[0023] Figure 6 is a schematic diagram of a process for calculating image similarity during proofreading task detection, provided in an embodiment of this disclosure;

[0024] Figure 7 is a schematic diagram of the construction process of a character accuracy prediction model used during proofreading task detection, provided in an embodiment of this disclosure;

[0025] Figure 8 is a schematic flowchart of another proofreading task detection method provided in an embodiment of this disclosure;

[0026] Figure 9 is a schematic diagram of a process for calculating the text correction doubt rate during proofreading task detection, provided in an embodiment of this disclosure;

[0027] Figure 10 is a schematic structural diagram of a proofreading task detection apparatus provided in an embodiment of this disclosure;

[0028] Figure 11 is a schematic structural diagram of an electronic device that implements a proofreading task detection method according to an embodiment of this disclosure.

Detailed Implementation

[0029] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0030] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.

[0031] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0032] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0033] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0034] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0035] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0036] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.

[0037] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0038] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0039] Figure 1 is a schematic flowchart of a proofreading task detection method provided in an embodiment of this disclosure. This embodiment is applicable to detecting the proofreading quality of manual proofreading tasks performed after OCR text recognition. The proofreading task detection method can be executed by a proofreading task detection apparatus, which can be implemented in software and / or hardware and is generally integrated into any electronic device with a network communication function, such as a mobile terminal, PC, or server.

[0040] As shown in Figure 1, the proofreading task detection method of this embodiment may include the following process:

[0041] S110. Determine the reference attribute information for the target proofreading task. The reference attribute information includes at least one of the first attribute information, the second attribute information, and the third attribute information. The first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is a text formed by performing error pre-setting on the first reference text. The second attribute information is used to indicate the text correction omission when performing text correction on multiple second reference texts. The third attribute information is used to indicate the text correction doubt when performing text correction on multiple second reference texts. The first reference text and the second reference text are the texts in the target text recognition result obtained by performing optical character recognition corresponding to the target proofreading task.

[0042] In this process, the first and second reference texts can be text directly identified from text images using optical character recognition (OCR) technology. These first and second reference texts serve as the starting point for the entire proofreading task. Subsequent evaluations of the text correction rate based on the first reference text, as well as the handling of doubts and omissions in text correction when performing the target proofreading task on the second reference text, all revolve around these first and second reference texts. For example, the first and second reference texts can be text content recognized from scanned paper document text images using OCR. The first and second reference texts are the text in the target text recognition results obtained through OCR corresponding to the target proofreading task.

[0043] Referring to Figure 2, a preset error text is the text formed by presetting an error in a first reference text before the target proofreading task is performed. The preset error set includes several preset error texts. Each preset error text is formed by deliberately presetting an error in a first reference text obtained by performing optical character recognition (OCR) on the text image. For example, if the first reference text recognized by OCR is "天", it may be preset as "夫" to serve as the preset error text. Together, these preset error texts constitute the preset error set, whose purpose is to set "traps" in the proofreading task and detect whether the preset error texts are corrected during the task. A second reference text is a reference text, among those corrected in the target proofreading task, that participates in the evaluation of text correction doubts and text correction omissions.

[0044] In an optional example, the reference text recognized from the text image by optical character recognition technology is obtained first. This step usually means running OCR on the input text image. For example, after a page of a paper book is scanned, the text content of the page obtained by OCR can be used as the reference text. Multiple first reference texts can be selected from the recognized reference texts, and error presetting can be performed on them according to common error types such as similar-shaped character errors, homophone errors, and grammar errors. For example, under the similar-shaped character rule, the first reference text "已" is changed to "己", forming the preset error text corresponding to that first reference text. Optionally, error presetting can be implemented by writing code or by using a dedicated error presetting tool. All the texts produced by error presetting together form the preset error set. Multiple second reference texts can also be selected from the recognized reference texts. After the target proofreading task has been performed and text correction completed, the text correction doubts and text correction omissions arising from correcting the second reference texts can be evaluated.
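The error presetting step described above can be sketched as follows. This is only an illustrative sketch, not the disclosure's implementation: the function names, the random sampling strategy, and the `SIMILAR_SHAPE` table are assumptions, and only the 天→夫 and 已→己 pairs are taken from the text.

```python
import random

# Hypothetical table of similar-shaped characters. Only 天→夫 and 已→己
# appear in the text; 人→入 is added for illustration.
SIMILAR_SHAPE = {"天": "夫", "已": "己", "人": "入"}


def preset_errors(reference_texts, sample_size, seed=0):
    """Select up to `sample_size` first reference texts that have a
    similar-shaped substitute and build the preset error set.

    Returns {position: (first_reference_text, preset_error_text)}.
    """
    rng = random.Random(seed)
    eligible = [i for i, ch in enumerate(reference_texts) if ch in SIMILAR_SHAPE]
    chosen = rng.sample(eligible, min(sample_size, len(eligible)))
    return {
        i: (reference_texts[i], SIMILAR_SHAPE[reference_texts[i]])
        for i in sorted(chosen)
    }


def apply_presets(reference_texts, preset_set):
    """Return the text shown to the proofreader, with the traps inserted."""
    shown = list(reference_texts)
    for i, (_, wrong) in preset_set.items():
        shown[i] = wrong
    return shown


ocr_result = list("天地已人")               # OCR output, one character per item
preset_set = preset_errors(ocr_result, 2)   # two traps
task_input = apply_presets(ocr_result, preset_set)
```

Keeping the original character next to each substituted one lets the later comparison step check whether the proofreader restored it.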

[0045] Forming preset error texts by presetting errors in the first reference texts deliberately creates known error situations, which provides the conditions for accurately detecting proofreading quality afterwards. It simulates the error types that can occur in actual OCR recognition, making the proofreading task more challenging and more targeted. Without error presetting, proofreading may amount to simply confirming the OCR recognition result, and potential problems are hard to discover. With preset error texts, it is possible to test clearly whether the proofreader or system can recognize and correct these specific errors, and thus to evaluate proofreading ability effectively. By comparing the preset error texts with the proofreading output texts, the execution of the proofreading task can be understood comprehensively.

[0046] The reference attribute information can be a set of indicators that comprehensively reflect the execution of the target proofreading task. It covers multiple attributes closely related to text proofreading quality. By analyzing the preset error texts corresponding to the first reference texts together with the proofreading output texts produced after the target proofreading task corrects them, the proportion of preset error texts that are correctly corrected can be determined statistically (text correction rate). By analyzing the text correction performed on multiple second reference texts during the target proofreading task, it is possible to determine statistically whether errors were missed or left uncorrected (text correction omissions) and whether doubtful text modifications were not properly handled during proofreading (text correction doubts). This allows proofreading quality to be measured better, weaknesses in the proofreading process to be identified, and proofreading strategies and the overall OCR recognition and proofreading effect to be improved further.

[0047] The target proofreading task can be a pre-defined specific task that requires proofreading the target text recognition results. The target text is typically the text obtained through optical character recognition (OCR) of text images. This target proofreading task can be carried out in various scenarios, such as book publishing proofreading, electronic document proofreading, and legal document proofreading, aiming to ensure the accuracy and reliability of text content and eliminate errors that may arise from OCR recognition and mistakes made during manual proofreading.

[0048] The first attribute information measures the proportion of preset error texts corresponding to the first reference texts that are correctly corrected during the target proofreading task. It directly reflects the target proofreading task's ability to detect and correct errors in the preset error texts. For example, if there are 100 preset error texts corresponding to first reference texts, and 80 are correctly corrected after proofreading, then the text correction rate is 80%.

[0049] In an optional example, each preset error text corresponding to a first reference text can be compared, item by item, with the proofreading output text produced after that preset error text is proofread, to check whether it was corrected. For example, the first reference texts may be "[天 (夫), 地 (他), 人 (入)]" (sky, earth, people; the text in parentheses is the preset error text: husband, he, enter), and the corresponding proofreading output texts may be "[天, 地, 入]" in sequence. The number of preset error texts that are correctly corrected is then divided by the total number of preset error texts to obtain the text correction rate, and thus the first attribute information. In the above example, 2 out of 3 preset error texts are correctly corrected ("夫" and "他" are corrected, "入" is not), so the text correction rate is 2 / 3 ≈ 66.7%.
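The calculation in the three-item example above can be written as a short sketch. The function name is hypothetical, and the sketch assumes that "correctly corrected" means the proofreading output equals the first reference text from before error presetting.

```python
def text_correction_rate(first_reference, proofread_output):
    """Fraction of preset error texts that were restored to the
    first reference text (assumed definition of 'correctly corrected')."""
    if len(first_reference) != len(proofread_output):
        raise ValueError("inputs must be aligned one-to-one")
    if not first_reference:
        raise ValueError("at least one preset error text is required")
    corrected = sum(
        ref == out for ref, out in zip(first_reference, proofread_output)
    )
    return corrected / len(first_reference)


first_reference = ["天", "地", "人"]  # sky, earth, people (before presetting)
preset_error = ["夫", "他", "入"]     # husband, he, enter (shown to proofreader)
proofread = ["天", "地", "入"]        # proofreading output: the third trap was missed

rate = text_correction_rate(first_reference, proofread)  # 2 of 3 corrected
```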

[0050] The second attribute information can indicate, after the target proofreading task has been performed on the multiple second reference texts for text correction, the words with optical character recognition errors in those texts that were not discovered and corrected. In other words, it describes the number or proportion of words with OCR errors that were missed. For example, if 10 of 100 words in the second reference texts contain OCR errors, and 5 of those 10 are not corrected during the execution of the target proofreading task, the second attribute information reflects that these 5 words were left uncorrected.
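A minimal sketch of counting omissions, using the figures from the example above (10 words with OCR errors, 5 left uncorrected). This passage does not say how the OCR errors are identified, so the sketch assumes a trusted `ground_truth` sequence is available for comparison. That input, the function name, and the synthetic data are all hypothetical.

```python
def correction_omissions(ocr_text, proofread_text, ground_truth):
    """Return (omitted_positions, total_ocr_errors).

    An omission is a position where OCR was wrong and the proofreader
    left the OCR output unchanged. `ground_truth` is an assumed input.
    """
    if not (len(ocr_text) == len(proofread_text) == len(ground_truth)):
        raise ValueError("inputs must be aligned one-to-one")
    error_positions = [
        i for i, (ocr, truth) in enumerate(zip(ocr_text, ground_truth))
        if ocr != truth
    ]
    omitted = [i for i in error_positions if proofread_text[i] == ocr_text[i]]
    return omitted, len(error_positions)


# Synthetic data: 100 words, OCR errors at positions 0..9,
# of which the proofreader fixes 0..4 and overlooks 5..9.
truth = [f"w{i}" for i in range(100)]
ocr = [f"x{i}" if i < 10 else w for i, w in enumerate(truth)]
proofread = [truth[i] if i < 5 else ocr[i] for i in range(100)]

omitted, total_errors = correction_omissions(ocr, proofread, truth)
omission_rate = len(omitted) / total_errors  # 5 of 10 errors missed
```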

[0051] The third attribute information can describe cases where, after text correction is performed on the multiple second reference texts during the target proofreading task, there is doubt about whether the corrected words are correct, and either the result of the handling is unclear or the handling method is disputed. For example, a corrected word may have several possible correct forms but is simply marked as uncertain during proofreading and not examined further; this situation is recorded by the third attribute information. Alternatively, a second reference text may have been corrected during the target proofreading task but the correction itself is wrong; this case can also be used to determine the third attribute information.

[0052] Through the reference attribute information, the proofreading quality of the target proofreading task can be accurately quantified. Specifically, the first attribute information uses concrete numerical values to reflect the accuracy and effectiveness of proofreading intuitively, which makes it easy to compare and evaluate different proofreading tasks or different proofreaders. The second attribute information helps to discover omissions in the proofreading process and promptly identify overlooked errors. This is crucial for the quality of the final text, especially in fields with extremely high accuracy requirements such as legal documents and academic works, where any omission can lead to serious consequences. The third attribute information reveals whether the attitude and methods used to handle uncertain factors during proofreading are reasonable, since improper handling of questionable text may affect the credibility and professionalism of the text. This attribute information allows for targeted improvement and standardization, raising the overall proofreading level and text quality.

[0053] S120. Perform proofreading task quality inspection on the target proofreading task based on reference attribute information.

[0054] The reference attribute information can be a set of key indicators that comprehensively reflect the performance of a proofreading task, including the text correction rate (first attribute information), text correction omissions (second attribute information), and text correction doubts (third attribute information). These kinds of attribute information describe, from different perspectives, how erroneous text was handled during the proofreading process.

[0055] For example, the first attribute information indicates a text correction rate of 70%, the second attribute information shows 5 omissions in text correction, and the third attribute information records 3 questionable text corrections. Then, these reference attribute information are evaluated according to preset detection standards or thresholds. Different application scenarios and proofreading requirements will have different standards. For example, in some high-precision academic publication proofreading, a text correction rate of at least 90% may be required, with no more than 1 omission, and questionable cases must be explained in detail and properly handled. If the reference attribute information of the current target proofreading task does not meet these standards, it means that the proofreading task has quality problems. Next, the evaluation results determine whether the target proofreading task is qualified. If the text correction rate is too low, there are too many omissions, and / or questionable cases are not handled properly, then the target proofreading task may be judged as unqualified. For example, if the text correction rate is only 50%, and there are many omissions and unprocessed questionable text, then obviously the target proofreading task has failed to meet the expected quality requirements. Finally, if the proofreading task is satisfactory, relevant data can be recorded for subsequent statistical analysis and quality traceability; if it is unsatisfactory, the problems need to be explained in detail, such as which errors were not corrected, which questionable situations were not handled properly, etc., so as to make targeted improvements and optimizations to the target proofreading task.
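The threshold comparison described above can be sketched as follows. The class and function names are hypothetical. The default thresholds reuse the academic publication example (correction rate of at least 90%, at most 1 omission), while treating "properly handled" doubts as "zero unresolved doubts" is an assumption made only for this sketch. Attributes are optional because the method needs only at least one of them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReferenceAttributes:
    correction_rate: Optional[float] = None   # first attribute information
    omissions: Optional[int] = None           # second attribute information
    unresolved_doubts: Optional[int] = None   # third attribute information


@dataclass
class Thresholds:
    min_correction_rate: float = 0.90
    max_omissions: int = 1
    max_unresolved_doubts: int = 0  # assumed reading of "properly handled"


def check_quality(
    attrs: ReferenceAttributes, limits: Optional[Thresholds] = None
) -> Tuple[bool, List[str]]:
    """Return (qualified, problems); attributes left as None are skipped."""
    limits = limits or Thresholds()
    problems = []
    if (attrs.correction_rate is not None
            and attrs.correction_rate < limits.min_correction_rate):
        problems.append(f"correction rate {attrs.correction_rate:.0%} is too low")
    if attrs.omissions is not None and attrs.omissions > limits.max_omissions:
        problems.append(f"{attrs.omissions} omissions exceed the limit")
    if (attrs.unresolved_doubts is not None
            and attrs.unresolved_doubts > limits.max_unresolved_doubts):
        problems.append(f"{attrs.unresolved_doubts} doubts were not resolved")
    return (not problems, problems)


# The task from the example: 70% correction rate, 5 omissions, 3 doubts.
qualified, problems = check_quality(ReferenceAttributes(0.70, 5, 3))
```

Returning the list of problems alongside the verdict matches the requirement that an unqualified task be explained in detail so that it can be improved.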

[0056] By comparing the various indicators in the reference attribute information with preset standards, the completion quality of the target proofreading task can be accurately judged. This helps to promptly identify weaknesses in the proofreading process, whether it's insufficient proofreading task execution capability, an imperfect proofreading process, or systemic problems in OCR recognition itself. If certain types of errors are frequently overlooked, specific checks for these error types can be added to the proofreading process; if the handling of questionable text is chaotic, standardized procedures and processes for handling questionable text can be established. This continuously improves the efficiency and accuracy of proofreading work, reducing unnecessary duplication of work and errors. This reference attribute information-based detection method effectively reduces text error rates, improves document quality, and avoids various problems caused by text errors.

[0057] In the technical solution of this disclosure, the first attribute information is calculated from the preset erroneous text, which contains deliberately introduced errors, and the proofread output text, which is the text output after the target proofreading task is performed on the preset erroneous text. This helps to accurately quantify the ability to detect and correct errors during the proofreading process, so that the proofreading task can be improved in a targeted way, increasing the accuracy of text correction and avoiding the omission of erroneous text that should have been corrected. For example, a high text correction rate indicates that the proofreading task is effective in correcting errors, while a low rate suggests that the proofreading may be careless or lacking in capability. The second attribute information indicates the omission of text correction when the target proofreading task is performed on multiple second reference texts, helping to clarify whether any errors in the entire proofreading task went undetected and uncorrected. This reduces the occurrence of uncorrected errors during actual proofreading and improves the completeness and accuracy of the proofreading results. In practical applications, overlooking critical errors can lead to serious consequences; identifying these omissions enables timely supplementary proofreading and improvement, enhancing the completeness and reliability of the proofreading. The third attribute information indicates questionable text corrections when text correction is performed on multiple second reference texts, improving the accuracy of the proofreading task. It avoids retaining questionable but actually erroneous text as correct content, or mistakenly treating correct text as incorrect and making unnecessary changes. This provides a clear picture of how uncertain text is handled during the proofreading process, enabling assessment of whether uncertain factors are handled appropriately and preventing improper handling of questionable text from affecting the quality and credibility of the correction. By using at least one of the above kinds of attribute information, quality detection for the proofreading task can be completed automatically and systematically. This allows for targeted optimization and improvement of the proofreading task, continuously enhancing the performance and accuracy of the entire text recognition and proofreading system and reducing the error rate.

[0058] Figure 3 is a flowchart of another proofreading task detection method provided by an embodiment of the present disclosure. Building on the technical solution of the foregoing embodiment, this embodiment further optimizes the process of determining the reference attribute information of the target proofreading task. This embodiment can be combined with various optional solutions in one or more of the foregoing embodiments.

[0059] As shown in Figure 3, the proofreading task detection method of the embodiment of the present disclosure may include the following process:

[0060] S310. Determine the preset error set and the proofreading output set corresponding to the target proofreading task. The preset error set includes at least one preset error word for performing the target proofreading task, and the proofreading output set includes the proofreading output words corresponding to each preset error word after the target proofreading task is performed on at least one preset error word in the preset error set.

[0061] Referring to Figure 2, the proofreading output set is the set of words output after the target proofreading task has been performed on the preset error words in the preset error set; it contains the proofreading output word corresponding to each preset error word. That is, while the target proofreading task is performed, each preset error word in the preset error set is proofread, meaning it is corrected, replaced or confirmed, and the words finally output form the proofreading output set. For example, if the preset error word "夫" (fū) in the preset error set is corrected back to "天" (tiān) after proofreading, this "天" is the word in the proofreading output set that corresponds to "夫".

[0062] In an optional example, the target proofreading task can be performed by a manual proofreader, who checks, modifies and confirms the target text recognition result word by word according to their own knowledge and experience, or by an automated proofreading system according to preset proofreading rules and algorithms. In either case, each preset error word in the preset error set is also checked, modified and confirmed word by word. For example, on encountering "ji" (a preset error word) from the preset error set, the proofreader modifies it to "yi" according to the context and correct language knowledge; this modified "yi" becomes part of the proofreading output set, and the set of proofread words finally obtained is the proofreading output set.

[0063] By pre-setting errors in the first reference text to create a pre-set error set, known error scenarios can be proactively generated. This provides the conditions for subsequent accurate detection and proofreading quality, simulating the error types that may occur in actual OCR recognition, making the proofreading task more challenging and targeted. Without error pre-setting, proofreading might simply be a matter of confirming the OCR recognition result, making it difficult to discover potential problems. However, with the pre-set error set, it is possible to clearly test whether proofreaders or the system can identify and correct these specific errors, thereby effectively evaluating their proofreading capabilities. By comparing the pre-set error set and the proofreading output set, a comprehensive understanding of the proofreading task's execution can be obtained.

[0064] As an optional but not limited implementation, determining the preset error set and proofreading output set corresponding to the target proofreading task includes the following steps A1-A2:

[0065] Step A1: Obtain at least one first reference character from the target character recognition results corresponding to the target proofreading task. The optical character recognition confidence level of the first reference character is greater than the preset recognition confidence level.

[0066] Step A2: Replace each of the first reference characters included in at least one first reference character with similar-looking characters to obtain the preset error characters associated with each first reference character, so as to obtain the preset error set.

[0067] The first reference text can be reference text identified from a text image through optical character recognition (OCR). The optical character recognition confidence level corresponding to these first reference texts is greater than a preset recognition confidence level, which means that these first reference texts have relatively high recognition reliability. For example, most of the text identified by OCR from a clear scanned page of a book can be used as the first reference text here if its recognition confidence level meets the requirements.

[0068] The preset recognition confidence level can be a pre-defined threshold used to filter out the first reference text with a high confidence level in optical character recognition. Only text with an optical character recognition confidence level greater than this preset confidence level will be identified as the first reference text. The setting of this preset recognition confidence level needs to be determined based on the specific application scenario and the accuracy requirements. For example, in academic document proofreading scenarios where text accuracy is extremely important, the preset recognition confidence level may be set higher, such as 90%; while in ordinary document processing scenarios, it may be set at around 70%.

[0069] Among them, the optical character recognition confidence is an evaluation index of the accuracy of the OCR technology's recognition result, reflecting the probability or credibility of the correctness of the characters recognized by OCR. It is usually calculated based on various factors, such as the matching degree between the image features of the characters and the character features in the training model, the influence of the surrounding text environment, etc. For example, for a clearly printed and interference-free text, its recognition confidence may be as high as 95%, while for some blurred or partially occluded texts, the confidence may be lower.

[0070] In an optional example, OCR recognition is performed on a text image containing text (for example, a scanned copy of a contract is imported into OCR software for recognition). The process of performing OCR recognition on the text image includes at least one of image preprocessing (such as grayscale conversion, noise reduction, binarization, etc.), image feature extraction (extracting features such as strokes and contours of the text), and character classification and recognition (matching the extracted features against the characters in the character library). After the OCR recognition result is obtained, the text with an optical character recognition confidence greater than the preset recognition confidence can be screened out as the first reference text. For example, OCR recognizes all the text in a document and gives the recognition confidence of each text; if the preset recognition confidence is 80%, only the text with a confidence greater than 80% is selected. For example, referring to Figure 2, the preset recognition confidence threshold during OCR recognition is θ1, and an OCR recognition result is regarded as correct when its confidence is greater than θ1. Before the target proofreading task is issued and executed, n OCR recognition results with an optical character recognition confidence greater than θ1 are obtained as first reference texts with high credibility, which are used to form the corresponding preset error texts for error presetting.
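The confidence screening of Step A1 can be sketched as follows. This is an illustration only: the input format (a list of character and confidence pairs) and the function name are assumptions, since the disclosure does not specify how the OCR engine reports its results.

```python
from typing import List, Tuple


def select_first_reference_texts(
    ocr_results: List[Tuple[str, float]], theta1: float
) -> List[Tuple[int, str]]:
    """Keep only characters whose OCR confidence is strictly greater than theta1.

    Returns (position, character) pairs so each first reference text can later
    be located in the target text recognition result.
    """
    return [
        (pos, ch)
        for pos, (ch, conf) in enumerate(ocr_results)
        if conf > theta1
    ]


# Hypothetical OCR output: (character, confidence).
ocr_results = [("天", 0.95), ("地", 0.80), ("人", 0.60)]
first_reference = select_first_reference_texts(ocr_results, theta1=0.80)
# first_reference == [(0, "天")]; 0.80 is not strictly greater than the threshold
```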

[0071] By screening the first reference text with a recognition confidence greater than the preset recognition confidence, the excessive interference factors introduced due to OCR recognition errors can be reduced. In this way, in subsequent tasks of replacing similar characters and proofreading, more focus can be placed on the detection and processing of specific types of errors (similar character errors), rather than being troubled by a large number of misrecognized texts with low confidence. For example, if confidence screening is not performed, there may be many low-confidence texts with obvious errors, such as misrecognizing '日' as '曰' with a very low confidence, which will make subsequent tasks of replacing similar characters and proofreading complicated and difficult to accurately evaluate the proofreading effect. Only processing the first reference text with high confidence can reduce unnecessary consumption of computing and processing resources, because low-confidence texts may require more manual intervention or complex secondary recognition processing, and discarding these texts can make the entire process run more efficiently and quickly enter the core link of the task of replacing similar characters and proofreading.

[0072] Among them, the substitution of similar-looking characters can be used for the text processing operation of replacing the first reference text with a preset incorrect text that is similar in font to the first reference text. Similar-looking characters refer to characters that are similar in font structure but may have different meanings and pronunciations. For example, "天" can be replaced with "夫", "未" can be replaced with "末", etc. By substituting similar-looking characters, incorrect texts can be deliberately introduced in the target proofreading task to detect the recognition and correction capabilities of such incorrect texts when performing the target proofreading task. For each first reference text obtained in the previous step, find the similar-looking characters of each first reference text. For example, for the first reference text "大", its similar-looking characters include "太", "犬", etc. Replace the first reference text with its similar-looking character to obtain the corresponding preset incorrect text. For example, replace "大" with "太", and this "太" is the preset incorrect text associated with "大". Repeat this process to perform substitution of similar-looking characters on all the first reference texts, and finally obtain a preset incorrect set formed by the preset incorrect texts corresponding to multiple first reference texts. For example, if the first reference texts are "天", "地", "人", after substitution of similar-looking characters, the preset incorrect set formed by the preset incorrect texts corresponding to each first reference text is "夫", "他", "入" (which are the results of substituting similar-looking characters for "天", "地", "人" respectively).

[0073] By substituting similar-looking characters, it is possible to effectively simulate a common type of error situation that may occur in the OCR recognition process. In actual OCR applications, due to factors such as font and image quality, confusion of similar-looking characters is a frequently occurring error type. By actively substituting similar-looking characters, a similar error environment can be created in the proofreading task, thereby testing the sensitivity and correction capabilities of the target proofreading task for such errors.

[0074] As an optional but non-limiting implementation, substituting similar-looking characters for each of the first reference texts included in at least one first reference text includes the following steps A21 - A22:

[0075] Step A21: For each first reference text, obtain a list of similar-looking characters of the first reference text from the similar-looking character mapping table, where the characters included in the list of similar-looking characters of the first reference text are similar in font to the first reference text and meet the preset font similarity.

[0076] Step A22: Replace the first reference text with the character at the random number serial number position in the list of similar-looking characters of the first reference text to obtain the preset incorrect text associated with each first reference text.

[0077] Among them, the similar-character mapping table can be a pre-constructed data table or data structure that stores a large amount of relationship information between similar characters of Chinese characters. The similar-character mapping table can associate and organize characters with similar glyph structures according to certain rules and algorithms, so that the similar characters of a certain Chinese character can be quickly queried and obtained when needed. For example, for the character "天" (tiān), in the similar-character mapping table, it may be associated with characters such as "夫" (fū), "夭" (yāo), "太" (tài), etc. These characters have a high degree of similarity with the character "天" in terms of strokes, radicals, structures, etc.

[0078] Among them, the preset glyph similarity can be a pre-set measurement standard or threshold used to determine which characters can be regarded as similar characters of the first reference character and included in its list of similar characters. This preset glyph similarity can be defined based on various glyph features, such as the range of differences in the number of strokes, the sameness or similarity of radicals, the degree of similarity of the overall structure, etc. For example, if it is set that characters with a difference in the number of strokes within 1 - 2 and the same or similar radicals are similar characters with the preset glyph similarity, then for the character "日" (rì), characters such as "目" (mù), "白" (bái) may be determined as its similar characters and included in the list of similar characters because they meet this condition, while the character "月" (yuè) may not meet the requirements due to a relatively large difference in strokes.

[0079] In an optional example, using each first reference character as a keyword, search in the similar-character mapping table for characters that meet the preset glyph similarity with the first reference character in terms of glyph. For example, when the first reference character is "木" (mù), search for the information of similar characters associated with "木" in the similar-character mapping table. According to the criteria of the preset glyph similarity, screen and match the characters in the mapping table, and only the characters that meet the requirements of the preset glyph similarity with the first reference character in terms of glyph will be included in the list of similar characters of this first reference character. For example, for the character "木", characters such as "本" (běn), "末" (mò), "未" (wèi) may be selected into its list of similar characters because they are similar to "木" in terms of strokes and structure, while the character "水" (shuǐ) will not be included because of the large glyph difference from "木".

[0080] By combining the use of the similar-character mapping table and the preset glyph similarity, the list of similar characters of each first reference character can be obtained efficiently and accurately, providing a rich and highly targeted text resource for subsequent error pre-setting (similar-character substitution), making the simulated errors more in line with the actual possible similar-character confusion situations. Screening similar characters based on the preset glyph similarity can adjust the accuracy and scope of error simulation according to different application requirements and text characteristics. For some fields with high requirements for text accuracy and a relatively high risk of similar-character confusion, a higher glyph similarity standard can be set, and only characters that are very similar to the first reference character are selected as similar characters, so as to more accurately simulate the possible errors in these key fields.

[0081] Among them, after obtaining the list of similar-looking characters of the first reference character, this random number serial number is used to determine which similar-looking character to select from the list to replace the first reference character, thereby increasing the randomness and uncertainty of the replacement operation and better simulating various possible error situations. For example, if there are 5 characters in the list of similar-looking characters of a certain first reference character and the generated random number serial number is 3, then the 3rd character in the list is selected to replace the first reference character.

[0082] In an optional example, after the list of similar-looking characters of the first reference character is obtained, a random number generator is used to generate a random number serial number; for example, the generated random number is within the range of 1 to the length of this list of similar-looking characters. Then, according to this random number serial number (such as [5, 10]), the preset error character corresponding to the first reference character can be taken out of the list of similar-looking characters, and the first reference character is replaced with the preset error character. For example, if the list of similar-looking characters of the first reference character "木" is ["本", "末", "未"] and the random number serial number is 2, the character "末" is selected to replace "木", and "末" is obtained as the preset error character associated with "木". By repeating this process for each first reference character, a series of preset error characters produced by similar-looking character replacement can be obtained.
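Steps A21 and A22 can be sketched in Python as below. The mapping table contents, the function names, and the decision to skip characters with no entry in the table are assumptions for illustration; only the 1-based random serial number lookup follows the description.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical similar-character mapping table; a real table would be far larger.
SIMILAR_CHARS: Dict[str, List[str]] = {
    "木": ["本", "末", "未"],
    "天": ["夫", "夭", "太"],
    "日": ["目", "白"],
}


def preset_error_for(ch: str, rng: random.Random) -> Optional[str]:
    """Pick the character at a random 1-based serial number in ch's similar list."""
    candidates = SIMILAR_CHARS.get(ch)
    if not candidates:
        return None  # no similar-looking characters known for this character
    serial = rng.randint(1, len(candidates))  # inclusive on both ends
    return candidates[serial - 1]


def build_preset_error_set(
    first_reference_chars: List[str], seed: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Return (first reference character, preset error character) pairs."""
    rng = random.Random(seed)
    pairs = []
    for ch in first_reference_chars:
        err = preset_error_for(ch, rng)
        if err is not None:
            pairs.append((ch, err))
    return pairs


pairs = build_preset_error_set(["木", "天", "水"], seed=42)
# "水" is skipped because it has no entry in the hypothetical table.
```

Passing a seed makes a run reproducible for auditing, while leaving it as `None` keeps the substitutions unpredictable between tasks.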

[0083] Using the random number serial number to select similar-looking characters for replacement avoids the fixed-mode error setting, making the preset error characters generated each time have uncertainty. In this way, during the proofreading process, it can more realistically simulate the random similar-looking character errors that may occur in actual OCR recognition, increasing the difficulty and comprehensiveness of the proofreading task. By performing such random similar-looking character replacement on each first reference character to obtain preset error characters and proofreading these preset error characters, the proofreading ability can be evaluated from multiple dimensions. The correction rate, omission rate, and doubtful handling situation of similar-looking character errors during the proofreading process can be counted, so as to deeply understand the recognition, correction, and handling abilities of similar-looking character errors when performing the target proofreading task.

[0084] As an optional but non-limiting implementation manner, after performing similar-looking character replacement on each of the first reference characters included in at least one first reference character to obtain the preset error characters associated with each first reference character, the following steps are further included:

[0085] Bind a preset character attribute to the preset error character corresponding to each first reference character obtained by performing similar-looking character replacement on each first reference character. The preset character attribute is used to screen and extract each preset error character participating in the evaluation of the character correction rate from at least one preset error character corresponding to the target proofreading task.

[0086] Among them, the preset text attribute can be an attribute identifier or feature marker that is preset and assigned to the preset incorrect text obtained by performing a substitution of similar-looking characters on the first reference text. By binding the preset text attribute to the preset incorrect text, a correlation relationship is established between the preset text attribute and the preset incorrect text, enabling the preset incorrect text to carry this attribute information. This is similar to adding a tag with a specific meaning to each preset incorrect text, facilitating the quick identification and processing of these texts based on this tag during subsequent processing.

[0087] In an optional example, after performing a substitution of similar-looking characters on each first reference text to obtain the preset incorrect text, the preset text attribute corresponding to each preset incorrect text is determined according to pre-set rules and classification criteria. Taking the first reference text as "日" (sun), the first reference text is replaced with "目" (eye), and this preset text attribute is assigned to the replaced preset incorrect text "目". Through programming or data structure design, etc., the determined preset text attribute is associated and stored with the corresponding preset incorrect text at the data level, so that the bound preset text attribute information can be obtained simultaneously when processing the preset incorrect text. For example, in a data list, the preset incorrect text "目" and its corresponding attribute are stored as a group of data. By filtering and extracting the preset incorrect text through the preset text attribute, specific types of texts can be selected from numerous preset incorrect texts to form a targeted proofreading subset according to different proofreading objectives and requirements.
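One way to picture the binding of a preset character attribute is a small record type that stores the flag next to the character. The record layout, the `preset_error` flag, and the `original` field below are hypothetical; the disclosure only requires that the attribute be stored together with the preset error character so it can be filtered later.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProofreadChar:
    position: int                    # index in the target text recognition result
    text: str                        # character shown to the proofreader
    preset_error: bool = False       # bound preset character attribute (hypothetical flag)
    original: Optional[str] = None   # first reference character before replacement


def extract_preset_errors(chars: List[ProofreadChar]) -> List[ProofreadChar]:
    """Select only the characters that take part in the correction-rate evaluation."""
    return [c for c in chars if c.preset_error]


task_chars = [
    ProofreadChar(0, "目", preset_error=True, original="日"),  # "日" replaced by "目"
    ProofreadChar(1, "月"),                                    # ordinary OCR output
]
subset = extract_preset_errors(task_chars)
# subset contains only the record for "目"
```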

[0088] S320. Determine the first attribute information in the reference attribute information of the target proofreading task based on the preset incorrect set and the proofreading output set. The first attribute information is used to indicate the text correction rate situation for each preset incorrect text included in the preset incorrect set when generating the proofreading output set by performing text correction on each preset incorrect text in the preset incorrect set through executing the target proofreading task.

[0089] Among them, the reference attribute information includes at least one of the first attribute information, the second attribute information, and the third attribute information. The first attribute information is used to indicate the text correction rate situation when performing text correction on at least one preset incorrect text corresponding to the target proofreading task. Each preset incorrect text is formed by incorrectly presetting the first reference text. The second attribute information is used to indicate the text correction omission situation when performing text correction on multiple second reference texts through executing the target proofreading task. The third attribute information is used to indicate the text correction doubt situation when performing text correction on multiple second reference texts through executing the target proofreading task. The first reference text and the second reference text are texts in the target text recognition result obtained by performing optical character recognition corresponding to the target proofreading task.

[0090] As an optional but non-limiting implementation, determining the first attribute information in the reference attribute information of the target proofreading task based on the preset error set and the proofreading output set includes the following steps B1 - B3:

[0091] Step B1: Compare the preset error texts in the preset error set with the proofreading output texts corresponding to the preset error texts in the proofreading output set after proofreading.

[0092] Step B2: Determine the text correction status of each preset error text in the preset error set according to the text comparison result between the preset error text and the proofreading output text corresponding to the preset error text after proofreading. The text correction status of the preset error set is used to describe whether the preset error texts in the preset error set are corrected when performing the target proofreading task on the preset error set to generate the proofreading output set.

[0093] Step B3: Determine the first attribute information in the reference attribute information of the target proofreading task according to the text correction status of each preset error text in the preset error set.

[0094] Text comparison can refer to comparing each preset error text in the preset error set, word by word, with the corresponding proofreading output text in the proofreading output set to check whether the two differ, so as to determine whether the preset error text formed by error presetting has been corrected. For example, if the preset error text is "夫" and the corresponding proofreading output text is "天", the comparison shows that the two are different, indicating that a text correction has taken place. Each preset error text is compared in detail with the proofreading output text obtained after proofreading it, checking whether they are the same in terms of glyph, meaning, etc., and the comparison result is recorded; this operation is repeated for each preset error text in the preset error set.

[0095] Referring to Figure 2, the text correction status characterizes whether each preset error text has been modified to the correct text when the target proofreading task is performed on the preset error set to generate the proofreading output set. It can usually be divided into states such as corrected and uncorrected. For example, if the preset error text is a wrong character and becomes a correct character after proofreading, its text correction status is corrected; if it is still a wrong character or remains unchanged after proofreading, it is uncorrected.

[0096] The first attribute information indicates the character correction rate of the preset error texts in the preset error set when the target proofreading task is performed on the preset error set to generate the proofreading output set, and can be obtained by counting the character correction status of the preset error texts. The character correction rate is determined as the ratio of the number of preset error texts in the preset error set that have been corrected to the total number of preset error texts in the preset error set. For example, if the preset error set contains m1 preset error texts, of which n1 have the character correction status "corrected", the character correction rate indicated by the first attribute information is n1 / m1.
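The ratio n1 / m1 can be computed with a few lines of Python. The status strings and the choice to raise an error for an empty preset error set are assumptions of this sketch, not requirements stated in the disclosure.

```python
from typing import Iterable


def correction_rate(statuses: Iterable[str]) -> float:
    """Return n1 / m1, where n1 counts "corrected" statuses and m1 is the total."""
    status_list = list(statuses)
    m1 = len(status_list)
    if m1 == 0:
        # Assumption: a rate is undefined when no errors were preset.
        raise ValueError("preset error set is empty")
    n1 = sum(1 for s in status_list if s == "corrected")
    return n1 / m1


rate = correction_rate(["corrected", "not corrected", "corrected", "corrected"])
# rate == 0.75 (n1 = 3, m1 = 4)
```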

[0097] As an optional but non-limiting implementation manner, according to the character comparison result between the preset error text and the proofreading output text corresponding to the preset error text after proofreading, the character correction status of each preset error text in the preset error set is determined, including the following steps B21 - B22:

[0098] Step B21: If the preset error text is the same as the proofreading output text corresponding to the preset error text after proofreading, it is determined that the preset error text in the preset error set has not been corrected when performing the target proofreading task on the preset error set to generate a proofreading output set.

[0099] Step B22: If the preset error text is different from the proofreading output text corresponding to the preset error text after proofreading, it is determined that the preset error text in the preset error set has been corrected when performing the target proofreading task on the preset error set to generate a proofreading output set.

[0100] For example, referring to Figure 2, the character correction status of the preset error text corresponding to each first reference text is judged according to the character comparison result. If the preset error text is the same as the corresponding proofreading output text, the preset error text has not been corrected and its character correction status is "not corrected"; if the two are different, the preset error text has been corrected and its character correction status is "corrected". For example, if the preset error text is "夫" and the proofreading output text is "天", the two differ, so the character correction status is "corrected"; if the preset error text is "地" and the proofreading output text is also "地", the character correction status of this preset error text is "not corrected".
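Steps B21 and B22 reduce to an equality test per character, which the following sketch illustrates. The function names and the requirement that both sets be passed as equal-length, position-aligned lists are assumptions for illustration.

```python
from typing import List


def correction_status(preset_error: str, proofread_output: str) -> str:
    """B21: identical text means not corrected; B22: different text means corrected."""
    return "not corrected" if preset_error == proofread_output else "corrected"


def correction_statuses(preset_errors: List[str], outputs: List[str]) -> List[str]:
    """Compare the preset error set with the proofreading output set position by position."""
    if len(preset_errors) != len(outputs):
        raise ValueError("each preset error text needs exactly one proofreading output text")
    return [correction_status(p, o) for p, o in zip(preset_errors, outputs)]


statuses = correction_statuses(["夫", "地"], ["天", "地"])
# statuses == ["corrected", "not corrected"]
```

The steps as written only test whether the output differs from the preset error text. A stricter variant could also compare the output with the original first reference text to confirm that the change restored the right character.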

[0101] S330. Perform quality detection of the proofreading task on the target proofreading task based on the reference attribute information.

[0102] The technical solution of this disclosure, based on the aforementioned technical effects, can simulate real error scenarios by replacing the first reference text with similar-looking characters. Since similar-looking character confusion is common in OCR, replacement makes the proofreading task closer to the complexity of reality, comprehensively testing the ability to identify and correct such errors during the execution of the proofreading task, improving the comprehensiveness of proofreading, and facilitating the discovery of potential text accuracy problems to ensure output quality. Moreover, replacing the first reference text with similar-looking characters facilitates the quantitative evaluation of proofreading capabilities. By statistically analyzing the correction of preset erroneous characters, multi-dimensional quantitative indicators such as text correction rate can be obtained, providing objective data for evaluating the performance of the proofreading task, so as to conduct targeted training or system optimization. Furthermore, replacing the first reference text with similar-looking characters can also optimize the OCR algorithm and proofreading process. Based on the attribute information after proofreading, OCR problems can be traced and their feature extraction and other links can be optimized. The proofreading steps, methods and review mechanisms can also be adjusted to address the deficiencies in the proofreading process, forming a virtuous cycle to improve the efficiency and accuracy of text recognition and proofreading work, meeting the application scenarios with high accuracy requirements. Furthermore, text comparison can accurately identify differences between preset erroneous text and the proofread output text, providing a reliable basis for subsequent judgments on whether text has been corrected. 
Whether a change results from correcting a substituted similar-looking character or from text changes made for other reasons, it can be accurately captured, avoiding misjudgments and ensuring the accuracy of the entire proofreading and detection process. Accurate comparison reveals which text has been modified and which has not, thus determining the text correction status of each preset erroneous text and providing a more detailed dimension for the subsequent calculation of the first attribute information and in-depth analysis of proofreading quality. In addition to the overall text correction rate, further analysis can be conducted on the distribution of uncorrected text and the types of corrected text, to gain a more comprehensive understanding of the strengths and weaknesses of the proofreading work. Moreover, the text correction rate indicated by the first attribute information quantitatively and intuitively reflects the overall effectiveness of the proofreading task in correcting errors. A higher text correction rate indicates that the proofreading work can effectively identify and correct preset erroneous text, while a lower rate suggests potential deficiencies in the proofreading work. For proofreading tasks with low text correction rates, it can be determined whether re-proofreading is necessary, thereby enabling targeted optimization and upgrades to improve the performance and reliability of the proofreading task.

[0103] Figure 4 is a flowchart illustrating another proofreading task detection method provided in this embodiment. Based on the technical solutions of the above embodiments, the technical solution of this embodiment further optimizes the process of determining reference attribute information based on a preset error set and a proofreading output set. This embodiment can be combined with various optional solutions in one or more of the above embodiments.

[0104] As shown in Figure 4, the proofreading task detection method of the embodiments of the present disclosure may include the following processes:

[0105] S410. Determine at least one third reference text from multiple second reference texts, where each third reference text satisfies the following condition: after performing the target proofreading task, the third reference text is the same as the text corresponding to the output after proofreading the third reference text. The multiple second reference texts include all the texts in the target text recognition result corresponding to the target proofreading task or the remaining texts in the target text recognition result corresponding to the target proofreading task except the first reference text.

[0106] Referring to Figure 5, all the texts in the target text recognition result corresponding to the target proofreading task can be used as the multiple second reference texts, or the remaining texts in the target text recognition result except the first reference text can be used as the multiple second reference texts. Each third reference text is a specific text selected from the multiple second reference texts. Its defining characteristic is that, after the target proofreading task is executed, the third reference text is the same as the text output after proofreading, that is, it was not modified during the target proofreading task. In an optional example, each second reference text is compared one by one with the text output after proofreading that second reference text to determine whether the two are the same. For example, the multiple second reference texts are ["husband", "earth", "enter"], and the texts output after proofreading each second reference text are ["sky", "earth", "person"]. "husband" is compared with "sky", "earth" with "earth", and "enter" with "person". Only "earth" is identical to its proofread output, so it can be used as a third reference text; "husband" and "enter" were modified and are not third reference texts.
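The selection in S410 can be sketched as a simple element-wise comparison. The function name is hypothetical, and the sketch assumes the second reference texts and the proofread outputs are aligned by position.

```python
def select_third_reference_texts(second_reference_texts, proofread_outputs):
    """Return (index, text) pairs for texts left unchanged by proofreading."""
    if len(second_reference_texts) != len(proofread_outputs):
        raise ValueError("inputs must be aligned one-to-one")
    return [
        (index, text)
        for index, (text, output) in enumerate(
            zip(second_reference_texts, proofread_outputs)
        )
        if text == output
    ]


# Example from the paragraph above.
second_reference_texts = ["husband", "earth", "enter"]
proofread_outputs = ["sky", "earth", "person"]
third_reference_texts = select_third_reference_texts(
    second_reference_texts, proofread_outputs
)
```

Keeping the index alongside the text makes it possible to look up the source image and OCR confidence of each third reference text in later steps.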

[0107] S420. Determine the optical character recognition error probability associated with each third reference text. The optical character recognition error probability of the third reference text is used to indicate the probability of an optical character recognition error when obtaining the third reference text by performing optical character recognition on a text image.

[0108] Among them, referring to Figure 5 , in the entire proofreading task process, the third reference text is selected from the multiple second reference texts. The optical character recognition error probability associated with each third reference text can be a quantitative evaluation of the possibility of an error occurring in the OCR recognition process of the third reference text. The optical character recognition error probability can be calculated based on multiple factors, such as the font style of the text, the image clarity, the recognition accuracy of the OCR algorithm for this type of text, etc.

[0109] In an optional example, referring to Figure 5, for each determined third reference character, the optical character recognition error probability of this third reference character can be obtained according to a pre-established optical character recognition error probability database or calculation model for reference characters. For example, by querying the database, it is known that the OCR recognition error probability of the character "地" under specific font and image conditions is 5%. This operation is performed on each third reference character.

[0110] Determining the optical character recognition error probability associated with the third reference characters helps to analyze the possible problems of the uncorrected characters at the source. Since these third reference characters were not modified during proofreading, understanding their error probability during OCR recognition can determine the reliability of the initial recognition of the third reference characters as correct. For example, if a third reference character has a high optical character recognition error probability, then this third reference character is likely to be an originally incorrect character that was not detected during proofreading; conversely, if the error probability is low, it may be that the character itself is correct and it is reasonable that it was not modified during proofreading, which provides a strong clue for further evaluating the nature and cause of the proofreading omissions.

[0111] As an optional but non-limiting implementation, determining the optical character recognition error probability associated with each third reference character includes the following steps C1 - C2:

[0112] Step C1: Determine the character attribute data of each third reference character. The character attribute data of the third reference character includes the reference character image to which the third reference character belongs, the third reference character, and the optical character recognition confidence of the third reference character. The reference character image to which the third reference character belongs is the text image from which the third reference character is obtained by performing optical character recognition;

[0113] Step C2: Determine the optical character recognition error probability associated with each third reference character according to the character attribute data of each third reference character.

[0114] The character attribute data is a set of information describing the relevant characteristics of the third reference character. It covers several key aspects: the reference character image to which the third reference character belongs (that is, the image containing text from which this character was initially recognized through OCR), the specific recognized character that the third reference character is, and the optical character recognition confidence of the third reference character (an evaluation index representing the accuracy of the OCR system in recognizing this character). The reference character image refers to the image material carrying the third reference character before the optical character recognition operation is performed, for example, a scanned picture of a page of a paper book or a product packaging photo with text descriptions. OCR is performed on these images to recognize the corresponding text, and the third reference character is the specific text character extracted from such an image.

[0115] In one optional example, referring to Figure 5, the third reference texts are the unmodified second reference texts, that is, those that are identical to their corresponding proofreading output texts after the target proofreading task has been performed on the multiple second reference texts. Each third reference text corresponds to an original reference text image, so it is necessary to identify the specific image from which the third reference text was initially obtained through OCR. At the same time, the optical character recognition confidence of the third reference text is obtained; this confidence is the accuracy assessment value given by the OCR system when recognizing the text. Together, this information constitutes the text attribute data of the third reference text.

[0116] In one optional example, referring to Figure 5, for each third reference text whose text attribute data has been acquired, the optical character recognition (OCR) error probability is calculated based on that text attribute data. The specific calculation can be based on various factors and on pre-defined models and algorithms. For example, if the OCR confidence of the third reference text is low (e.g., below 0.7, or 70%) and the reference text image to which it belongs has poor clarity, text occlusion, or interference (as judged through image quality analysis of that reference text image), the OCR error probability of that third reference text can be increased accordingly; conversely, if the confidence is high and the image quality is good, its error probability can be decreased. Adjustments can also be made based on historical statistics. For instance, if errors have historically occurred frequently in OCR of text with the same font and a similar layout, the OCR error probability of this third reference text can be appropriately increased even if its current confidence is acceptable. By comprehensively considering the factors in the text attribute data together with historical experience, the final OCR error probability associated with each third reference text is determined.
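The rule-based adjustment described above can be sketched as follows. Only the 0.7 confidence threshold comes from the text; the baseline of 1 - confidence, the adjustment amounts (0.2, 0.1) and the 0.5 weight on the historical error rate are purely illustrative assumptions, as are the function and parameter names.

```python
def heuristic_error_probability(
    confidence,
    image_quality_ok,
    historical_error_rate=0.0,
    low_confidence_threshold=0.7,  # threshold mentioned in the text
):
    """Illustrative rule-based estimate of the OCR error probability."""
    # Assumption: start from the complement of the OCR confidence.
    probability = 1.0 - confidence

    if confidence < low_confidence_threshold and not image_quality_ok:
        # Low confidence plus a poor image: raise the estimate (hypothetical amount).
        probability += 0.2
    elif confidence >= low_confidence_threshold and image_quality_ok:
        # High confidence plus a good image: lower the estimate (hypothetical amount).
        probability -= 0.1

    # Historical errors for the same font and layout raise the estimate
    # (hypothetical weight).
    probability += 0.5 * historical_error_rate

    # Clamp to a valid probability.
    return min(1.0, max(0.0, probability))


risky = heuristic_error_probability(0.6, image_quality_ok=False)
safe = heuristic_error_probability(0.9, image_quality_ok=True)
history_adjusted = heuristic_error_probability(
    0.9, image_quality_ok=True, historical_error_rate=0.4
)
```

In practice the adjustment amounts would be tuned on labeled data or replaced by a trained model.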

[0117] By identifying text attribute data, we can fully trace the source of each third reference text and its key information during OCR recognition. This provides a comprehensive and detailed data foundation for in-depth analysis of the third reference text's performance throughout the entire process, avoiding the one-sidedness of focusing solely on the text itself while ignoring its background (such as image conditions and confidence levels during recognition). This integrated text attribute data can then be used for targeted analysis of different third reference texts. Determining the probability of optical character recognition errors quantifies the risk of errors occurring in the OCR recognition process. This is more accurate and intuitive than simple qualitative judgments (such as knowing only that the text may have errors or the image is not clear), providing concrete numerical evidence for evaluating the quality of proofreading tasks and analyzing text correction omissions.

[0118] As an optional but non-limiting implementation, determining the optical character recognition error probability associated with each third reference character based on the text attribute data of each third reference character includes the following steps D1-D2:

[0119] Step D1: Determine the first similarity associated with each third reference character. The first similarity is determined based on the image similarity between each third reference character and the reference character image to which the third reference character belongs.

[0120] Step D2: Determine the optical character recognition error probability associated with each third reference character based on the first similarity associated with each third reference character and the optical character recognition confidence of each third reference character.

[0121] The first similarity associated with a third reference text is the image similarity between the third reference text and the reference text image to which the third reference text belongs. In other words, it describes how closely the appearance of the third reference text matches the features of the text image at the image level, which helps to judge comprehensively whether the third reference text is prone to recognition errors.

[0122] Referring to Figure 5, for each third reference text, the reference text image to which the third reference text belongs is located. Then, the image similarity between the third reference text and the reference text image to which the third reference text belongs is calculated using image analysis algorithms and techniques. These algorithms may consider factors in multiple dimensions, such as the position of the text in the image, the contrast between the text strokes and the image background, the clarity of the text, and the correlation between the text and the surrounding image content.

[0123] Referring to Figure 5, for each third reference character whose first similarity and optical character recognition confidence have been obtained, the optical character recognition error probability is determined comprehensively from these two indicators according to certain rules or models. If a third reference character has both a low first similarity and a low optical character recognition confidence, the probability of an optical character recognition error is often relatively high; conversely, if both the first similarity and the optical character recognition confidence are high, the error probability is relatively low. This can be achieved through pre-defined calculation formulas, logical judgment rules, or machine learning models trained on existing data. The first similarity and the optical character recognition confidence are used as input parameters, and after the appropriate calculation or judgment process the output is the optical character recognition error probability associated with each third reference character, thereby quantifying the potential degree of error.

[0124] As an optional but non-limiting implementation, determining the first similarity associated with each third reference text includes the following steps E1-E3:

[0125] Step E1: Determine the glyphs of each font associated with each third reference text, and determine the font text images associated with the third reference text based on the glyphs of each font associated with the third reference text.

[0126] Step E2: Determine the second similarity between the reference text image to which the third reference text belongs and each font text image associated with the third reference text;

[0127] Step E3: Determine the first similarity of each third reference text based on the second similarity.

[0128] Referring to Figure 6, text often exists in multiple font styles (such as Song, Hei, Kai, etc.), so the glyphs of each font associated with a third reference text refer to the specific shape and style of that third reference text in different fonts. Based on these glyphs, text image data is created or acquired for each font glyph through preset methods (such as using text editing software to generate images for different fonts, or extracting them from existing document images containing text in that font). These text images serve as the font text images associated with the third reference text. Still referring to Figure 6, the second similarity refers to the degree of similarity between the reference text image to which the third reference text belongs (i.e., the original image from which the third reference text was initially recognized through OCR) and each font text image associated with the third reference text. By comparing the two types of images in various aspects (such as the shape of the character strokes, the layout of the characters in the image, the background features of the image, or any other factor that can be compared at the image level), a numerical value is obtained that represents this degree of similarity.

[0129] In one optional example, referring to Figure 6, determining the glyphs of each font associated with each third reference text includes: recognizing all font styles of the third reference texts corresponding to the target proofreading task; obtaining, based on differences in at least one of stroke width, stroke direction, or ligature connection, the font files of the font types involved in the third reference texts as a set of associated fonts; and then reading the glyphs associated with each third reference text from the set of associated fonts. Stroke width refers to the thickness of the strokes when writing or drawing text; stroke direction refers to the direction and order of the strokes; and ligature connection describes the way and position in which strokes are connected during writing.

[0130] In one optional example, referring to Figure 6, determining the font text images associated with the third reference text based on the glyphs of each font includes, for each font glyph associated with the third reference text: initializing a white background image of a specified size; rendering the glyph in black at a unified starting coordinate point on the white background image; using the black-and-white color difference to find the smallest rectangular area covered by the glyph and cropping it; and extending the shorter side of the cropped image outward from the image center until it equals the longer side. The result is a text image of uniform size with a centered glyph, which serves as the font text image for that font glyph.
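The cropping and squaring steps can be sketched on a plain grid of grayscale pixels. This is a minimal sketch under stated assumptions: the glyph has already been rendered onto a white canvas by some image library (the rendering itself is not shown), any non-white pixel is treated as part of the glyph, and the function names are hypothetical.

```python
WHITE, BLACK = 255, 0


def crop_to_glyph(image):
    """Crop to the smallest rectangle that contains every non-white pixel."""
    rows = [r for r, row in enumerate(image) if any(p != WHITE for p in row)]
    cols = [
        c for c in range(len(image[0])) if any(row[c] != WHITE for row in image)
    ]
    if not rows:
        raise ValueError("no glyph pixels found")
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def pad_to_square(image):
    """Extend the shorter side with white pixels so the glyph stays centered."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    pad_top = (side - height) // 2
    pad_left = (side - width) // 2
    square = [[WHITE] * side for _ in range(side)]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            square[pad_top + r][pad_left + c] = pixel
    return square


# A 6x6 white canvas with a vertical black stroke three pixels tall.
canvas = [[WHITE] * 6 for _ in range(6)]
for r in (1, 2, 3):
    canvas[r][2] = BLACK

glyph_image = pad_to_square(crop_to_glyph(canvas))
```

The cropped stroke is 3 pixels tall and 1 pixel wide, so one white column is added on each side to produce a 3x3 image. A final resize to a common resolution would be needed before feature extraction and is left out here.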

[0131] In one optional example, referring to Figure 6, determining the second similarity between the reference text image to which the third reference text belongs and each font text image associated with the third reference text includes: extracting image features from the reference text image using a neural network model with an inverted residual structure and depthwise separable convolutions (such as a deep learning model for calculating image similarity) to obtain the image feature vector corresponding to the reference text image; and extracting image features from each font text image associated with the third reference text using the same model to obtain the image feature vector corresponding to each font text image. Then, the image feature vectors corresponding to the font text images are traversed, the cosine similarity between the image feature vector of each font text image and the image feature vector of the reference text image is calculated, and each cosine similarity is used as a second similarity. Optionally, determining the first similarity associated with each third reference text based on the second similarities includes: determining the maximum of the second similarities as the first similarity associated with the third reference text, or determining the average of the second similarities as the first similarity associated with the third reference text.
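The cosine-similarity step and the two aggregation options (maximum or average) can be sketched as follows. Feature extraction by the neural network is assumed to have already happened; the short vectors below are made-up placeholders, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def first_similarity(reference_vector, font_vectors, mode="max"):
    """Aggregate the second similarities into the first similarity."""
    second_similarities = [
        cosine_similarity(reference_vector, vector) for vector in font_vectors
    ]
    if not second_similarities:
        raise ValueError("at least one font text image is required")
    if mode == "max":
        return max(second_similarities)
    if mode == "mean":
        return sum(second_similarities) / len(second_similarities)
    raise ValueError("mode must be 'max' or 'mean'")


# Placeholder feature vectors: one reference image, two font text images.
reference_vector = [1.0, 0.0]
font_vectors = [[1.0, 0.0], [0.0, 1.0]]

max_similarity = first_similarity(reference_vector, font_vectors, mode="max")
mean_similarity = first_similarity(reference_vector, font_vectors, mode="mean")
```

Taking the maximum rewards a close match with any single font, while the average penalizes a glyph that resembles only one font style.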

[0132] As an optional but non-limiting implementation, determining the optical character recognition error probability associated with each third reference character based on the first similarity associated with each third reference character and the optical character recognition confidence of each third reference character includes the following steps E31-E32:

[0133] Step E31: Input the first similarity associated with each third reference character and the optical character recognition confidence of each third reference character into the character recognition error prediction model. The character recognition error prediction model is a calculation model trained based on a binary classification logistic regression model and used to determine whether the characters output by optical character recognition of text images are correct.

[0134] Step E32: Determine the probability of optical character recognition error associated with each third reference character using the character recognition error prediction model.

[0135] Referring to Figure 7, the character recognition error prediction model is a pre-built computational model used to determine whether the text output by optical character recognition (OCR) of a text image is correct; it is a classification model. The model is trained based on a binary logistic regression model, which means that it learns from a large amount of sample data pre-labeled as correct or incorrect to find the feature patterns contained in the data. Here, it establishes a correlation between the input data (the first similarity and the optical character recognition confidence) and the correctness of the text, thereby enabling it to predict the correctness of the text corresponding to new input data.

[0136] Referring to Figure 7, after receiving the first similarity and the optical character recognition confidence, the character recognition error prediction model processes the input data according to the logical rules and mathematical relationships learned during training of its binary logistic regression model. For example, the model considers the interaction between these two input values and the text-related features they reflect, and compares them with the previously learned probability distributions of correct and incorrect characters at different similarity and confidence levels, to calculate the predicted optical character recognition error probability of the third reference character corresponding to the current input. Finally, it outputs the optical character recognition error probability associated with that third reference character. This process is repeated for each third reference character to obtain its corresponding optical character recognition error probability.

[0137] By inputting both the first similarity and the optical character recognition confidence into the model, the third reference text is considered comprehensively from different dimensions. Specifically, the first similarity reflects the text's features in the original image and its association with the overall image, while the optical character recognition confidence reflects the judgment of text correctness given by the OCR algorithm and its internal mechanisms such as text feature matching. Combining the two depicts the true state of the third reference text more comprehensively and in finer detail, avoiding the one-sidedness and limitations that may occur when relying on a single indicator to judge the correctness of text, and making subsequent correctness predictions more reasonable and accurate. Leveraging the learning ability of the character recognition error prediction model trained on a binary logistic regression model, the model can effectively uncover the complex nonlinear relationship between the two input indicators (first similarity and optical character recognition confidence) and text correctness. By learning from a large amount of diverse sample data, the model captures the probabilistic tendency of text to be correct or incorrect under different combinations of the two indicators, thus enabling more scientific and accurate correctness predictions on new input data and significantly improving the reliability and efficiency of judging text correctness.

[0138] In one optional example, referring to Figure 7, the character recognition error prediction model is constructed as follows. A predetermined number of candidate text data entries are constructed; each includes a candidate text image, the candidate OCR character obtained by OCR of the candidate text image, and the candidate OCR recognition confidence. Each candidate OCR character is labeled with a candidate OCR recognition correctness indicating whether the OCR recognition is correct. The image similarity between each candidate text image and the candidate OCR character obtained from it is calculated, yielding the candidate OCR image similarity associated with each candidate OCR character. The candidate OCR recognition confidence, the candidate OCR recognition correctness, and the candidate OCR image similarity of each candidate OCR character constitute the text dataset used for model training. From this dataset, 75% is randomly selected as the training set, 15% as the validation set, and 15% as the test set, and a binary classification logistic regression model is constructed. The candidate OCR recognition confidence and the candidate OCR image similarity are used as inputs and the candidate OCR recognition correctness as the output, and the model is trained with the binary cross-entropy loss function (BCELoss). At the end of each epoch, the model is monitored on the validation set, and finally the model is evaluated on the test set.
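Training a binary logistic regression with binary cross-entropy can be sketched in plain Python. This is a toy sketch, not the disclosed model: the six samples are invented for illustration, the dataset split and per-epoch validation are omitted, and treating the error probability as 1 minus the predicted probability of "correct" is an assumption. All function names are hypothetical.

```python
import math


def predict_correct(weights, bias, features):
    """Sigmoid of the linear score: predicted probability that OCR was correct."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(weights, bias, samples):
    """Mean binary cross-entropy over (features, label) samples."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for features, label in samples:
        p = predict_correct(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, epochs=2000, learning_rate=0.5):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in samples:
            error = predict_correct(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def error_probability(weights, bias, confidence, similarity):
    """Assumption: OCR error probability = 1 - P(correct)."""
    return 1.0 - predict_correct(weights, bias, (confidence, similarity))


# Invented samples: ((OCR confidence, image similarity), 1 if OCR was correct).
samples = [
    ((0.95, 0.90), 1), ((0.90, 0.85), 1), ((0.85, 0.80), 1),
    ((0.40, 0.30), 0), ((0.35, 0.45), 0), ((0.50, 0.25), 0),
]

initial_loss = bce_loss([0.0, 0.0], 0.0, samples)
weights, bias = train(samples)
final_loss = bce_loss(weights, bias, samples)
low_quality = error_probability(weights, bias, 0.3, 0.3)
high_quality = error_probability(weights, bias, 0.9, 0.9)
```

After training, a character with low confidence and low similarity receives a higher error probability than one with high values for both inputs.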

[0139] S430. Based on the optical character recognition error probability associated with each third reference character, determine the second attribute information in the reference attribute information of the target proofreading task.

[0140] The reference attribute information includes at least one of a first attribute information, a second attribute information, and a third attribute information. The first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is a text formed by performing error pre-setting on the first reference text. The second attribute information is used to indicate the text correction omission when performing text correction on multiple second reference texts. The third attribute information is used to indicate the text correction doubt when performing text correction on multiple second reference texts. The first reference text and the second reference text are the texts in the target text recognition results obtained by performing optical character recognition corresponding to the target proofreading task.

[0141] Among them, the optical character recognition error probabilities of each third reference character can be comprehensively considered, and the optical character recognition error probabilities of each third reference character can be sorted. The second attribute information is determined based on the optical character recognition error probability at the median; or the second attribute information is determined based on the average value of the optical character recognition error probabilities of each third reference character. Among them, the second attribute information is used to describe the character correction omission rate caused by character correction omission when performing the target proofreading task on multiple third reference characters to generate proofread output characters. For example, there are three third reference characters, namely "ground", "person", and "mountain", and the optical character recognition error probabilities of the third reference characters are 5%, 3%, and 8% respectively. The second attribute information can be determined in various ways, such as calculating the average value, sum of these error probabilities, or calculating the weighted average according to a certain weight. In this way, the second attribute information (a quantitative representation of the character correction omission situation) is (5% + 3% + 8%) / 3 = 5.33%, which reflects the average probability situation of possible OCR recognition errors of the original characters of the uncorrected characters as a whole, and can be used to measure the omission degree of possible incorrect characters in the proofreading task. By comprehensively processing the optical character recognition error probabilities of each third reference character, the situation of character correction omission in the proofreading task can be reflected as a whole.
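The mean and median options described above can be sketched with the standard library. The function name is hypothetical; the probabilities are the three values from the example in the paragraph above.

```python
import statistics


def second_attribute(error_probabilities, method="mean"):
    """Aggregate per-character OCR error probabilities into an omission rate."""
    if not error_probabilities:
        raise ValueError("at least one third reference character is required")
    if method == "mean":
        return statistics.mean(error_probabilities)
    if method == "median":
        return statistics.median(error_probabilities)
    raise ValueError("method must be 'mean' or 'median'")


# Error probabilities of the three third reference characters in the example.
error_probabilities = [0.05, 0.03, 0.08]

mean_rate = second_attribute(error_probabilities, "mean")
median_rate = second_attribute(error_probabilities, "median")
```

The mean reproduces the (5% + 3% + 8%) / 3 ≈ 5.33% figure, while the median (5%) is less sensitive to a single character with an unusually high error probability.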

[0142] In an optional example, after obtaining the optical character recognition error probabilities of each third reference character, the optical character recognition error probability of the third reference character can be used as the correction omission index of the third reference character, and based on this, the character correction omission rate when performing the target proofreading task on multiple second reference characters to generate a proofread output set is calculated. Assume that there are N third reference characters among the multiple second reference characters on which the target proofreading task is performed, the optical character recognition error probability of each third reference character is f, the first similarity of the i-th third reference character is score_i, and the optical character recognition confidence associated with the i-th third reference character is prob_i. The formula for calculating the correction omission rate is then as follows:

[0143] [Formula image not reproduced in the source text.]

[0144] S440. Perform proofreading task quality detection on the target proofreading task based on the reference attribute information.

[0145] The technical solution disclosed herein, in addition to achieving the aforementioned technical effects, can also accurately identify uncorrected characters during the proofreading process by determining the third reference character, and determine the optical character recognition error probability of the third reference character. This helps to trace potential problems with these uncorrected characters in the initial OCR recognition stage. Furthermore, based on the optical character recognition error probability, the risk of missing character corrections in the proofreading task can be quantitatively assessed. The second attribute information determined according to the optical character recognition error probability of each third reference character can reflect the overall situation of missing character corrections in the proofreading task. In addition, the second attribute information can intuitively show the degree of neglect of potentially erroneous characters during the proofreading process, so as to make timely targeted adjustments to the proofreading task execution process, thereby improving the efficiency and quality of the entire proofreading process.

[0146] Figure 8 is a flowchart illustrating another proofreading task detection method provided in this embodiment. Building on the technical solutions of the above embodiments, this embodiment further optimizes the process of determining the reference attribute information of the target proofreading task. This embodiment can be combined with the optional solutions in one or more of the above embodiments.

[0147] As shown in Figure 8, the proofreading task detection method of this disclosure embodiment may include the following process:

[0148] S810. Determine at least one fourth reference character from a plurality of second reference characters, wherein each fourth reference character satisfies the following condition: after the target proofreading task is performed, the fourth reference character differs from the proofread output character corresponding to that fourth reference character. The plurality of second reference characters include either all characters in the target character recognition result corresponding to the target proofreading task, or the characters in that result other than the first reference characters.

[0149] The fourth reference text can be a specific set of texts generated after filtering multiple second reference texts. Each fourth reference text is different from the proofread output text corresponding to the fourth reference text, meaning that these texts were modified during the proofreading process and are the part of the text for which proofreading took place.

[0150] In one optional example, the proofread output text obtained after performing the target proofreading task on the second reference texts is acquired, and each second reference text is compared one by one with its corresponding proofread output text. The second reference texts that differ from their proofread output text are selected as the fourth reference texts.
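The one-by-one comparison can be sketched as below. The sketch assumes that the recognized characters and the proofread output are aligned position by position; that alignment, and the name `split_by_change`, are assumptions made for illustration. Changed positions correspond to fourth reference characters, and unchanged positions correspond to the third reference characters used for the second attribute information.

```python
def split_by_change(recognized, proofread):
    """Split aligned characters into unchanged and changed positions.

    recognized: second reference characters (OCR output), in order.
    proofread: proofread output characters, aligned one-to-one with recognized.
    Returns (unchanged, changed); each item is (index, ocr_char, proofread_char).
    """
    if len(recognized) != len(proofread):
        raise ValueError("inputs must be aligned one-to-one")
    unchanged, changed = [], []
    for i, (ocr_char, out_char) in enumerate(zip(recognized, proofread)):
        bucket = unchanged if ocr_char == out_char else changed
        bucket.append((i, ocr_char, out_char))
    return unchanged, changed


unchanged, changed = split_by_change("abcd", "abxd")
print(changed)  # [(2, 'c', 'x')]
```

If proofreading can insert or delete characters, an explicit alignment step would be needed before this comparison; that case is outside this sketch.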

[0151] S820. Determine the correction doubt results for each fourth reference text. The correction doubt results are used to indicate whether there is any doubt about the correctness of the correction when performing text correction on the fourth reference text in the target proofreading task.

[0152] Referring to Figure 9, the correction doubt result is a judgment given for each fourth reference text. It indicates whether there is any doubt about the correctness of the correction made to the fourth reference text in the target proofreading task. For example, if it is uncertain whether the modified result of the fourth reference text is correct, or if the handling of this correction during the proofreading task was ambiguous, the correction doubt result is marked as in doubt. The third attribute information is determined through a comprehensive analysis of the correction doubt results of all fourth reference texts. It reflects the execution status and quality of the proofreading task from the perspective of correction doubt, indicating how reliable and certain the corrections made during proofreading are.

[0153] In an optional example, the correction doubt result of each fourth reference text is determined according to the specific circumstances. In a manual proofreading process, the proofreader, drawing on professional knowledge, understanding of the context, and text usage standards, reviews the modification made to the fourth reference text and judges whether they were fully confident in the result at the time. If there was any hesitation or uncertainty, the correction doubt result is marked as in doubt. In an automated proofreading system, the system can mark the correction doubt result of the corresponding fourth reference text as in doubt based on built-in judgment logic, for example for certain special types of text modification, or where the relationship between the modified text and the surrounding text does not conform to conventional logic.

[0154] As an optional, non-limiting implementation, determining the correction doubt result for each fourth reference text includes the following steps F1-F2:

[0155] Step F1: Determine the text attribute data of each fourth reference character. The text attribute data of the fourth reference character includes the reference character image to which the fourth reference character belongs, the fourth reference character, and the proofread output text corresponding to the fourth reference character. The reference character image to which the fourth reference character belongs is the text image that is input to optical character recognition to obtain the fourth reference character.

[0156] Step F2: Determine the correction doubt result for each fourth reference character based on the text attribute data of each fourth reference character.

[0157] Among them, the text attribute data of the fourth reference text is a set of information used to comprehensively describe the text features and related situations associated with the fourth reference text. It mainly covers three key parts: the reference text image to which the fourth reference text belongs, the fourth reference text, and the text output corresponding to the fourth reference text after proofreading.

[0158] The reference text image to which the fourth reference text belongs can refer to the original text image that initially contained the fourth reference text and from which the fourth reference text was extracted using Optical Character Recognition (OCR) technology. The text output after the fourth reference text has been proofread can be the text output after each fourth reference text has been proofread and modified following the execution of the target proofreading task; this text is the proofreading output text.

[0159] As an optional, non-limiting implementation, determining the correction doubt result for each fourth reference character based on its text attribute data includes the following steps H1-H2:

[0160] Step H1: Determine the third similarity and fourth similarity associated with each fourth reference character. The third similarity is determined based on the image similarity between each fourth reference character and the reference character image to which the fourth reference character belongs. The fourth similarity is determined based on the image similarity between the proofread output text of each fourth reference character and the reference character image to which the fourth reference character belongs.

[0161] Step H2: If the third similarity is greater than the fourth similarity, or the fourth similarity is less than the preset similarity threshold, it is determined that the correctness of the correction made to the fourth reference text in the target proofreading task is in doubt.
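The decision rule of Step H2 can be written as a short predicate. The sketch below takes the two similarities as already computed; the function name and the threshold value 0.6 are illustrative assumptions, since the disclosure does not fix a threshold.

```python
def correction_in_doubt(third_similarity, fourth_similarity, threshold):
    """Return True when the correction is flagged as doubtful (Step H2).

    third_similarity: similarity between the original OCR character and the image.
    fourth_similarity: similarity between the proofread character and the image.
    threshold: preset similarity threshold (value chosen by the application).
    """
    return third_similarity > fourth_similarity or fourth_similarity < threshold


# Hypothetical similarity values, with an assumed threshold of 0.6.
print(correction_in_doubt(0.9, 0.7, 0.6))  # True: original matches the image better
print(correction_in_doubt(0.4, 0.5, 0.6))  # True: corrected character is below threshold
print(correction_in_doubt(0.4, 0.8, 0.6))  # False
```

The first condition catches corrections that moved away from what the image shows, and the second catches corrections that still do not resemble the image well, even if they improve on the original.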

[0162] In an optional example, determining the third similarity of each fourth reference character includes: determining the glyphs of the fourth reference character in each font, and determining the font image for each font based on those glyphs; determining the fifth similarity between the reference character image to which the fourth reference character belongs and each font image; and determining the third similarity of the fourth reference character based on the fifth similarity. For the specific implementation logic, refer to the process of calculating the similarity for the third reference text shown in Figure 6.

[0163] In an optional example, determining the fourth similarity of each fourth reference character includes: determining the glyphs, in each font, of the proofread output text corresponding to the fourth reference character, and determining the font image for each font based on those glyphs; determining the sixth similarity between the reference text image to which the fourth reference character belongs and each of these font images; and determining the fourth similarity of the fourth reference character based on the sixth similarity. For the specific implementation logic, refer to the process of calculating the similarity for the third reference text shown in Figure 6.

[0164] S830. Determine the third attribute information in the reference attribute information of the target proofreading task based on the correction doubt results of each fourth reference text.

[0165] The reference attribute information includes at least one of a first attribute information, a second attribute information, and a third attribute information. The first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is a text formed by performing error pre-setting on the first reference text. The second attribute information is used to indicate the text correction omission when performing text correction on multiple second reference texts. The third attribute information is used to indicate the text correction doubt when performing text correction on multiple second reference texts. The first reference text and the second reference text are the texts in the target text recognition results obtained by performing optical character recognition corresponding to the target proofreading task.

[0166] In an optional example, referring to Figure 9, after the correction doubt results for each fourth reference text are obtained, these results are combined to determine the third attribute information. A common approach is to calculate the proportion of fourth reference texts that are in doubt out of the total number of fourth reference texts, thereby quantifying the degree of correction doubt in the entire proofreading task. The third attribute information thus directly reflects the proportion of texts whose corrections are uncertain, and reflects the quality of the proofreading task from one perspective. More complex analysis methods can also be used, such as assigning different weights to different types of doubt before computing the statistic, depending on the specific application needs and evaluation criteria.

[0167] For example, referring to Figure 9, all the fourth reference texts are traversed and the correction doubt result of each fourth reference text is obtained. Assume the number of fourth reference texts is m2 and, among them, the number of fourth reference texts with correction doubts is n2. The correction doubt rate described by the third attribute information is then n2 / m2. For example, there are 10 fourth reference texts, which are "husband", "too", "enter", "eye", "sun", "tree", "rice", "big", "soil", "worker" in sequence, and their corresponding correction doubt results are in doubt, not in doubt, in doubt, not in doubt, in doubt, not in doubt, in doubt, not in doubt, in doubt, not in doubt respectively. These correction doubt results are classified and counted to determine the number of fourth reference texts in doubt and not in doubt. In the above example, there are 5 fourth reference texts in doubt and 5 not in doubt. The third attribute information is obtained by dividing the number of fourth reference texts in doubt by the total number of fourth reference texts, giving 5 / 10 = 50%, which indicates considerable uncertainty about the text corrections in the proofreading task. Other calculation methods can also be used, such as calculating a weighted sum after assigning different weights to different degrees of doubt.
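The ratio can be computed directly from the per-character doubt flags. The sketch below uses the ten results exactly as listed in the example (alternating, starting with "in doubt"); the function name `correction_doubt_rate` is an assumption for illustration.

```python
def correction_doubt_rate(doubt_flags):
    """Return n2 / m2: the share of fourth reference texts flagged as in doubt.

    doubt_flags: one boolean per fourth reference text, True meaning "in doubt".
    """
    if not doubt_flags:
        raise ValueError("at least one fourth reference text is required")
    in_doubt = sum(1 for flag in doubt_flags if flag)  # n2
    return in_doubt / len(doubt_flags)                 # n2 / m2


# Results for "husband", "too", "enter", "eye", "sun",
# "tree", "rice", "big", "soil", "worker", in the order listed.
flags = [True, False, True, False, True, False, True, False, True, False]
print(correction_doubt_rate(flags))  # 0.5
```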

[0168] By comprehensively analyzing the correction doubt results of each fourth reference text to determine the third attribute information, the originally scattered correction doubt situations for individual texts can be integrated and quantified. The result of this quantification (such as the above ratio value) can intuitively reflect the degree of doubt in text correction for the entire proofreading task. The third attribute information can help locate the weak links and potential risk points in the proofreading task. If the third attribute information shows a relatively high doubt ratio, then the characteristics of the fourth reference texts in doubt can be further analyzed, such as whether they are concentrated in certain specific types of texts (such as technical terms, rare characters, etc.), specific text paragraphs (such as parts involving complex logical relationships or professional knowledge explanations), or within the processing scope of specific proofreaders, so as to take targeted measures.

[0169] S840. Perform quality detection on the target proofreading task based on the reference attribute information.

[0170] The technical solution disclosed herein, in addition to achieving the aforementioned technical effects, clearly identifies the text actually modified during the proofreading process by determining the fourth reference text. This provides a clear scope for subsequent in-depth analysis of the rationality and accuracy of the proofreading actions. Without such screening, it is difficult to distinguish which texts changed during proofreading, making it impossible to specifically investigate whether these modifications are correct or raise questions. Determining the correction doubt result for each fourth reference text is equivalent to examining the reliability of each proofreading modification at a micro-level. This enriches the dimensions of proofreading quality assessment, enabling more precise identification of uncertain modifications during the proofreading process. It avoids overlooking potentially erroneous modifications due to a lack of careful attention, thus improving the accuracy of proofreading quality control. By clearly defining the correction doubt results, it is possible to promptly identify text modifications that are questionable but may have been overlooked during the proofreading process; these points of doubt often represent the sources of proofreading risk. Determining the third attribute information based on the correction doubt results of each fourth reference text achieves a comprehensive quantitative representation of the correction doubt situation in the proofreading task.

[0171] Figure 10 is a schematic diagram of a proofreading task detection device provided in an embodiment of the present disclosure. The present disclosure is applicable to the detection of proofreading quality of manual proofreading tasks performed after OCR text recognition. The proofreading task detection device can be implemented in the form of software and / or hardware, and is generally integrated on any electronic device with network communication function, such as a mobile terminal, PC, or server.

[0172] As shown in Figure 10, the proofreading task detection apparatus of this disclosure embodiment may include the following:

[0173] The determining module 1010 is used to determine reference attribute information for the target proofreading task. The reference attribute information includes at least one of a first attribute, a second attribute, and a third attribute. The first attribute is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task. Each preset erroneous text is formed by performing error pre-setting on a first reference text. The second attribute is used to indicate the text correction omission when performing text correction on multiple second reference texts. The third attribute is used to indicate the text correction doubt when performing text correction on multiple second reference texts. The first reference text and the second reference text are texts in the target text recognition result obtained by performing optical character recognition corresponding to the target proofreading task.

[0174] The detection module 1020 is used to perform proofreading task quality detection on the target proofreading task based on the reference attribute information.

[0175] Based on the above embodiments, optionally, the reference attribute information for the target proofreading task is determined, including:

[0176] Determine the preset error set and the proofreading output set corresponding to the target proofreading task. The preset error set includes at least one preset error text for performing the target proofreading task. The proofreading output set includes the proofreading output text corresponding to each preset error text after the target proofreading task is performed on at least one preset error text in the preset error set.

[0177] The first attribute information in the reference attribute information of the target proofreading task is determined based on the preset error set and the proofreading output set. The first attribute information is used to indicate the text correction rate for each preset error character in the preset error set when the target proofreading task is executed to correct the text of each preset error character included in the preset error set to generate the proofreading output set.

[0178] Based on the above embodiments, optionally, determining the preset error set and proofreading output set corresponding to the target proofreading task includes:

[0179] At least one first reference character is obtained from the target character recognition result corresponding to the target proofreading task, and the optical character recognition confidence level corresponding to the first reference character is greater than the preset recognition confidence level.

[0180] By performing similar-looking character replacement on each of the at least one first reference characters, a preset error character associated with each first reference character is obtained, so as to obtain the preset error set.

[0181] Based on the above embodiments, optionally, the replacement of each of the first reference characters included in the at least one first reference character with similar-looking characters includes:

[0182] For each first reference character, obtain a list of similar characters for the first reference character from the similar character mapping table. The characters included in the list of similar characters for the first reference character satisfy a preset character shape similarity with the first reference character.

[0183] Replace the first reference text with the text at the random index position in the list of similar-looking characters.
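The similar-character replacement can be sketched as follows. The mapping table `SIMILAR_CHARS` and its three entries are hypothetical stand-ins for a real similar-character mapping table, and the function names are assumptions introduced for illustration.

```python
import random

# Hypothetical similar-character mapping table; a real table would be far larger.
SIMILAR_CHARS = {
    "己": ["已", "巳"],
    "未": ["末"],
    "土": ["士"],
}


def preset_error(char, table=SIMILAR_CHARS, rng=random):
    """Replace char with the entry at a random index of its similar-character list.

    Returns None when the table has no similar characters for char.
    """
    candidates = table.get(char)
    if not candidates:
        return None
    index = rng.randrange(len(candidates))  # random index position in the list
    return candidates[index]


def build_preset_error_set(first_reference_chars, rng=random):
    """Map each position to (original, preset_error), skipping unmapped characters."""
    result = {}
    for pos, ch in enumerate(first_reference_chars):
        replaced = preset_error(ch, rng=rng)
        if replaced is not None:
            result[pos] = (ch, replaced)
    return result


errors = build_preset_error_set(["己", "天", "土"], rng=random.Random(0))
print(sorted(errors))  # [0, 2]: "天" has no entry in the hypothetical table
```

Recording the position and the original character alongside each preset error is one way to support the later filtering step, in which only preset errors participate in the text correction rate evaluation.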

[0184] Based on the above embodiments, optionally, after performing similar-looking character replacement on each of the at least one first reference character to obtain a preset error character associated with each first reference character, the method further includes:

[0185] A preset text attribute is bound to the preset error text obtained by replacing each first reference character with a similar-looking character. The preset text attribute is used to filter and extract, from the at least one preset error text corresponding to the target proofreading task, each preset error text that participates in the text correction rate evaluation.

[0186] Based on the above embodiments, optionally, the first attribute information in the reference attribute information of the target proofreading task is determined based on the preset error set and the proofreading output set, including:

[0187] The preset error text in the preset error set is compared with the corresponding output text of the preset error text in the proofreading output set after proofreading.

[0188] Based on the text comparison results between the preset error text and the corresponding output text after the preset error text is proofread, the text correction status of each preset error text in the preset error set is determined. The text correction status of the preset error set is used to describe whether the preset error text in the preset error set is corrected when the target proofreading task is performed on the preset error set to generate the proofreading output set.

[0189] The first attribute information in the reference attribute information of the target proofreading task is determined based on the text correction status of each preset error text in the preset error set.

[0190] Based on the above embodiments, optionally, the text correction status of each preset error character in the preset error set is determined according to the text comparison result between the preset error character and the corresponding output text after proofreading, including:

[0191] If the preset error text is the same as the corresponding proofreading output text after the preset error text is proofread, then it is determined that the preset error text in the preset error set was not corrected when the target proofreading task was performed on the preset error set to generate the proofreading output set.

[0192] If the preset error text is different from the corresponding proofreading output text after the preset error text is proofread, it is determined that the preset error text in the preset error set has been corrected when the target proofreading task is executed on the preset error set to generate the proofreading output set.
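The comparison rule above leads to a simple rate calculation, sketched below. The function name `text_correction_rate` is an assumption; the rule itself (same means not corrected, different means corrected) follows the text.

```python
def text_correction_rate(preset_errors, proofread_outputs):
    """Fraction of preset error texts that were changed by proofreading.

    Following the rule in the text, a preset error counts as corrected when the
    proofread output differs from it, and as not corrected when they are equal.
    """
    if len(preset_errors) != len(proofread_outputs):
        raise ValueError("each preset error needs exactly one proofread output")
    if not preset_errors:
        raise ValueError("the preset error set must not be empty")
    corrected = sum(1 for err, out in zip(preset_errors, proofread_outputs) if err != out)
    return corrected / len(preset_errors)


# Three preset errors; the second one was left unchanged by the proofreader.
rate = text_correction_rate(["已", "末", "士"], ["己", "末", "土"])
print(round(rate, 3))  # 0.667
```

Note that the rule as stated only checks whether the text changed. A stricter variant, which is an assumption beyond the text, would also require the proofread output to equal the original first reference character.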

[0193] Based on the above embodiments, optionally, the reference attribute information for the target proofreading task is determined, including:

[0194] At least one third reference character is determined from a plurality of second reference characters, each of the third reference characters satisfying the following condition: after the target proofreading task is performed, the third reference character is the same as the proofread output character corresponding to that third reference character, and the plurality of second reference characters include either all characters in the target character recognition result corresponding to the target proofreading task or the characters in that result other than the first reference characters;

[0195] Determine the optical character recognition error probability associated with each of the third reference characters, wherein the optical character recognition error probability of the third reference character is used to indicate the probability of an optical character recognition error occurring when the third reference character is obtained by optical character recognition of the text image;

[0196] Based on the optical character recognition error probability associated with each of the third reference characters, the second attribute information in the reference attribute information of the target proofreading task is determined.

[0197] Based on the above embodiments, optionally, determining the optical character recognition error probability associated with each of the third reference characters includes:

[0198] The text attribute data of each third reference character is determined. The text attribute data of the third reference character includes the reference character image to which the third reference character belongs, the third reference character, and the optical character recognition confidence of the third reference character. The reference character image to which the third reference character belongs is the text image that is input to optical character recognition to obtain the third reference character.

[0199] Based on the text attribute data of each of the third reference characters, determine the optical character recognition error probability associated with each of the third reference characters.

[0200] Based on the above embodiments, optionally, the optical character recognition error probability associated with each of the third reference characters is determined according to the text attribute data of each of the third reference characters, including:

[0201] A first similarity is determined for each of the third reference characters, the first similarity being determined based on the image similarity between each of the third reference characters and the reference character image to which the third reference character belongs;

[0202] The optical character recognition error probability associated with each of the third reference characters is determined based on the first similarity associated with each of the third reference characters and the optical character recognition confidence of each of the third reference characters.

[0203] Based on the above embodiments, optionally, determining the first similarity associated with each of the third reference characters includes:

[0204] Determine the font glyphs associated with each of the third reference characters, and determine the font text images associated with each of the third reference characters based on the font glyphs associated with each of the third reference characters;

[0205] Determine the second similarity between the reference text image to which the third reference text belongs and each font text image associated with the third reference text;

[0206] The first similarity of each third reference text is determined based on the second similarity.

[0207] Based on the above embodiments, optionally, the optical character recognition error probability associated with each third reference character is determined according to the first similarity associated with each third reference character and the optical character recognition confidence level of each third reference character, including:

[0208] The first similarity associated with each of the third reference characters and the optical character recognition confidence of each of the third reference characters are input into the character recognition error prediction model. The character recognition error prediction model is trained based on a binary classification logistic regression model and is used to determine whether the characters output by optical character recognition of text images are correct.

[0209] The probability of optical character recognition error associated with each of the third reference characters is determined by a character recognition error prediction model.
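The prediction step can be illustrated with a logistic function over the two inputs named in the text. The text states only that the model is trained based on a binary-classification logistic regression; the coefficient values below are made-up placeholders, not trained parameters, and the function name is an assumption.

```python
import math


def ocr_error_probability(first_similarity, ocr_confidence,
                          w_similarity=-4.0, w_confidence=-3.0, bias=3.5):
    """Estimate the probability that an OCR output character is wrong.

    Applies a logistic (sigmoid) function to a linear combination of the first
    similarity and the OCR confidence. The default weights and bias are
    hypothetical; in practice they would come from training on labeled data.
    """
    z = w_similarity * first_similarity + w_confidence * ocr_confidence + bias
    return 1.0 / (1.0 + math.exp(-z))


# Low similarity and low confidence should yield a higher error probability
# than high similarity and high confidence under these placeholder weights.
risky = ocr_error_probability(0.2, 0.3)
safe = ocr_error_probability(0.9, 0.95)
print(risky > safe)  # True
```

Negative weights encode the expectation that a character resembling its image and recognized with high confidence is less likely to be an OCR error; the sign and size of trained weights would depend on the data.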

[0210] Based on the above embodiments, optionally, the reference attribute information for the target proofreading task is determined, including:

[0211] At least one fourth reference character is determined from a plurality of second reference characters, each of the fourth reference characters satisfying the following condition: after the target proofreading task is performed, the fourth reference character differs from the proofread output character corresponding to that fourth reference character, and the plurality of second reference characters include either all characters in the target character recognition result corresponding to the target proofreading task or the characters in that result other than the first reference characters;

[0212] Determine the correction doubt result for each of the fourth reference characters, the correction doubt result being used to indicate whether there is any doubt about the correctness of the correction when performing text correction on the fourth reference character in the target proofreading task;

[0213] The third attribute information in the reference attribute information of the target proofreading task is determined based on the correction doubt results of each of the fourth reference texts.

[0214] Based on the above embodiments, optionally, determining the correction doubt result for each of the fourth reference characters includes:

[0215] The text attribute data of each fourth reference character is determined. The text attribute data of the fourth reference character includes the reference character image to which the fourth reference character belongs, the fourth reference character, and the proofread output text corresponding to the fourth reference character. The reference character image to which the fourth reference character belongs is the text image that is input to optical character recognition to obtain the fourth reference character.

[0216] The correction doubt result for each of the fourth reference characters is determined based on the text attribute data of each of the fourth reference characters.

[0217] Based on the above embodiments, optionally, determining the correction doubt result for each fourth reference character according to the text attribute data of each fourth reference character includes:

[0218] A third similarity and a fourth similarity are determined for each of the fourth reference characters. The third similarity is determined based on the image similarity between each of the fourth reference characters and the reference character image to which the fourth reference character belongs. The fourth similarity is determined based on the image similarity between the text output after the fourth reference character is proofread and the reference character image to which the fourth reference character belongs.

[0219] If the third similarity is greater than the fourth similarity, or the fourth similarity is less than a preset similarity threshold, it is determined that the correctness of the text correction applied to the fourth reference character in the target proofreading task is in doubt.
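
As an illustration only, the doubt rule in [0218] and [0219] can be sketched in Python as follows. The class name `CorrectedChar`, the function names, the default threshold of 0.5, and the use of a simple ratio to aggregate per-character results into the third attribute information are all assumptions made for this sketch; the disclosure does not fix them, and the similarity values are taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class CorrectedChar:
    """One fourth reference character, i.e. an OCR character the proofreader changed."""
    ocr_char: str             # character produced by OCR
    proofread_char: str       # character output after proofreading
    third_similarity: float   # image similarity: ocr_char vs. its source character image
    fourth_similarity: float  # image similarity: proofread_char vs. the same image


def is_doubtful(c: CorrectedChar, threshold: float = 0.5) -> bool:
    """Doubt rule: the OCR character matches the image better than the correction
    does, or the correction's similarity falls below the (assumed) threshold."""
    return c.third_similarity > c.fourth_similarity or c.fourth_similarity < threshold


def doubt_ratio(chars: list[CorrectedChar], threshold: float = 0.5) -> float:
    """Hypothetical aggregate: share of corrections flagged as doubtful."""
    if not chars:
        return 0.0
    return sum(is_doubtful(c, threshold) for c in chars) / len(chars)
```

With this rule, a correction is flagged either because the original OCR character looks closer to the source image than the replacement, or because the replacement itself resembles the image too little to be trusted.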

[0220] In the technical solution of this disclosure, the first attribute information is calculated from the preset erroneous text and the proofread output text obtained after the target proofreading task is performed on that preset erroneous text, which quantifies how well errors are detected and corrected during proofreading. This supports targeted improvement of the proofreading task, raising text correction accuracy and avoiding missed errors that should have been corrected. For example, a high text correction rate indicates that the proofreading task corrects errors effectively, while a low rate suggests that the proofreading was careless or lacked capability. The second attribute information indicates text correction omissions when the target proofreading task is performed on multiple second reference texts, clarifying whether errors remain undetected and uncorrected across the proofreading task. This reduces uncorrected errors in actual proofreading and improves the completeness and accuracy of the results. In practical applications, overlooking critical errors can have serious consequences; identifying these omissions allows timely supplementary proofreading, improving the completeness and reliability of the proofreading. The third attribute information indicates text correction doubts when text correction is performed on multiple second reference texts, improving the accuracy of the proofreading task. It helps avoid retaining questionable but actually erroneous text as correct, or mistakenly treating correct text as incorrect and changing it unnecessarily. It also shows how uncertain text was handled during proofreading, so that the handling can be assessed and improper handling of questionable text does not degrade the quality and credibility of the corrections. By using at least one of the above types of attribute information, quality detection for the proofreading task can be completed automatically and systematically, allowing targeted optimization of the proofreading task, continuous improvement of the performance and accuracy of the overall text recognition and proofreading system, and a lower error rate.
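
A minimal Python sketch of how the first attribute information might be computed and then combined with the other attributes is given below. It treats a preset erroneous character as corrected when its proofread output differs from it, as the claims describe. The function names, the threshold values, and the idea of flagging a task when any single attribute crosses its threshold are assumptions for illustration, not a rule stated in this disclosure.

```python
from typing import Optional


def text_correction_rate(preset_errors: list[str], proofread_outputs: list[str]) -> float:
    """Share of preset erroneous characters whose proofread output differs from
    the preset error, i.e. that count as corrected."""
    if len(preset_errors) != len(proofread_outputs):
        raise ValueError("each preset error needs exactly one proofread output")
    if not preset_errors:
        return 0.0
    corrected = sum(e != o for e, o in zip(preset_errors, proofread_outputs))
    return corrected / len(preset_errors)


def is_low_quality(
    correction_rate: Optional[float] = None,
    omission_score: Optional[float] = None,
    doubt_score: Optional[float] = None,
    *,
    min_rate: float = 0.8,       # assumed threshold
    max_omission: float = 0.1,   # assumed threshold
    max_doubt: float = 0.1,      # assumed threshold
) -> bool:
    """Hypothetical decision rule: flag the task if any supplied attribute
    crosses its threshold. Attributes left as None are ignored, since the
    method only requires at least one of the three."""
    checks = [
        correction_rate is not None and correction_rate < min_rate,
        omission_score is not None and omission_score > max_omission,
        doubt_score is not None and doubt_score > max_doubt,
    ]
    return any(checks)
```

For instance, if three of four preset errors are changed by the proofreader, the correction rate is 0.75, which falls below the assumed 0.8 threshold and flags the task for review.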

[0221] The proofreading task detection device provided in this disclosure can execute the proofreading task detection method provided in any embodiment of this disclosure, and has the corresponding functional modules and beneficial effects for executing the proofreading task detection method.

[0222] It is worth noting that the various units and modules included in the above-mentioned device are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the specific names of each functional unit are only for easy differentiation and are not used to limit the protection scope of the embodiments of this disclosure.

[0223] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure. Referring to Figure 11, it shows the structure of an electronic device 1200 (e.g., a terminal device or a server) suitable for implementing embodiments of this disclosure. The terminal device in this embodiment may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 11 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0224] As shown in Figure 11, the electronic device 1200 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing device 1201, ROM 1202, and RAM 1203 are interconnected via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

[0225] Typically, the following devices can be connected to the I/O interface 1205: input devices 1206 including, for example, a touchscreen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1207 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1208 including, for example, magnetic tape, hard disk, etc.; and communication devices 1209. The communication device 1209 allows the electronic device 1200 to communicate with other devices over wireless or wired connections to exchange data. Although Figure 11 shows an electronic device 1200 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.

[0226] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 1209, or installed from storage device 1208, or installed from ROM 1202. When the computer program is executed by processing device 1201, it performs the functions defined in the methods of embodiments of this disclosure.

[0227] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0228] The electronic device provided in this embodiment and the proofreading task detection method provided in the above embodiments belong to the same inventive concept. Technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.

[0229] This disclosure provides a computer storage medium storing a computer program that, when executed by a processor, implements the proofreading task detection method provided in the above embodiments.

[0230] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0231] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0232] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.

[0233] The aforementioned computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, they cause the electronic device to: determine reference attribute information for a target proofreading task, the reference attribute information including at least one of first attribute information, second attribute information, and third attribute information; the first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task, each preset erroneous text being formed by applying a preset error to a first reference text; the second attribute information is used to indicate the text correction omissions when performing text correction on multiple second reference texts; the third attribute information is used to indicate the text correction doubts when performing text correction on multiple second reference texts; the first reference text and the second reference text are texts in the target text recognition results obtained by optical character recognition corresponding to the target proofreading task; and perform proofreading task quality detection on the target proofreading task based on the reference attribute information.

[0234] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0235] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0236] The units described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".

[0237] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0238] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0239] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, the scope also covers technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) this disclosure.

[0240] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0241] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. A proofreading task detection method, characterized in that the method includes: determining reference attribute information for a target proofreading task, the reference attribute information including at least one of first attribute information, second attribute information, and third attribute information; the first attribute information is used to indicate the text correction rate when the target proofreading task is performed on at least one preset erroneous text corresponding to the target proofreading task, each preset erroneous text being formed by applying a preset error to a first reference text; the second attribute information is used to indicate text correction omissions when the target proofreading task is performed on multiple second reference texts; the third attribute information is used to indicate text correction doubts when the target proofreading task is performed on multiple second reference texts; the first reference text and the second reference text are texts in the target text recognition result obtained by optical character recognition corresponding to the target proofreading task; and performing proofreading task quality detection on the target proofreading task based on the reference attribute information.

2. The method according to claim 1, characterized in that, Determine the reference attribute information for the target proofreading task, including: Determine the preset error set and the proofreading output set corresponding to the target proofreading task. The preset error set includes at least one preset error text for performing the target proofreading task. The proofreading output set includes the proofreading output text corresponding to each preset error text after the target proofreading task is performed on at least one preset error text in the preset error set. The first attribute information in the reference attribute information of the target proofreading task is determined based on the preset error set and the proofreading output set. The first attribute information is used to indicate the text correction rate for each preset error character in the preset error set when the target proofreading task is executed to correct the text of each preset error character included in the preset error set to generate the proofreading output set.

3. The method according to claim 1, characterized in that, Determine the preset error set and proofreading output set corresponding to the target proofreading task, including: At least one first reference character is obtained from the target character recognition result corresponding to the target proofreading task, and the optical character recognition confidence level corresponding to the first reference character is greater than the preset recognition confidence level. By performing similar-looking character replacement on each of the at least one first reference characters, a preset error character associated with each first reference character is obtained, so as to obtain the preset error set.

4. The method according to claim 2, characterized in that, The process of replacing each of the first reference characters included in the at least one first reference character with a similar-looking character includes: For each first reference character, obtain a list of similar characters for the first reference character from the similar character mapping table. The characters included in the list of similar characters for the first reference character satisfy a preset character shape similarity with the first reference character. Replace the first reference text with the text at the random index position in the list of similar-looking characters.

5. The method according to claim 2, characterized in that, after performing similar-looking character replacement on each of the at least one first reference characters to obtain a preset error character associated with each first reference character, the method further includes: binding a preset text attribute to the preset error text obtained by replacing each first reference character with a similar-looking character. The preset text attribute is used to filter and extract, from the at least one preset error text corresponding to the target proofreading task, each preset error text that participates in the text correction rate evaluation.

6. The method according to claim 2, characterized in that, The first attribute information in the reference attribute information for determining the target proofreading task based on the preset error set and the proofreading output set includes: The preset error text in the preset error set is compared with the corresponding proof output text in the proof output set after proofreading. Based on the text comparison results between the preset error text and the corresponding output text after the preset error text is proofread, the text correction status of each preset error text in the preset error set is determined. The text correction status of the preset error set is used to describe whether the preset error text in the preset error set is corrected when the target proofreading task is performed on the preset error set to generate the proofreading output set. The first attribute information in the reference attribute information of the target proofreading task is determined based on the text correction status of each preset error text in the preset error set.

7. The method according to claim 6, characterized in that, Based on the text comparison results between the preset erroneous text and the corresponding proofread output text after proofreading, the text correction status of each preset erroneous text in the preset error set is determined, including: If the preset error text is the same as the corresponding proofreading output text after the preset error text is proofread, then it is determined that the preset error text in the preset error set was not corrected when the target proofreading task was performed on the preset error set to generate the proofreading output set. If the preset error text is different from the corresponding proofreading output text after the preset error text is proofread, it is determined that the preset error text in the preset error set has been corrected when the target proofreading task is executed on the preset error set to generate the proofreading output set.

8. The method according to claim 1, characterized in that determining the reference attribute information for the target proofreading task includes: determining at least one third reference character from a plurality of second reference characters, each of the third reference characters satisfying the following condition: after the target proofreading task is performed, the third reference character is the same as its corresponding proofread output character, and the plurality of second reference characters include either all characters in the target character recognition result corresponding to the target proofreading task, or the characters remaining in that result after the first reference characters are excluded; determining the optical character recognition error probability associated with each of the third reference characters, wherein the optical character recognition error probability of a third reference character indicates the probability that an optical character recognition error occurred when the third reference character was obtained by optical character recognition of the text image; and determining the second attribute information in the reference attribute information of the target proofreading task based on the optical character recognition error probability associated with each of the third reference characters.

9. The method according to claim 8, characterized in that determining the optical character recognition error probability associated with each of the third reference characters includes: determining the text attribute data of each third reference character, the text attribute data of a third reference character including the reference character image to which the third reference character belongs, the third reference character itself, and the optical character recognition confidence of the third reference character, wherein the reference character image to which the third reference character belongs is the text image from which the third reference character is obtained by optical character recognition; and determining, based on the text attribute data of each of the third reference characters, the optical character recognition error probability associated with each of the third reference characters.

10. The method according to claim 9, characterized in that, Based on the text attribute data of each of the third reference characters, the optical character recognition error probability associated with each of the third reference characters is determined, including: A first similarity is determined for each of the third reference characters, the first similarity being determined based on the image similarity between each of the third reference characters and the reference character image to which the third reference character belongs; The optical character recognition error probability associated with each of the third reference characters is determined based on the first similarity associated with each of the third reference characters and the optical character recognition confidence of each of the third reference characters.

11. The method according to claim 10, characterized in that determining the first similarity associated with each of the third reference characters includes: determining the font glyphs associated with each of the third reference characters, and determining the font text images associated with each of the third reference characters based on those font glyphs; determining the second similarity between the reference character image to which the third reference character belongs and each font text image associated with the third reference character; and determining the first similarity of each third reference character based on the second similarity.

12. The method according to claim 10, characterized in that, Based on the first similarity associated with each of the third reference characters and the optical character recognition confidence level of each of the third reference characters, the optical character recognition error probability associated with each of the third reference characters is determined, including: The first similarity associated with each of the third reference characters and the optical character recognition confidence of each of the third reference characters are input into the character recognition error prediction model. The character recognition error prediction model is trained based on a binary classification logistic regression model and is used to determine whether the characters output by optical character recognition of text images are correct. The probability of optical character recognition error associated with each of the third reference characters is determined by a character recognition error prediction model.

13. The method according to claim 1, characterized in that determining the reference attribute information for the target proofreading task includes: determining at least one fourth reference character from a plurality of second reference characters, each of the fourth reference characters satisfying the following condition: after the target proofreading task is performed, the fourth reference character differs from its corresponding proofread output character, and the plurality of second reference characters include either all characters in the target character recognition result corresponding to the target proofreading task, or the characters remaining in that result after the first reference characters are excluded; determining the correction doubt result for each of the fourth reference characters, the correction doubt result being used to indicate whether there is any doubt about the correctness of the correction when performing text correction on the fourth reference character in the target proofreading task; and determining the third attribute information in the reference attribute information of the target proofreading task based on the correction doubt result of each of the fourth reference characters.

14. The method according to claim 13, characterized in that determining the correction doubt result for each of the fourth reference characters includes: determining the text attribute data of each fourth reference character, the text attribute data of a fourth reference character including the reference character image to which the fourth reference character belongs, the fourth reference character itself, and the proofread output character corresponding to the fourth reference character, wherein the reference character image to which the fourth reference character belongs is the text image from which the fourth reference character is obtained by optical character recognition; and determining the correction doubt result for each of the fourth reference characters based on the text attribute data of that fourth reference character.

15. The method according to claim 14, characterized in that determining the correction doubt result for each of the fourth reference characters based on the text attribute data of each fourth reference character includes: determining a third similarity and a fourth similarity for each of the fourth reference characters, the third similarity being determined based on the image similarity between the fourth reference character and the reference character image to which the fourth reference character belongs, and the fourth similarity being determined based on the image similarity between the proofread output character of the fourth reference character and the reference character image to which the fourth reference character belongs; and, if the third similarity is greater than the fourth similarity or the fourth similarity is less than a preset similarity threshold, determining that the correctness of the text correction applied to the fourth reference character in the target proofreading task is in doubt.

16. A proofreading task detection device, characterized in that the device includes: a determination module, used to determine reference attribute information for a target proofreading task, the reference attribute information including at least one of first attribute information, second attribute information, and third attribute information; the first attribute information is used to indicate the text correction rate when performing text correction on at least one preset erroneous text corresponding to the target proofreading task, each preset erroneous text being formed by applying a preset error to a first reference text; the second attribute information is used to indicate text correction omissions when performing text correction on multiple second reference texts; the third attribute information is used to indicate text correction doubts when performing text correction on multiple second reference texts; the first reference text and the second reference text are texts in the target text recognition result obtained by optical character recognition corresponding to the target proofreading task; and a detection module, used to perform proofreading task quality detection on the target proofreading task based on the reference attribute information.

17. An electronic device, characterized in that, The electronic device includes: One or more processors; Storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the proofreading task detection method as described in any one of claims 1-15.

18. A storage medium containing computer-executable instructions, characterized in that, The computer-executable instructions, when executed by a computer processor, are used to perform the proofreading task detection method as described in any one of claims 1-15.